5.1 Azure AI Search — Fundamentals and Architecture
Key Takeaways
- Azure AI Search (formerly Azure Cognitive Search) provides full-text search, AI enrichment, vector search, and semantic ranking for enterprise data.
- The core architecture consists of: data sources, indexers, skillsets (AI enrichment), indexes, and queries.
- Indexers automatically pull data from supported data sources (Azure Blob Storage, SQL Database, Cosmos DB, Table Storage) on a schedule.
- Skillsets define the AI enrichment pipeline — a sequence of built-in or custom skills that extract insights during indexing.
- Vector search enables similarity-based retrieval using embeddings, which is essential for RAG (Retrieval-Augmented Generation) patterns.
Quick Answer: Azure AI Search provides full-text search with AI enrichment. The pipeline is: Data Source → Indexer → Skillset (AI enrichment) → Index → Query. Indexers pull data from Azure storage, SQL, and Cosmos DB. Skillsets enrich data with OCR, NER, key phrases, and custom skills. Vector search enables RAG patterns.
Azure AI Search Architecture
[Data Sources]        [AI Enrichment]           [Search]
┌─────────────┐     ┌─────────────────┐     ┌──────────────┐
│ Blob Storage│     │    Skillset     │     │ Search Index │
│ SQL Database│ ──▶ │ ┌─────────────┐ │ ──▶ │ ┌──────────┐ │ ──▶ [Client]
│ Cosmos DB   │     │ │ OCR         │ │     │ │ Full-text│ │
│ Table Store │     │ │ NER         │ │     │ │ Vector   │ │
└─────────────┘     │ │ Key Phrases │ │     │ │ Semantic │ │
   [Indexer]        │ │ Custom Skill│ │     │ └──────────┘ │
                    │ └─────────────┘ │     └──────────────┘
                    └─────────────────┘
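The enrichment stage in the diagram can be sketched as a skillset definition. The following is a minimal, illustrative payload built as a Python dictionary, assuming the built-in OCR, entity recognition, and key phrase skills; the skillset name and the `ocrText` target name are hypothetical, and a real pipeline would usually add a merge step to combine OCR text with document content. The helper only shows where each skill writes its output in the enrichment tree.

```python
import json

# Hypothetical skillset definition; names are illustrative, not prescribed.
skillset = {
    "name": "my-ai-skillset",
    "description": "OCR, entity recognition, and key phrase extraction",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
            "outputs": [{"name": "text", "targetName": "ocrText"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
            "context": "/document",
            "categories": ["Organization"],
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "organizations", "targetName": "organizations"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
    ],
}


def output_paths(definition):
    """Return the enrichment-tree path each skill output is written to."""
    paths = []
    for skill in definition["skills"]:
        for output in skill["outputs"]:
            paths.append(f"{skill['context']}/{output['targetName']}")
    return paths


print(json.dumps(output_paths(skillset), indent=2))
```

Paths such as `/document/organizations` are what an indexer's output field mappings refer to when routing enrichments into index fields.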
Core Components
1. Data Sources
Supported data sources for automatic indexing:
| Data Source | Connector | Best For |
|---|---|---|
| Azure Blob Storage | Built-in | Documents, images, PDFs |
| Azure SQL Database | Built-in | Structured relational data |
| Azure Cosmos DB | Built-in | NoSQL document data |
| Azure Table Storage | Built-in | Key-value data |
| Azure Data Lake Gen2 | Built-in | Large-scale data lakes |
| SharePoint (Microsoft 365) | Built-in (preview) | Enterprise documents |
2. Indexers
Indexers automate data ingestion:
{
  "name": "my-blob-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-search-index",
  "skillsetName": "my-ai-skillset",
  "schedule": {
    "interval": "PT2H"
  },
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "imageAction": "generateNormalizedImages",
      "parsingMode": "default"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "documentName"
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/organizations",
      "targetFieldName": "organizations"
    }
  ]
}
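To make the two mapping lists concrete, here is a toy model of how values are routed into an index document. It is a sketch only and not the service's implementation: `apply_mappings` and `resolve_path` are hypothetical helpers, and the sample document and enrichment tree are invented for illustration. Field mappings read from the source document, while output field mappings read from a path in the enriched document tree.

```python
def resolve_path(tree, path):
    """Walk a '/document/...' path through a nested dict of enrichments."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


def apply_mappings(source_doc, enriched_tree, field_mappings, output_field_mappings):
    """Build an index document from source fields and enrichment outputs."""
    index_doc = {}
    # fieldMappings: source field -> index field (no enrichment involved)
    for mapping in field_mappings:
        index_doc[mapping["targetFieldName"]] = source_doc[mapping["sourceFieldName"]]
    # outputFieldMappings: enrichment-tree path -> index field
    for mapping in output_field_mappings:
        index_doc[mapping["targetFieldName"]] = resolve_path(
            enriched_tree, mapping["sourceFieldName"]
        )
    return index_doc


# Invented sample data for illustration.
source_doc = {
    "metadata_storage_name": "report.pdf",
    "content": "Contoso acquired Fabrikam.",
}
enriched_tree = {
    "document": {
        "content": source_doc["content"],
        "organizations": ["Contoso", "Fabrikam"],
    }
}

index_doc = apply_mappings(
    source_doc,
    enriched_tree,
    [{"sourceFieldName": "metadata_storage_name", "targetFieldName": "documentName"}],
    [{"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"}],
)
print(index_doc)
# {'documentName': 'report.pdf', 'organizations': ['Contoso', 'Fabrikam']}
```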
Key Indexer Settings
| Setting | Description |
|---|---|
| schedule.interval | How often to run (e.g., PT2H = every 2 hours) |
| dataToExtract | "storageMetadata", "allMetadata", or "contentAndMetadata" (default) |
| imageAction | Set to "generateNormalizedImages" to extract embedded images so OCR and image skills can process them |
| parsingMode | "default", "json", "jsonArray", "jsonLines", "delimitedText" |
| fieldMappings | Map source fields to index fields (before enrichment) |
| outputFieldMappings | Map enrichment outputs to index fields (after enrichment) |
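The schedule interval is an ISO 8601 duration. As a rough sketch, the helper below converts values such as `PT2H` into a `timedelta` and checks them against bounds of 5 minutes and 24 hours. Both function names are hypothetical, and the bounds are an assumption based on the documented scheduling limits, so verify them against the current service documentation.

```python
import re
from datetime import timedelta

# Supports the day/hour/minute/second subset of ISO 8601 durations.
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def parse_interval(value):
    """Convert a duration such as 'PT2H' or 'P1D' into a timedelta."""
    match = _DURATION.match(value)
    # Reject strings with a designator but no component, e.g. 'P', 'PT', 'P1DT'.
    if not match or value.endswith(("P", "T")):
        raise ValueError(f"Unsupported duration: {value!r}")
    days, hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def check_schedule(value):
    """Parse the interval and enforce the assumed 5-minute to 24-hour range."""
    interval = parse_interval(value)
    if not timedelta(minutes=5) <= interval <= timedelta(days=1):
        raise ValueError(f"Interval out of range: {value!r}")
    return interval


print(check_schedule("PT2H"))  # 2:00:00
```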
3. Search Index
The search index defines the schema for searchable content:
{
  "name": "my-search-index",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true, "filterable": true},
    {"name": "content", "type": "Edm.String", "searchable": true, "analyzer": "en.microsoft"},
    {"name": "title", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true},
    {"name": "organizations", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true},
    {"name": "keyPhrases", "type": "Collection(Edm.String)", "searchable": true, "filterable": true},
    {"name": "language", "type": "Edm.String", "filterable": true},
    {"name": "contentVector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "my-vector-profile"}
  ]
}
Field Attributes
| Attribute | Description | Use Case |
|---|---|---|
| searchable | Full-text searchable with analyzer | Text content for keyword search |
| filterable | Can be used in $filter expressions | Category filtering, date ranges |
| sortable | Can be used in $orderby expressions | Sort by date, relevance, name |
| facetable | Can be used for faceted navigation | Filter sidebar (by category, author) |
| retrievable | Returned in search results | Fields to display to the user |
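One way to see what these attributes control is to check a planned query against a schema before sending it. The sketch below uses a hypothetical `check_query` helper and a trimmed, invented schema in which every attribute is spelled out explicitly, because the service applies its own defaults to attributes that are omitted.

```python
# Trimmed, invented schema with every attribute set explicitly.
schema = [
    {"name": "title", "filterable": True, "sortable": True, "facetable": False},
    {"name": "organizations", "filterable": True, "sortable": False, "facetable": True},
    {"name": "keyPhrases", "filterable": True, "sortable": False, "facetable": False},
    {"name": "language", "filterable": True, "sortable": False, "facetable": False},
]


def check_query(schema, filter_fields=(), orderby_fields=(), facet_fields=()):
    """Return a list of problems for fields used without the needed attribute."""
    by_name = {field["name"]: field for field in schema}
    required = {
        "filterable": filter_fields,   # fields used in $filter
        "sortable": orderby_fields,    # fields used in $orderby
        "facetable": facet_fields,     # fields used for facets
    }
    problems = []
    for attribute, names in required.items():
        for name in names:
            if name not in by_name:
                problems.append(f"{name} is not in the index")
            elif not by_name[name].get(attribute, False):
                problems.append(f"{name} is not {attribute}")
    return problems


print(check_query(
    schema,
    filter_fields=["language"],
    orderby_fields=["title"],
    facet_fields=["organizations", "keyPhrases"],
))
# ['keyPhrases is not facetable']
```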
Vector Search
Vector search uses embeddings (numerical representations of text) to find semantically similar content:
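As a small illustration of the idea, the sketch below ranks three toy documents by cosine similarity to a query vector and returns the closest `k`. The three-dimensional vectors and document names are invented; real embeddings have hundreds or thousands of dimensions, and the service typically uses an approximate algorithm (HNSW) rather than this exhaustive comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, documents, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Invented toy embeddings for illustration.
documents = {
    "ml-intro": [0.9, 0.1, 0.0],
    "deep-learning": [0.8, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

print(nearest(query, documents, k=2))
# ['ml-intro', 'deep-learning']
```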
Creating a Vector Search Index
{
  "name": "my-vector-index",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true},
    {"name": "content", "type": "Edm.String", "searchable": true},
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 1536,
      "vectorSearchProfile": "my-vector-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "my-hnsw-algo",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        }
      }
    ],
    "profiles": [
      {
        "name": "my-vector-profile",
        "algorithm": "my-hnsw-algo"
      }
    ]
  }
}
Vector Query
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://my-search.search.windows.net",
    index_name="my-vector-index",
    credential=AzureKeyCredential("<your-key>"),
)

# Generate an embedding for the query. get_embedding is a placeholder for your
# own embedding call (for example, an OpenAI embedding model); it must return a
# list of floats whose length matches the field's "dimensions" (1536 here).
query_embedding = get_embedding("What is machine learning?")

# Vector-only search: no search text, one vector query against contentVector
results = search_client.search(
    search_text=None,
    vector_queries=[
        VectorizedQuery(
            vector=query_embedding,
            k_nearest_neighbors=5,  # return the 5 closest vectors
            fields="contentVector",
        )
    ],
)

for result in results:
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content'][:200]}")
Semantic Ranking
Semantic ranking uses a language model to re-rank search results by understanding query intent:
| Feature | Description |
|---|---|
| Semantic ranker | Re-ranks top results using deep learning for relevance |
| Semantic captions | Extracts the most relevant passage from each result |
| Semantic answers | Extracts a direct answer to the query from the top results |
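The three features in the table map onto parameters of a search request. The following is a minimal sketch of a REST request body built in Python; the `semantic_query_body` helper and the configuration name `my-semantic-config` are hypothetical, and the index is assumed to define a semantic configuration with that name, which the index examples in this section do not include.

```python
import json


def semantic_query_body(text, configuration, top=5, answers=3):
    """Build a request body that asks for semantic ranking, captions, and answers."""
    return {
        "search": text,
        "queryType": "semantic",                   # enable the semantic ranker
        "semanticConfiguration": configuration,    # must exist on the index
        "captions": "extractive",                  # semantic captions
        "answers": f"extractive|count-{answers}",  # semantic answers
        "top": top,
    }


body = semantic_query_body("What is vector search?", "my-semantic-config")
print(json.dumps(body, indent=2))
```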
On the Exam: Know the difference between keyword search (BM25 scoring over text), vector search (similarity over embeddings), and semantic ranking (a re-ranking step applied to the top results). For RAG, hybrid search (keyword + vector) combined with semantic ranking typically gives the most relevant results.
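Hybrid search merges the keyword and vector result lists into one ranking, and Azure AI Search documents Reciprocal Rank Fusion (RRF) as the method it uses. The sketch below is a generic RRF implementation, not the service's code; the constant `k=60` is an assumption based on the commonly cited default, and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented result lists for illustration.
keyword_results = ["doc-a", "doc-b", "doc-c"]  # e.g. BM25 order
vector_results = ["doc-c", "doc-a", "doc-d"]   # e.g. nearest-neighbor order

print(reciprocal_rank_fusion([keyword_results, vector_results]))
# ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

Documents that appear near the top of both lists rise in the fused ranking, which is why hybrid queries tend to surface results that match on both exact terms and meaning.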
What is the correct order of components in the Azure AI Search indexing pipeline?
Which indexer parameter must be set to enable OCR processing of images embedded in documents?
What does semantic ranking do in Azure AI Search?
In a vector search query, what does the k_nearest_neighbors parameter specify?