5.1 Azure AI Search — Fundamentals and Architecture

Key Takeaways

  • Azure AI Search (formerly Azure Cognitive Search) provides full-text search, AI enrichment, vector search, and semantic ranking for enterprise data.
  • The core architecture consists of: data sources, indexers, skillsets (AI enrichment), indexes, and queries.
  • Indexers automatically pull data from supported data sources (Azure Blob Storage, SQL Database, Cosmos DB, Table Storage) on a schedule.
  • Skillsets define the AI enrichment pipeline — a sequence of built-in or custom skills that extract insights during indexing.
  • Vector search enables similarity-based retrieval using embeddings, which is essential for RAG (Retrieval-Augmented Generation) patterns.
Last updated: March 2026

Quick Answer: Azure AI Search provides full-text search with AI enrichment. The pipeline is: Data Source → Indexer → Skillset (AI enrichment) → Index → Query. Indexers pull data from Azure storage, SQL, and Cosmos DB. Skillsets enrich data with OCR, NER, key phrases, and custom skills. Vector search enables RAG patterns.

Azure AI Search Architecture

[Data Sources]           [AI Enrichment]           [Search]
┌─────────────┐     ┌─────────────────┐     ┌──────────────┐
│ Blob Storage│     │   Skillset      │     │ Search Index │
│ SQL Database│ ──▶ │ ┌─────────────┐ │ ──▶ │ ┌──────────┐ │ ──▶ [Client]
│ Cosmos DB   │     │ │ OCR         │ │     │ │ Full-text│ │
│ Table Store │     │ │ NER         │ │     │ │ Vector   │ │
└─────────────┘     │ │ Key Phrases │ │     │ │ Semantic │ │
    [Indexer]       │ │ Custom Skill│ │     │ └──────────┘ │
                    │ └─────────────┘ │     └──────────────┘
                    └─────────────────┘
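To make the flow in the diagram concrete, the following is a toy in-memory sketch of the same stages in Python. It is not how the service is implemented: the function names (run_indexer, key_phrase_skill, query) and the naive "key phrase" rule are hypothetical stand-ins chosen only to show the order in which data moves.

```python
from collections import defaultdict

# Toy stand-ins for each pipeline stage. All names are hypothetical and
# exist only to illustrate the order: source -> indexer -> skillset -> index -> query.

def key_phrase_skill(doc):
    """Fake 'skill': treat words longer than 6 characters as key phrases."""
    words = [w.strip(".,").lower() for w in doc["content"].split()]
    return {**doc, "keyPhrases": sorted({w for w in words if len(w) > 6})}

def run_indexer(data_source, skillset):
    """Pull each document, pass it through every skill, then index it."""
    index = {"docs": {}, "inverted": defaultdict(set)}
    for doc in data_source:
        for skill in skillset:
            doc = skill(doc)
        index["docs"][doc["id"]] = doc
        for word in doc["content"].lower().split():
            index["inverted"][word.strip(".,")].add(doc["id"])
    return index

def query(index, term):
    """Look up a single term and return matching documents ordered by id."""
    ids = sorted(index["inverted"].get(term.lower(), set()))
    return [index["docs"][i] for i in ids]

data_source = [
    {"id": "1", "content": "Contoso publishes quarterly earnings."},
    {"id": "2", "content": "Fabrikam earnings call transcript."},
]
index = run_indexer(data_source, skillset=[key_phrase_skill])
hits = query(index, "earnings")
print([h["id"] for h in hits])
```

The point of the sketch is the ordering: enrichment happens inside the indexer run, before documents land in the index, so enriched fields such as keyPhrases are available at query time.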

Core Components

1. Data Sources

Supported data sources for automatic indexing:

Data Source            Connector   Best For
---------------------  ----------  ---------------------------
Azure Blob Storage     Built-in    Documents, images, PDFs
Azure SQL Database     Built-in    Structured relational data
Azure Cosmos DB        Built-in    NoSQL document data
Azure Table Storage    Built-in    Key-value data
Azure Data Lake Gen2   Built-in    Large-scale data lakes
SharePoint             Built-in    Enterprise documents

2. Indexers

Indexers automate data ingestion:

{
    "name": "my-blob-indexer",
    "dataSourceName": "my-blob-datasource",
    "targetIndexName": "my-search-index",
    "skillsetName": "my-ai-skillset",
    "schedule": {
        "interval": "PT2H"
    },
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",
            "parsingMode": "default"
        }
    },
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_name",
            "targetFieldName": "title"
        }
    ],
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/organizations",
            "targetFieldName": "organizations"
        }
    ]
}
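The two mapping arrays in the definition above do different jobs: fieldMappings copy values straight from the source document, while outputFieldMappings read values from the enrichment tree produced by the skillset. The sketch below imitates that behavior on plain dictionaries. The helper names (resolve_path, apply_mappings) and the sample data are hypothetical, and the real service supports more than this (for example, mapping functions), so treat it as an illustration only.

```python
def resolve_path(enriched, path):
    """Walk a '/document/...' style path through a nested dict."""
    node = enriched
    for part in path.strip("/").split("/"):
        node = node[part]
    return node

def apply_mappings(source_doc, enriched, field_mappings, output_field_mappings):
    """Build the index document from source fields and enrichment outputs."""
    index_doc = {}
    # fieldMappings: source field -> index field, no enrichment involved
    for m in field_mappings:
        index_doc[m["targetFieldName"]] = source_doc[m["sourceFieldName"]]
    # outputFieldMappings: enrichment tree node -> index field
    for m in output_field_mappings:
        index_doc[m["targetFieldName"]] = resolve_path(enriched, m["sourceFieldName"])
    return index_doc

# Hypothetical sample data: one blob and the entities a skill extracted from it.
source_doc = {"metadata_storage_name": "report.pdf"}
enriched = {"document": {"organizations": ["Contoso", "Fabrikam"]}}

index_doc = apply_mappings(
    source_doc,
    enriched,
    field_mappings=[
        {"sourceFieldName": "metadata_storage_name", "targetFieldName": "title"}
    ],
    output_field_mappings=[
        {"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"}
    ],
)
print(index_doc)
```

Every targetFieldName has to exist in the target index schema, which is why the mapping targets and the index definition need to be kept in sync.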

Key Indexer Settings

Setting               Description
--------------------  ----------------------------------------------------------------
schedule.interval     How often to run, as an ISO 8601 duration (e.g., PT2H = every 2 hours)
dataToExtract         "storageMetadata", "allMetadata", or "contentAndMetadata"
imageAction           "generateNormalizedImages" to extract embedded images so OCR can process them
parsingMode           "default", "json", "jsonArray", "jsonLines", "delimitedText"
fieldMappings         Map source fields to index fields (before enrichment)
outputFieldMappings   Map enrichment outputs to index fields (after enrichment)
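Interval strings such as PT2H are durations in ISO 8601 notation, which can be hard to read at a glance. The sketch below parses a simplified subset of that notation (days, hours, minutes, seconds) into a timedelta. The helper names are hypothetical, and the 5-minute and 24-hour bounds are assumptions based on the service's documented schedule limits; verify them against the current documentation before relying on them.

```python
import re
from datetime import timedelta

# Simplified subset of ISO 8601 durations: PnDTnHnMnS (no weeks, months, or years).
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")

# Assumed schedule limits; check the service documentation for current values.
MIN_INTERVAL = timedelta(minutes=5)
MAX_INTERVAL = timedelta(days=1)

def parse_interval(value):
    """Convert a duration string such as 'PT2H' into a timedelta."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"Unsupported duration: {value!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

def is_allowed_interval(value):
    """Return True if the interval falls within the assumed schedule limits."""
    return MIN_INTERVAL <= parse_interval(value) <= MAX_INTERVAL

print(parse_interval("PT2H"))          # the interval used in the indexer above
print(is_allowed_interval("PT1M"))     # shorter than the assumed minimum
```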

3. Search Index

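The search index defines the schema for searchable content. Note that the contentVector field below references a vector search profile, which must be defined in the index's vectorSearch section (shown under Vector Search):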

{
    "name": "my-search-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": true, "filterable": true},
        {"name": "content", "type": "Edm.String", "searchable": true, "analyzer": "en.microsoft"},
        {"name": "title", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true},
        {"name": "organizations", "type": "Collection(Edm.String)", "searchable": true, "filterable": true, "facetable": true},
        {"name": "keyPhrases", "type": "Collection(Edm.String)", "searchable": true, "filterable": true},
        {"name": "language", "type": "Edm.String", "filterable": true},
        {"name": "contentVector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchProfile": "my-vector-profile"}
    ]
}

Field Attributes

Attribute     Description                            Use Case
------------  -------------------------------------  -------------------------------------
searchable    Full-text searchable with analyzer     Text content for keyword search
filterable    Can be used in $filter expressions     Category filtering, date ranges
sortable      Can be used in $orderby expressions    Sort by date, relevance, name
facetable     Can be used for faceted navigation     Filter sidebar (by category, author)
retrievable   Returned in search results             Fields to display to the user
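The attributes above determine which fields may appear in each part of a query. As a rough sketch, the helper below assembles a request body for the REST Search Documents API using the fields from the index definition in this section: language (filterable), organizations (filterable and facetable), and title (sortable). The function names are hypothetical; the body keys (search, filter, orderby, facets) follow the REST request format, and string literals in OData filters are quoted with single quotes, with embedded single quotes doubled.

```python
def odata_string(value):
    """Quote a string literal for an OData filter, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"

def build_search_request(text, language=None, organization=None,
                         order_by=None, facets=()):
    """Assemble a search request body from optional filter/sort/facet inputs."""
    filters = []
    if language:
        # Simple equality filter on a filterable Edm.String field
        filters.append(f"language eq {odata_string(language)}")
    if organization:
        # Collection fields are filtered with an any() lambda expression
        filters.append(
            f"organizations/any(o: o eq {odata_string(organization)})"
        )
    body = {"search": text}
    if filters:
        body["filter"] = " and ".join(filters)
    if order_by:
        body["orderby"] = order_by   # requires a sortable field
    if facets:
        body["facets"] = list(facets)  # requires facetable fields
    return body

request = build_search_request(
    "annual report",
    language="en",
    organization="O'Reilly",
    order_by="title asc",
    facets=["organizations"],
)
print(request)
```

Using a field in a clause its attributes do not allow (for example, sorting on a field that is not sortable) is rejected by the service, so the schema has to be designed with the intended queries in mind.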

Vector Search

Vector search uses embeddings (numerical vector representations of text) to find semantically similar content, even when the query and the documents share few or no keywords.

Creating a Vector Search Index

{
    "name": "my-vector-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": true},
        {"name": "content", "type": "Edm.String", "searchable": true},
        {
            "name": "contentVector",
            "type": "Collection(Edm.Single)",
            "searchable": true,
            "dimensions": 1536,
            "vectorSearchProfile": "my-vector-profile"
        }
    ],
    "vectorSearch": {
        "algorithms": [
            {
                "name": "my-hnsw-algo",
                "kind": "hnsw",
                "hnswParameters": {
                    "metric": "cosine",
                    "m": 4,
                    "efConstruction": 400,
                    "efSearch": 500
                }
            }
        ],
        "profiles": [
            {
                "name": "my-vector-profile",
                "algorithmConfigurationName": "my-hnsw-algo"
            }
        ]
    }
}
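The "metric": "cosine" setting above tells the algorithm how to compare two vectors. The sketch below shows what cosine similarity computes and how an exact (brute-force) nearest-neighbor search would use it. HNSW is an approximate method that avoids comparing the query against every vector, so this is only a conceptual reference, and the function names and tiny 2-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def brute_force_knn(query_vector, vectors, k):
    """Return the ids of the k vectors most similar to the query (exact search)."""
    scored = sorted(
        ((cosine_similarity(query_vector, vec), doc_id)
         for doc_id, vec in vectors.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical 2-dimensional "embeddings"; real ones have e.g. 1536 dimensions.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = brute_force_knn([1.0, 0.1], vectors, k=2)
print(nearest)
```

Because cosine similarity depends only on direction, not magnitude, two vectors pointing the same way score 1.0 regardless of length. The dimension check also mirrors a real constraint: query vectors must have the same number of dimensions as the indexed field.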

Vector Query

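from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import OpenAI  # assumption: embeddings come from the OpenAI Python SDK

search_client = SearchClient(
    endpoint="https://my-search.search.windows.net",
    index_name="my-vector-index",
    credential=AzureKeyCredential("<your-key>")
)

# Assumed helper (not defined in the original text). The model name is an
# assumption; text-embedding-3-small returns 1536-dimensional vectors by
# default, matching the "dimensions" value in the index definition.
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return response.data[0].embedding

# Generate embedding for the query
query_embedding = get_embedding("What is machine learning?")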

# Vector search
results = search_client.search(
    search_text=None,  # No text search, vector only
    vector_queries=[
        VectorizedQuery(
            vector=query_embedding,
            k_nearest_neighbors=5,
            fields="contentVector"
        )
    ]
)

for result in results:
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content'][:200]}")

Semantic Ranking

Semantic ranking uses a language model to re-rank search results by understanding query intent:

Feature             Description
------------------  ----------------------------------------------------------
Semantic ranker     Re-ranks top results using deep learning for relevance
Semantic captions   Extracts the most relevant passage from each result
Semantic answers    Extracts a direct answer to the query from the top results

On the Exam: Know the difference between keyword search (BM25), vector search (embeddings), and semantic ranking (re-ranking). For RAG, hybrid search (keyword + vector) combined with semantic ranking typically gives the most relevant results.
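Hybrid search has to merge two separately ranked result lists, one from keyword search and one from vector search. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its merging method. The sketch below shows the general RRF idea in plain Python; it is not the service's implementation, and the constant k=60 is a commonly used default taken here as an assumption. The function name and document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists, best match first.
keyword_hits = ["doc1", "doc2", "doc3"]   # e.g. from BM25
vector_hits = ["doc3", "doc1", "doc4"]    # e.g. from embedding similarity

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why hybrid search tends to be more robust than either method alone. Semantic ranking, when enabled, is a further re-ranking step applied to the top fused results.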

Test Your Knowledge

1. What is the correct order of components in the Azure AI Search indexing pipeline?

2. Which indexer parameter must be set to enable OCR processing of images embedded in documents?

3. What does semantic ranking do in Azure AI Search?

4. In a vector search query, what does the k_nearest_neighbors parameter specify?