3.2 Embeddings & Vector Representation

Key Takeaways

  • An embedding is a fixed-length numeric vector that encodes the meaning of text; semantically similar text maps to nearby vectors.
  • Databricks Foundation Model APIs host embedding models (for example GTE and BGE families) that you call from SQL, notebooks, or serving endpoints.
  • The query embedder and the index embedder must be the same model; switching embedding models forces a full reindex because old and new vectors are not comparable.
  • Normalize embeddings before indexing when you want cosine-similarity behavior, so similarity depends on direction rather than raw magnitude.
  • For storage-optimized Vector Search endpoints with precomputed embeddings, the embedding dimension must be evenly divisible by 16.

Last updated: July 2026

Turning Text into Vectors

Once documents are chunked, each chunk must become something a machine can compare for meaning. That representation is an embedding: a fixed-length vector of floating-point numbers that encodes the semantic content of a piece of text. The defining property is geometric — text with similar meaning maps to vectors that sit close together in the embedding space, while unrelated text lands far apart. This is what makes semantic search possible: instead of matching literal keywords, you embed the user's query, then find the chunk vectors nearest to it.

What an embedding actually is

An embedding model reads a string and outputs a vector of a fixed dimensionality — a specific length such as 384, 768, or 1024 numbers, determined entirely by the model. Every chunk you index and every query you run is projected into that same fixed space. Two consequences follow that the exam tests repeatedly. First, the dimensionality is a property of the model, not something you choose per document. Second, the query embedder and the index embedder must be the same model — vectors from two different models are not comparable, so mixing them produces meaningless similarity scores. Switching embedding models therefore forces a full reindex: every stored vector must be regenerated with the new model before queries will work.
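One way to make the "same model on both sides" rule hard to break is to store the embedder's identity next to the index and check it before every query. The sketch below is a minimal illustration under assumptions: `EmbedderSpec`, `check_compatible`, and the model names are hypothetical helpers for this guide, not part of any Databricks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedderSpec:
    """Identifies the model that produced a set of vectors (hypothetical helper)."""
    model_name: str
    dimension: int


def check_compatible(index_spec: EmbedderSpec, query_spec: EmbedderSpec) -> None:
    """Raise ValueError if the query embedder differs from the index embedder."""
    if index_spec.model_name != query_spec.model_name:
        raise ValueError(
            f"Index was built with {index_spec.model_name!r} but the query uses "
            f"{query_spec.model_name!r}; reindex before querying."
        )
    if index_spec.dimension != query_spec.dimension:
        raise ValueError(
            f"Dimension mismatch: index={index_spec.dimension}, "
            f"query={query_spec.dimension}."
        )


# Placeholder model names, used only for illustration.
index_spec = EmbedderSpec(model_name="model-a", dimension=1024)
query_spec = EmbedderSpec(model_name="model-a", dimension=1024)
check_compatible(index_spec, query_spec)  # passes silently when they match
```

Note that matching dimensions alone is not enough: two different models can both emit 1024-dimensional vectors and still be incomparable, which is why the model name is checked first.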

Embedding models on Databricks

Databricks exposes embedding models through Foundation Model APIs, the platform's hosted access to base models. You can call a pay-per-token embedding model (the GTE and BGE families are common defaults), or provision your own embedding model behind a Model Serving endpoint. Either way, the embedding endpoint is a callable service: you send text, it returns vectors. Choose an embedding model based on language coverage, dimensionality, latency, and — most importantly — quality on your own evaluation set, because embedding quality varies by domain and no single model wins everywhere.

Similarity metrics

How do you measure "nearby"? Three metrics dominate:

Metric | Measures | Notes
Cosine similarity | Angle between two vectors | Ignores magnitude; the usual default for text. Range roughly -1 to 1
Dot product (inner product) | Projection; combines angle and magnitude | Equivalent to cosine when vectors are normalized
L2 (Euclidean) distance | Straight-line distance | Smaller is more similar; sensitive to magnitude

The key exam nuance: cosine similarity ignores vector length and compares only direction, which is why it is robust for text where document length varies. If you use self-managed embeddings and want cosine behavior, Databricks recommends normalizing the embeddings before indexing — scaling each vector to unit length so the similarity calculation depends on direction rather than raw magnitude. Skipping normalization can make longer or higher-magnitude vectors dominate matches for the wrong reasons. Once vectors are normalized, dot product and cosine similarity rank results identically.
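The relationship between the three metrics is easier to see with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names and the toy three-dimensional vectors are illustrative assumptions, and a real pipeline would typically use a numeric library on full-size vectors.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; independent of their lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def l2_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    """Scale a vector to unit length so only its direction remains."""
    n = norm(a)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in a]


# Toy vectors standing in for a query embedding and a chunk embedding.
query = [0.2, 0.9, 0.4]
chunk = [1.0, 3.0, 2.0]

q_unit, c_unit = normalize(query), normalize(chunk)

# After normalization, the plain dot product equals the cosine similarity
# of the original vectors.
print(dot(q_unit, c_unit), cosine_similarity(query, chunk))
```

For unit vectors the squared L2 distance equals 2 minus twice the cosine similarity, so once embeddings are normalized all three metrics order candidates the same way.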

Dimensionality and endpoint constraints

Higher dimensionality can capture more nuance but costs more storage and compute per query, so it is a trade-off, not a free upgrade. Databricks also enforces a concrete constraint you should memorize: for a storage-optimized Vector Search endpoint that uses precomputed (self-managed) embeddings, the embedding dimension must be evenly divisible by 16. This is a compatibility requirement of that endpoint type, not a general rule of embeddings — but it is exactly the kind of specific fact the exam likes. A dimension of 1024 (64 x 16) is fine; an arbitrary 1000 is not.
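The divisibility rule is simple enough to check before you create an index. Here is a small sketch of such a pre-flight check; the function name is a hypothetical helper for illustration, not a Databricks API, and it encodes only the rule stated above.

```python
def is_valid_storage_optimized_dimension(dim: int) -> bool:
    """Return True if dim is a positive multiple of 16.

    Encodes only the rule described in this section for storage-optimized
    endpoints with precomputed embeddings; it is not an official validator.
    """
    return dim > 0 and dim % 16 == 0


for candidate in (384, 768, 1000, 1024):
    status = "ok" if is_valid_storage_optimized_dimension(candidate) else "rejected"
    print(candidate, status)
```

Running a check like this during ingestion catches an incompatible model choice before any vectors are written.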

Batch embedding and precomputation

Embeddings are generated at two points: at ingestion time for every chunk (in batch), and at query time for each incoming question. Batch-embedding the corpus during ingestion is standard, and precomputing chunk vectors keeps that work off the query path. When your app has strict low-latency requirements and you can precompute embeddings during ingestion, the fastest query-time setup is self-managed embeddings with the query vector supplied directly: Databricks skips the embedding-generation step and goes straight to vector lookup. By contrast, managed embeddings compute the query vector from text at request time, which adds inference work (and a cold-start risk if the embedding endpoint is scaled to zero) to every call.
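The ingestion-time half of that picture can be sketched as a simple batching loop. This is an illustration under assumptions: `embed_corpus` and `batched` are hypothetical helpers, and `fake_embed_batch` is a stand-in that returns toy vectors rather than calling a real embedding endpoint. In practice you would pass a function that calls your own endpoint.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
) -> List[Vector]:
    """Embed every chunk once at ingestion time, one batch per call."""
    vectors: List[Vector] = []
    for batch in batched(chunks, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        vectors.extend(result)
    return vectors


def fake_embed_batch(texts: Sequence[str]) -> List[Vector]:
    """Stand-in embedder: NOT a real model, just deterministic toy vectors."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


corpus_vectors = embed_corpus(["a b", "hello", "x y z"], fake_embed_batch, batch_size=2)
```

The stored vectors are then written to the index once, and the same embedder is used at query time to produce the query vector your app supplies.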

Common traps

Watch for three recurring mistakes. (1) Assuming you can freely swap embedding models: you cannot without reindexing. (2) Forgetting to normalize when you intend to use cosine similarity, which quietly degrades ranking. (3) Confusing semantic and keyword search: embeddings power meaning-based matching, while keyword/BM25-style search matches literal terms. Semantic search shines on paraphrases and synonyms; keyword matching shines on exact codes, SKUs, and proper nouns, which is exactly why hybrid retrieval (covered next) exists.

A worked cosine example

Suppose two chunk vectors point in nearly the same direction but one is twice as long. Under L2 distance the two look far apart purely because of the difference in magnitude, even though they mean the same thing. Under cosine similarity the pair scores near 1.0 because only the angle matters. This is why cosine is the safe default for text of varying length, and why normalizing to unit length before indexing makes dot product and cosine agree: after normalization every vector has length 1, so the inner product of two vectors is the cosine of the angle between them.
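To put numbers on this, the sketch below uses two toy vectors where one is exactly twice the other (a simplification of "nearly the same direction", chosen so the arithmetic is easy to follow). The vectors and variable names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


short_vec = [1.0, 2.0, 2.0]  # Euclidean length 3
long_vec = [2.0, 4.0, 4.0]   # same direction, Euclidean length 6

# L2 distance sees only the gap in magnitude: sqrt(1 + 4 + 4) = 3.
l2_raw = math.dist(short_vec, long_vec)

# Cosine similarity sees only the angle, which is zero here, so it is 1.
cos_raw = cosine_similarity(short_vec, long_vec)

# After scaling both to unit length, the two vectors coincide.
short_unit = [x / math.hypot(*short_vec) for x in short_vec]
long_unit = [x / math.hypot(*long_vec) for x in long_vec]
l2_unit = math.dist(short_unit, long_unit)

print(l2_raw, cos_raw, l2_unit)
```

The raw L2 distance is as large as the shorter vector itself, while cosine similarity reports a perfect match; after normalization the L2 distance between the two collapses to zero.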

Re-ranking the candidate pool

Embeddings drive the first-stage retrieval, but they are not the last word on ordering. When recall is acceptable (the right chunk is somewhere in the top pool) yet top-k precision is poor, add a re-ranking stage: pull a larger candidate set with the embedding index, then re-score those candidates with a more expensive model that pushes the most relevant chunks to the very top. Re-ranking trades a little latency per query for better ordering, so apply it only after first-stage embedding retrieval is already working — it improves the use of your embeddings rather than replacing them.
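The two-stage flow can be sketched in a few lines. This is a toy illustration, not a production pattern: `top_k_by_dot` is a brute-force stand-in for the vector index, and `toy_rerank_score` is a hypothetical word-overlap scorer standing in for a real re-ranking model. All names and data are assumptions made for the example.

```python
from typing import Callable, List, Sequence


def top_k_by_dot(query_vec: Sequence[float],
                 chunk_vecs: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """First stage: indices of the k chunks with the highest dot product."""
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), i)
        for i, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


def retrieve_then_rerank(query_text: str,
                         query_vec: Sequence[float],
                         chunks: Sequence[str],
                         chunk_vecs: Sequence[Sequence[float]],
                         rerank_score: Callable[[str, str], float],
                         pool_size: int = 20,
                         final_k: int = 3) -> List[str]:
    """Pull a wide candidate pool, then reorder it with a costlier scorer."""
    candidates = top_k_by_dot(query_vec, chunk_vecs, pool_size)
    reranked = sorted(
        candidates,
        key=lambda i: rerank_score(query_text, chunks[i]),
        reverse=True,
    )
    return [chunks[i] for i in reranked[:final_k]]


def toy_rerank_score(query: str, chunk: str) -> float:
    """Toy stand-in for a real re-ranker: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


chunks = [
    "reset your password in settings",
    "billing and invoices",
    "password policy for admins",
]
chunk_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]

best = retrieve_then_rerank(
    "password policy", [1.0, 0.0], chunks, chunk_vecs,
    toy_rerank_score, pool_size=2, final_k=1,
)
```

In this toy data the first stage ranks the "reset your password" chunk highest, but the re-ranker promotes the "password policy" chunk, which is the situation described above: the right chunk was already in the pool and only the ordering needed fixing.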

Test Your Knowledge

You are using self-managed embeddings with Databricks Vector Search and want cosine-similarity behavior. Which preparation step does Databricks recommend?


You precomputed embeddings and want to use a storage-optimized Vector Search endpoint. Which requirement applies to the embedding dimension?


A team switches to a new embedding model and offline answer quality drops. Which fact best explains why old vectors cannot simply be reused with the new query embedder?
