4.2 Tokens, Context Windows, Embeddings, and Vector Search

Key Takeaways

  • Tokens are the units a generative model reads and writes, so token volume affects cost, latency, and how much context can fit in one request.
  • A context window is the total space available for instructions, user input, retrieved content, and output, but a larger window does not automatically mean better answers.
  • Embeddings convert content into numeric vectors that represent semantic meaning and support similarity search.
  • Vector search is a core building block for retrieval-augmented generation (RAG), where trusted content is retrieved before a model generates an answer.

Tokens and context windows

A token is a chunk of text that a model processes. In English, a token may be a whole word, part of a word, punctuation, or a formatting marker, and the exact tokenization depends on the model. Practitioners do not need to tokenize text manually, but they should understand that model input and output are measured in tokens, because token counts drive cost, latency, and the maximum amount of information a model can consider in one request.
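
To make that concrete, the snippet below counts tokens with the open-source tiktoken library. This is illustrative only: each model family ships its own tokenizer, so the counts for the same text differ from model to model.

```python
# Illustrative token counting. "cl100k_base" is one of tiktoken's built-in
# encodings, used here only as an example; Bedrock models use their own
# tokenizers, so real counts will differ.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "Please reset my corporate password."
tokens = encoding.encode(text)

print(len(tokens))               # how many tokens this encoding produces
print(encoding.decode(tokens))   # decoding round-trips the original text
```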

A context window is the model's working space for a single interaction. It includes system instructions, developer or application instructions, user input, retrieved passages, examples, conversation history, and the model's generated output. If too much content is included, the request may exceed the window or force the application to remove useful material. If too little content is included, the model may answer from general knowledge rather than the organization's source of truth.
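
A sketch of that budgeting decision appears below. The count_tokens helper is a crude stand-in (roughly four characters per token is a common rule of thumb for English); a real application would use the tokenizer that matches its model.

```python
# Context-budget sketch. count_tokens() is a rough stand-in; swap in the
# tokenizer that matches your model before trusting the numbers.

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # ~4 characters per token, English heuristic

def fit_context(instructions: str, question: str, passages: list[str],
                window: int = 8_000, reserved_for_output: int = 1_000) -> list[str]:
    """Keep the highest-ranked passages that still fit in the window."""
    budget = window - reserved_for_output
    budget -= count_tokens(instructions) + count_tokens(question)
    kept = []
    for passage in passages:        # assumed sorted most-relevant first
        cost = count_tokens(passage)
        if cost > budget:
            break                   # dropping, not truncating, keeps chunks intact
        kept.append(passage)
        budget -= cost
    return kept
```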

A larger context window is helpful, but it is not the same as accuracy or memory. Long prompts can bury the most important facts, mix outdated and current policy, or include irrelevant content that distracts the model. The practical question is not simply how many tokens fit, but whether the application gives the model the right evidence, in the right order, with clear instructions about what to do when evidence is missing.

Embeddings and vector search

Embeddings are numeric representations of meaning. An embedding model converts text, and sometimes other content types, into a vector: a list of numbers that captures semantic relationships. Similar ideas tend to have vectors that are close together, even when the exact words differ. That is why a search for "password reset" can find a document that says "account recovery." The application is not just matching keywords; it is comparing meaning.
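
The geometry behind that claim can be shown with plain vector math. The three-dimensional vectors below are toy values; real embedding models produce hundreds or thousands of dimensions, but cosine similarity works the same way.

```python
# Cosine similarity over toy 3-dimensional "embeddings". Real embeddings
# have hundreds or thousands of dimensions; the comparison is identical.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

password_reset   = [0.9, 0.1, 0.3]   # "password reset"
account_recovery = [0.8, 0.2, 0.4]   # "account recovery": close in meaning
lunch_menu       = [0.1, 0.9, 0.0]   # "cafeteria lunch menu": unrelated

print(cosine_similarity(password_reset, account_recovery))  # high, ~0.98
print(cosine_similarity(password_reset, lunch_menu))        # low,  ~0.21
```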

Vector search uses those embeddings to find similar content quickly. In a retrieval-augmented generation workflow, the application embeds a user question, searches a vector index for relevant chunks, sends those chunks to the foundation model as context, and asks the model to answer using the retrieved material. AWS services support this pattern in different ways. Knowledge Bases for Amazon Bedrock can connect data sources, create embeddings, manage a vector store, and retrieve relevant context for Bedrock applications.
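
The sketch below shows that flow through a Bedrock knowledge base with boto3. The knowledge base ID and model ARN are placeholders, and the request shape should be verified against current AWS documentation before use.

```python
# RAG in one managed call via a Bedrock knowledge base. The knowledge base
# ID and model ARN are placeholders; verify the request shape against the
# current boto3 / Amazon Bedrock documentation.
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "How do I reset my corporate password?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR-KB-ID",   # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/YOUR-MODEL-ID",
        },
    },
)

print(response["output"]["text"])   # generated answer; the response also
                                    # carries citations back to source chunks
```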

Amazon OpenSearch Service and Amazon OpenSearch Serverless are common choices for the vector store itself in AWS architectures.
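
As a sketch, a k-nearest-neighbor query against an OpenSearch index could look like the following. The index name, the vector field, and the embed() helper are hypothetical, and the index is assumed to be mapped with a knn_vector field.

```python
# k-NN query sketch with the opensearch-py client. The index name, the
# chunk_embedding field, and embed() are hypothetical; the index is assumed
# to have a knn_vector mapping.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://your-domain-endpoint:443"])

query_vector = embed("how do I reset my password?")   # hypothetical helper

results = client.search(
    index="company-docs",
    body={
        "size": 3,   # return the 3 nearest chunks
        "query": {
            "knn": {
                "chunk_embedding": {"vector": query_vector, "k": 3}
            }
        },
    },
)

for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```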

| Building block | Practical purpose | Common pitfall |
| --- | --- | --- |
| Token | Unit of model input or output | Ignoring token cost and response length |
| Context window | Space for prompt, evidence, history, and output | Filling it with noisy or stale material |
| Embedding | Numeric representation of meaning | Assuming embeddings prove facts are true |
| Vector index | Search structure for similar vectors | Forgetting permissions, freshness, or source quality |
| RAG | Retrieve trusted context before generation | Treating retrieval as a guarantee against every error |

What a non-builder should ask

A business sponsor does not need to tune embedding dimensions, but the sponsor should ask whether the content pipeline is trustworthy. Retrieval quality depends on clean source documents, sensible chunking, useful metadata, access controls, and regular refresh. A policy manual split into arbitrary fragments may surface incomplete guidance. A knowledge base that indexes obsolete files may produce answers that are fluent but wrong. A vector index without permission filtering can expose information to users who should not see it.
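
To see why chunking matters, the naive splitter below cuts on a fixed character count with overlap, so a sentence severed at one boundary still appears whole in a neighboring chunk. This is a sketch; production pipelines usually split on document structure such as headings and paragraphs.

```python
# Naive fixed-size chunker with overlap. Production pipelines usually split
# on structure (headings, paragraphs) and attach metadata such as source,
# section, and last-reviewed date to every chunk.

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap   # step back so boundary sentences repeat
    return chunks
```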

For an AWS AI Practitioner candidate, the scenario judgment is straightforward. If a chatbot must answer questions from private company documents, plain prompting is usually weaker than a retrieval design. If the task is open-ended creative drafting, embeddings may be less central. If the answer must cite approved procedures, retrieval, citations, and refusal behavior become important. When cost or latency is tight, the team may need smaller prompts, better retrieval filters, or a model that balances speed and quality.

Use this readiness checklist before approving a vector-search-backed GenAI workflow:

  • Source documents are authoritative, current, and owned by a business team.
  • Sensitive data has been reviewed before indexing, with help from services such as Amazon Macie when appropriate.
  • Users can only retrieve content they are allowed to access, enforced through IAM-aware application design and source permissions (see the sketch after this list).
  • The application has a plan for content refresh, deletion, retention, and audit logging.
  • Test questions cover synonyms, abbreviations, missing information, and conflicting documents.
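
One way to make the access-control item concrete is an application-level filter that drops any retrieved chunk the requesting user cannot see. The allowed_groups metadata field below is hypothetical; stronger designs also filter inside the vector store itself so restricted chunks never leave the index.

```python
# Application-level permission filter. The allowed_groups metadata field is
# hypothetical; stronger designs also filter inside the vector store so
# restricted chunks are never retrieved at all.

def filter_by_permission(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks whose allowed_groups overlap the user's groups."""
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", []))
    ]

chunks = [
    {"text": "HR leave policy...", "allowed_groups": ["hr", "all-staff"]},
    {"text": "M&A due diligence notes...", "allowed_groups": ["exec"]},
]

print(filter_by_permission(chunks, user_groups={"all-staff"}))
# only the HR chunk survives
```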

The important mental model is that embeddings find likely relevant content, while the foundation model turns selected context into a natural-language answer. Both pieces can fail in different ways. Retrieval can miss the right document, retrieve a weak chunk, or expose stale information. Generation can overstate, omit uncertainty, or blend retrieved facts with general language. Strong designs evaluate the full path, not only the model.
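
A lightweight way to evaluate the retrieval half of that path is recall@k over a hand-written test set, as sketched below; search() is a hypothetical helper that returns the document IDs of the top-k retrieved chunks.

```python
# Recall@k over a hand-written test set. search() is a hypothetical helper
# returning the document IDs of the top-k retrieved chunks for a question.

def recall_at_k(test_set: list[dict], k: int = 5) -> float:
    """Fraction of questions whose expected document shows up in the top k."""
    hits = sum(
        1 for case in test_set
        if case["expected_doc_id"] in search(case["question"], k=k)
    )
    return hits / len(test_set)

test_set = [
    {"question": "How do I reset my password?",      "expected_doc_id": "it-0042"},
    {"question": "What is the PTO carryover limit?", "expected_doc_id": "hr-0007"},
    {"question": "Who approves travel over $5,000?", "expected_doc_id": "fin-0113"},
]
```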

Test Your Knowledge

Why do tokens matter in a generative AI application?

What is the role of embeddings in a RAG workflow?

A model has a very large context window. What should a practitioner still worry about?
