1.1 Transformers, tokenization, embeddings & context window

Key Takeaways

  • Transformers use self-attention to weigh every token against every other token in parallel, which is why LLM outputs are context-sensitive and probabilistic rather than fixed lookups.
  • Tokens are subword units produced by algorithms such as Byte-Pair Encoding, so model limits and API costs are measured in tokens, not words or characters.
  • Embeddings represent meaning as numeric vectors, and cosine similarity measures how closely two texts relate, enabling RAG retrieval and semantic test oracles.
  • The context window is a finite token budget covering system prompt, user prompt, retrieved context, history and output; exceeding it forces truncation that silently drops information.
  • The 'lost in the middle' effect means models attend most reliably to the start and end of the input, making long-context and instruction-position testing essential.

Last updated: July 2026

Why testers need transformer fundamentals

Generative AI systems under test are almost always built on the transformer architecture. As a tester you will not implement a transformer, but you must understand it well enough to reason about failure modes, design meaningful tests, and interpret unexpected outputs. The ISTQB CT-GenAI syllabus expects you to explain these building blocks conceptually and connect each one to a concrete testing implication rather than to internal mathematics.

Self-attention and why it enabled LLMs

The transformer, introduced in 2017, replaced recurrent networks with a mechanism called self-attention. Self-attention lets the model weigh the relevance of every token to every other token in the input, in parallel, rather than processing text strictly left to right. This parallelism made training on massive corpora feasible and let models capture long-range dependencies — the reason a large language model (LLM) can keep a pronoun consistent with a subject mentioned many sentences earlier. For testers, the key consequence is that outputs are context-sensitive and probabilistic: the same phrase can be completed differently depending on the surrounding tokens, so test cases must control the full input, not just the visible instruction, and must expect variation rather than a single fixed answer.
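The weighting described above can be sketched in a few lines. This is a toy illustration only: it assumes queries, keys and values are the raw input vectors, whereas real models apply separate learned projections and many attention heads. The function names and the two-dimensional vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy self-attention: each output is a weighted mix of all input vectors.

    Assumption: no learned projections; query = key = value = input vector.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token (including itself).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up token vectors; the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)

# Changing only the last token changes the output for the first token too.
changed_outputs, _ = self_attention([[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]])
```

The last two lines show the testing point: the representation of the first token shifts when a different token elsewhere in the input changes, which is why a test case has to pin down the whole input.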

Most LLMs are decoder-only models that work by repeated next-token prediction: at each step the model uses self-attention over everything so far to compute a probability distribution over possible next tokens, selects one (typically by sampling from that distribution), appends it, and repeats. Generation is therefore a sampling process, not a lookup of a stored answer, which is why two runs of the same prompt can diverge. This reframes test design: expected results are often ranges, patterns or constraints rather than one exact string, and a single passing run is weak evidence.
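A minimal sketch of one sampling step may make the variation concrete. The scores, the token names and the `sample_next_token` helper are hypothetical. The temperature parameter follows the common convention of dividing scores before normalizing, and treating a temperature of 0 as greedy selection is an assumption of this sketch, not a statement about any specific API.

```python
import math
import random

def sample_next_token(logits, temperature, rng):
    """Pick one token from a dict of token -> raw score (logit)."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp(s - peak) for s in scaled.values()]
    return rng.choices(list(scaled.keys()), weights=weights, k=1)[0]

# Hypothetical scores for the token that follows "The test ...".
logits = {"passed": 2.0, "failed": 1.5, "skipped": 0.5}
rng = random.Random(42)  # seeded so the sketch is repeatable

samples = [sample_next_token(logits, 1.0, rng) for _ in range(200)]
distinct = set(samples)                      # several different tokens appear
greedy = {sample_next_token(logits, 0, rng) for _ in range(20)}  # only one
```

For test design this suggests running a prompt several times and asserting on a constraint that every acceptable output satisfies, rather than comparing one run to one string.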

Tokenization: tokens are not words

Before a model sees text, a tokenizer splits it into tokens — subword units produced by algorithms such as Byte-Pair Encoding (BPE). A token may be a whole word, a fragment such as "test" or "ing", a space-prefixed chunk, or a single character for rare strings. A rough rule of thumb in English is that one token is about four characters, or roughly three-quarters of a word, but this varies widely by language and content. Numbers, source code, emoji and non-English scripts often fragment into many more tokens than their length suggests.
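The merge idea behind BPE can be sketched with a toy trainer. This is a simplified illustration under stated assumptions: it works on characters instead of bytes, uses a tiny made-up word-frequency table, and omits details of production tokenizers such as space handling. The helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    """Start from single characters and apply the most frequent merge repeatedly."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Made-up frequencies: "test" variants are common, "resting" is rare.
corpus = {"test": 5, "testing": 4, "tested": 3, "resting": 2}
merges, words = train_bpe(corpus, num_merges=3)
for symbols, freq in words.items():
    print(freq, symbols)
```

Frequent character sequences collapse into larger tokens while rarer words stay split into small pieces, which is the behavior the rule of thumb above averages over.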

Tokenization has direct testing consequences. First, model limits and pricing are measured in tokens, not words, so a harness that budgets by character count can silently exceed a limit or mis-estimate cost. Second, tokenization explains classic failures: a model may miscount letters or mishandle spelling because it never truly "sees" individual characters, only tokens. Third, the same meaning can consume far more tokens in a language or script that the tokenizer represents less efficiently, which matters for multilingual coverage and for cost-based test design.
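The first consequence can be illustrated with a small budget check. Everything here is a sketch: `toy_count_tokens` is a hypothetical stand-in that only mimics the tendency of digits and symbols to fragment, and a real harness would call the tokenizer that matches the model under test.

```python
import math
import re

def estimate_tokens_from_chars(text, chars_per_token=4.0):
    """Rule-of-thumb estimate: roughly four characters per token in English."""
    return math.ceil(len(text) / chars_per_token)

def toy_count_tokens(text):
    """Hypothetical stand-in tokenizer, NOT a real one.

    Each run of ASCII letters counts as one token; every other
    non-space character (digit, symbol, non-ASCII) counts as its own token.
    """
    return len(re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text))

def check_budget(text, token_limit, count_tokens):
    """Compare the character-based estimate with the tokenizer's count."""
    estimated = estimate_tokens_from_chars(text)
    actual = count_tokens(text)
    return {
        "estimated": estimated,
        "actual": actual,
        "estimate_says_ok": estimated <= token_limit,
        "actually_ok": actual <= token_limit,
    }

report = check_budget("Order 1234567890 shipped", token_limit=8,
                      count_tokens=toy_count_tokens)
print(report)
```

A harness that trusts the character estimate would accept this input even though the tokenizer count exceeds the limit, which is the silent overrun described above.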

Embeddings and semantic similarity

Tokens are converted into embeddings: dense numeric vectors that place semantically similar items close together in a high-dimensional space. "Physician" and "doctor" land near each other even though they share few characters. Similarity between two vectors is usually measured with cosine similarity, which compares the angle between them. Its value ranges from -1 to 1: close to 1 for closely related meanings, around 0 for unrelated ones, and negative when the vectors point in opposing directions.
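Cosine similarity is short enough to write out in full. The three-dimensional vectors below are invented for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up vectors, not output from a real embedding model.
doctor = [0.9, 0.1, 0.0]
physician = [0.85, 0.15, 0.05]
banana = [0.0, 0.1, 0.95]

related = cosine_similarity(doctor, physician)    # close to 1
unrelated = cosine_similarity(doctor, banana)     # close to 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Because the measure depends only on direction, scaling a vector does not change the result, so texts of different lengths can still be compared.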

Embeddings are central to modern test practice in two ways. In retrieval-augmented generation (RAG), a query is embedded and compared against stored document vectors to fetch relevant context, so retrieval quality depends directly on embedding quality. As a semantic test oracle, embeddings let you check whether an output means the same as an expected answer even when the wording differs — far more robust than exact string matching for non-deterministic generation. Testers must remember, though, that high similarity does not guarantee factual correctness; a fluent, on-topic but wrong answer can still score highly against a reference.
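Both uses can be sketched with the same similarity function. The document ids, the vectors and the 0.8 threshold are all assumptions chosen for the example. In practice the vectors would come from an embedding model and the threshold would be calibrated against labeled examples.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, store, k=1):
    """Return the ids of the k stored vectors closest to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query_vec, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]

def semantic_match(output_vec, expected_vec, threshold=0.8):
    """Oracle: pass when the output is close enough in meaning to the reference."""
    return cosine_similarity(output_vec, expected_vec) >= threshold

# Hypothetical document store with made-up three-dimensional vectors.
store = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "careers": [0.0, 0.1, 0.9],
}
top = retrieve([0.8, 0.2, 0.1], store, k=2)

expected = [0.7, 0.3, 0.0]
paraphrase_ok = semantic_match([0.68, 0.32, 0.02], expected)   # similar wording
off_topic_ok = semantic_match([0.0, 0.2, 0.9], expected)       # unrelated answer
```

The oracle only measures closeness of meaning, so it is best paired with a separate factual check for the parts of the answer that must be exactly right.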

The context window and "lost in the middle"

Every model has a finite context window: the maximum number of tokens it can consider at once, covering the system prompt, user prompt, retrieved context, conversation history and the generated response together. When input exceeds this budget it must be truncated or summarized, and truncation silently drops information — a frequent, hard-to-spot defect that testers must design cases to catch. Even within the window, research shows a "lost in the middle" effect: models attend most reliably to content at the beginning and end of the input and can overlook facts buried in the middle. This makes long-context testing essential: verify behavior near the token limit, confirm that critical instructions are not dropped, and check whether relevant facts placed in different positions are actually used in the answer.
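Two of these checks lend themselves to small helpers: detecting what truncation drops, and generating inputs that place a key fact at different positions. The sketch below assumes token counts are already known for each conversation turn and uses a drop-oldest-first policy, which is only one of several possible truncation strategies. All names and numbers are hypothetical.

```python
def fit_history(history, budget):
    """Keep the most recent turns that fit; report which turns were dropped.

    history: list of (text, token_count) tuples, oldest first.
    Assumption: the system drops the oldest turns first.
    """
    kept, total = [], 0
    for text, tokens in reversed(history):
        if total + tokens > budget:
            break
        kept.append(text)
        total += tokens
    kept.reverse()
    dropped = [text for text, _ in history[: len(history) - len(kept)]]
    return kept, dropped

def needle_cases(filler, needle):
    """Build three inputs with the same fact at the start, middle and end."""
    middle = len(filler) // 2
    return {
        "start": [needle] + filler,
        "middle": filler[:middle] + [needle] + filler[middle:],
        "end": filler + [needle],
    }

# Made-up conversation with token counts per turn.
history = [
    ("Always answer in French.", 6),
    ("question 1", 500),
    ("answer 1", 700),
    ("question 2", 400),
]
kept, dropped = fit_history(history, budget=1200)

cases = needle_cases(
    ["filler A", "filler B", "filler C", "filler D"],
    "The release code is 7421.",
)
```

A test can then assert that no critical instruction appears in `dropped`, and can send each entry in `cases` to the system to compare whether the fact is used regardless of position.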

Summary table for testers

Concept | Definition | Testing implication
Self-attention | Weighs every token against every other token in parallel | Outputs are context-sensitive; control the entire input in each test
Token | Subword unit from BPE; not equal to a word | Budget and price by tokens; expect character-level errors
Embedding | Numeric vector representing meaning | Powers RAG retrieval and semantic oracles
Cosine similarity | Angle-based closeness of two vectors | Meaning-based assertions, not a correctness proof
Context window | Maximum tokens processed at once | Test truncation and "lost in the middle" near the limit

Test Your Knowledge

A tester budgets model inputs by counting characters and is surprised when requests still exceed the model's limit. Which foundational concept best explains this?

Test Your Knowledge

In a RAG system, why are embeddings and cosine similarity important to a tester?

Test Your Knowledge

A long document is pasted into a prompt and the model ignores a critical instruction placed in the middle of it. Which effect best describes this, and what testing does it motivate?