A tester budgets model inputs by counting characters and is surprised when requests still exceed the model's limit. Which foundational concept best explains this?

Model limits are measured in tokens, which are subword units that do not map one-to-one to characters or words. Pricing is counted the same way, and the units themselves are produced by tokenizers such as BPE. Because a token is neither a character nor a word, a character-based budget can silently exceed the model's token limit.

In a RAG system, why are embeddings and cosine similarity important to a tester?

They let the retriever find semantically relevant documents and let testers assert that an output means the same as an expected answer. Embeddings place similar meanings close together and cosine similarity measures that closeness, which drives retrieval and enables meaning-based oracles. High similarity still does not prove factual correctness, so it supports but does not replace correctness checks.

Transformers, tokenization, embeddings & context window

Why testers need transformer fundamentals

Generative AI systems under test are almost always built on the transformer architecture. As a tester you will not implement a transformer, but you must understand it well enough to reason about failure modes, design meaningful tests, and interpret unexpected outputs. The ISTQB CT-GenAI syllabus expects you to explain these building blocks conceptually and connect each one to a concrete testing implication rather than to internal mathematics.

Self-attention and why it enabled LLMs

The transformer, introduced in 2017, replaced recurrent networks with a mechanism called self-attention. Self-attention lets the model weigh the relevance of every token to every other token in the input, in parallel, rather than processing text strictly left to right. This parallelism made training on massive corpora feasible and let models capture long-range dependencies — the reason a large language model (LLM) can keep a pronoun consistent with a subject mentioned many sentences earlier. For testers, the key consequence is that outputs are context-sensitive and probabilistic: the same phrase can be completed differently depending on the surrounding tokens, so test cases must control the full input, not just the visible instruction, and must expect variation rather than a single fixed answer.
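To make "every token weighed against every other token" concrete, the following is a minimal sketch of single-head scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the toy two-dimensional vectors are invented, and the input vectors serve directly as queries, keys and values, whereas a real transformer applies learned projection matrices and uses many heads and layers.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries = keys = values = the input vectors.

    Assumption for illustration only: no learned weight matrices are applied.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the input, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted mix of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy "token vectors"; tokens 0 and 2 point in similar directions.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
outputs, weights = self_attention(tokens)
```

In this toy input, token 0 gives more weight to token 2 than to token 1, because their vectors are more alike. Changing any one input vector changes the weights for all tokens, which is the mechanical reason a test must control the whole input.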

Most LLMs are decoder-only models that work by repeated next-token prediction: at each step the model uses self-attention over everything so far to produce a probability distribution over possible next tokens, picks one (usually by sampling rather than by always taking the most likely), appends it, and repeats. Generation is therefore a sampling process, not a lookup of a stored answer, which is why two runs of the same prompt can diverge. This reframes test design: expected results are often ranges, patterns or constraints rather than one exact string, and a single passing run is weak evidence.
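The sampling step can be sketched with a toy distribution. Everything here is hypothetical: the three candidate tokens, their scores (logits), and the `sample_next_token` helper are invented for illustration, and a real model scores tens of thousands of candidate tokens at every step.

```python
import math
import random

def sample_next_token(logits, temperature, rng):
    """Pick one token from toy logits.

    Temperature 0 is treated as greedy decoding (always the top token);
    higher temperatures spread probability over the other candidates.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {token: score / temperature for token, score in logits.items()}
    highest = max(scaled.values())
    weights = [math.exp(value - highest) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]

# Hypothetical scores for the token that follows "The test ..."
TOY_LOGITS = {"passed": 2.0, "failed": 1.5, "skipped": 0.5}

rng = random.Random(7)  # seeded only so that the sketch is repeatable
greedy_runs = {sample_next_token(TOY_LOGITS, 0, rng) for _ in range(20)}
sampled_runs = [sample_next_token(TOY_LOGITS, 1.0, rng) for _ in range(200)]
```

Greedy decoding returns the same token on every run, while sampling at temperature 1.0 produces a mix of all plausible tokens. A test oracle for sampled output therefore checks a constraint, such as "the token is one of the allowed values", across many runs instead of comparing one run against one string.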

Tokenization: tokens are not words

Before a model sees text, a tokenizer splits it into tokens — subword units produced by algorithms such as Byte-Pair Encoding (BPE). A token may be a whole word, a fragment such as "test" or "ing", a space-prefixed chunk, or a single character for rare strings. A rough rule of thumb in English is that one token is about four characters, or roughly three-quarters of a word, but this varies widely by language and content. Numbers, source code, emoji and non-English scripts often fragment into many more tokens than their length suggests.

Tokenization has direct testing consequences. First, model limits and pricing are measured in tokens, not words, so a harness that budgets by character count can silently exceed a limit or mis-estimate cost. Second, tokenization explains classic failures: a model may miscount letters or mishandle spelling because it never truly "sees" individual characters, only tokens. Third, the same meaning expressed in a different language or script can consume far more tokens than its English equivalent, which matters for multilingual coverage and for cost-based test design.
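The first consequence can be illustrated with a small budget check. The `toy_token_count` function below is a hypothetical stand-in and not a real tokenizer: it assumes roughly four ASCII characters per token and one token per non-ASCII character, which only loosely mirrors the rule of thumb above. In a real harness, the counting function would be the tokenizer that belongs to the model under test.

```python
def toy_token_count(text):
    """Hypothetical token counter, for illustration only.

    Assumptions: each whitespace-separated word costs one token per
    4 ASCII characters (rounded up), plus one token per non-ASCII character.
    """
    count = 0
    for word in text.split():
        ascii_chars = sum(1 for ch in word if ch.isascii())
        non_ascii_chars = len(word) - ascii_chars
        count += -(-ascii_chars // 4) + non_ascii_chars  # ceiling division
    return count

def fits_budget(text, token_limit, count_tokens=toy_token_count):
    """Budget by tokens, with the counter injected so it can be swapped."""
    return count_tokens(text) <= token_limit

english = "testing tokens"      # 14 characters
emoji = "\U0001F642" * 14       # also 14 characters, all non-ASCII

print(len(english), toy_token_count(english))
print(len(emoji), toy_token_count(emoji))
```

Both strings have the same character count, yet under these assumptions the emoji string costs several times as many tokens and fails a limit that the English string passes. Passing the counter as a parameter keeps the budget logic testable independently of any particular tokenizer.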

Embeddings and semantic similarity

Tokens are converted into embeddings — dense numeric vectors that place semantically similar items close together in a high-dimensional space. "Physician" and "doctor" land near each other even though they share no characters. Similarity between two vectors is usually measured with cosine similarity, which compares the angle between them and returns a value near 1 for closely related meanings and near 0 for unrelated ones.
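Cosine similarity itself is a short calculation: the dot product of the two vectors divided by the product of their lengths. The sketch below uses invented three-dimensional vectors purely for illustration; real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not output from any real model.
physician = [0.9, 0.1, 0.0]
doctor = [0.8, 0.2, 0.0]
banana = [0.0, 0.1, 0.9]

related = cosine_similarity(physician, doctor)
unrelated = cosine_similarity(physician, banana)
```

With these toy values, `related` comes out close to 1 and `unrelated` close to 0, matching the description above. The measure depends only on direction, so scaling a vector does not change its similarity to another.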

Embeddings are central to modern test practice in two ways. In retrieval-augmented generation (RAG), a query is embedded and compared against stored document vectors to fetch relevant context, so retrieval quality depends directly on embedding quality. As a semantic test oracle, embeddings let you check whether an output means the same as an expected answer even when the wording differs — far more robust than exact string matching for non-deterministic generation. Testers must remember, though, that high similarity does not guarantee factual correctness; a fluent, on-topic but wrong answer can still score highly against a reference.
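Both uses, and the caveat, can be sketched together. All vectors, document names and the 0.9 threshold below are assumptions chosen for illustration; in practice the vectors would come from an embedding model, and the threshold would be calibrated against labelled examples.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k=1):
    """Return the ids of the k stored documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

def semantically_matches(output_vec, expected_vec, threshold=0.9):
    """Semantic oracle: pass when the output is close enough in meaning."""
    return cosine_similarity(output_vec, expected_vec) >= threshold

# Hypothetical document and query embeddings.
docs = {"refund_policy": [0.9, 0.1, 0.1], "shipping_times": [0.1, 0.9, 0.1]}
query = [0.8, 0.2, 0.1]
top_hit = retrieve(query, docs)

# Hypothetical answer embeddings.
expected = [0.70, 0.10, 0.60]    # "Refunds are accepted within 30 days."
paraphrase = [0.68, 0.12, 0.62]  # "You can return items for 30 days."
wrong_fact = [0.70, 0.12, 0.58]  # "Refunds are accepted within 90 days."

paraphrase_passes = semantically_matches(paraphrase, expected)
wrong_fact_passes = semantically_matches(wrong_fact, expected)
```

The paraphrase passes the oracle, which is the desired behavior. The answer with the wrong number of days passes as well, because it is on-topic and worded almost identically. This is why a similarity oracle is usually paired with a targeted factual check, for example an assertion on the specific value.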

The context window and "lost in the middle"

Every model has a finite context window: the maximum number of tokens it can consider at once, covering the system prompt, user prompt, retrieved context, conversation history and the generated response together. When input exceeds this budget it must be truncated or summarized, and truncation silently drops information — a frequent, hard-to-spot defect that testers must design cases to catch. Even within the window, research shows a "lost in the middle" effect: models attend most reliably to content at the beginning and end of the input and can overlook facts buried in the middle. This makes long-context testing essential: verify behavior near the token limit, confirm that critical instructions are not dropped, and check whether relevant facts placed in different positions are actually used in the answer.
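Two of these checks lend themselves to small helpers: detecting what a token budget forces out of the window, and generating test inputs that place the same fact at different positions. The sketch below is one possible shape for such helpers. The function names, the priority-order packing strategy and the whitespace-based token counter are all assumptions for illustration, and a real harness would count with the model's own tokenizer.

```python
def fit_to_window(segments, max_tokens, reserve_for_output, count_tokens):
    """Pack prompt segments in priority order and report what did not fit.

    Returning the dropped segment names makes truncation visible to a test
    instead of letting it happen silently.
    """
    budget = max_tokens - reserve_for_output
    kept, dropped, used = [], [], 0
    for name, text in segments:
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.append(name)
            used += cost
        else:
            dropped.append(name)
    return kept, dropped, used

def needle_prompts(filler_sentences, needle):
    """Build three long inputs with the same fact at start, middle and end."""
    insert_at = {
        "start": 0,
        "middle": len(filler_sentences) // 2,
        "end": len(filler_sentences),
    }
    prompts = {}
    for position, index in insert_at.items():
        sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
        prompts[position] = " ".join(sentences)
    return prompts

def count_words(text):
    """Hypothetical token counter: one token per whitespace-separated word."""
    return len(text.split())

segments = [
    ("system", "Answer only from the context."),
    ("user", "When does the warranty expire?"),
    ("retrieved", "The warranty expires after 24 months. " * 5),
    ("history", "Earlier small talk. " * 10),
]
kept, dropped, used = fit_to_window(
    segments, max_tokens=60, reserve_for_output=15, count_tokens=count_words
)

filler = [f"Filler sentence {i}." for i in range(10)]
needle = "The access code is 4711."
prompts = needle_prompts(filler, needle)
```

A truncation test can then assert that critical segments such as the system prompt never appear in `dropped`. A position test sends each entry of `prompts` to the system under test with the same question and compares how often the fact is used in the answer at each position.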

Summary table for testers

Concept	Definition	Testing implication
Self-attention	Weighs every token against every other token in parallel	Outputs are context-sensitive; control the entire input in each test
Token	Subword unit from BPE; not equal to a word	Budget and price by tokens; expect character-level errors
Embedding	Numeric vector representing meaning	Powers RAG retrieval and semantic oracles
Cosine similarity	Angle-based closeness of two vectors	Meaning-based assertions, not a correctness proof
Context window	Maximum tokens processed at once	Test truncation and "lost in the middle" near the limit

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

1.1 Transformers, tokenization, embeddings & context window

