2.3 Robustness, privacy, latency & cost

Key Takeaways

  • Robustness is the stability of output quality under meaning-preserving perturbations and adversarial input, verified with invariance and metamorphic tests.
  • Privacy covers PII leakage and training-data memorization/extraction, measured with PII detectors, extraction/canary attacks, and redaction checks.
  • Latency has structure: time-to-first-token, inter-token speed, end-to-end latency, and system throughput, reported as percentiles under load.
  • Cost is a real operational constraint because inference is billed per token, with input and output tokens usually priced separately.
  • These non-functional attributes need explicit thresholds, since a correct answer that is slow, expensive, or leaks data still fails in production.
Last updated: July 2026

Non-functional quality attributes

Robustness, privacy, latency, and cost describe how a GenAI system behaves as an operational component rather than the meaning of any single answer. They are non-functional attributes: they constrain how the system responds, not what it says. Testers treat them as first-class requirements with explicit thresholds, because a correct answer that is slow, expensive, leaks personal data, or collapses under a typo still fails in production. Each attribute is verified against a target — for example a p95 latency budget or a maximum cost-per-session — just like a performance or security requirement in classic testing.

Robustness

Robustness is the stability of output quality when the input is perturbed. Small, meaning-preserving changes — typos, extra whitespace, a paraphrase, reordered sentences, or a different but equivalent prompt template — should not swing the answer. Robustness also covers adversarial input: prompt injection, distractor text, and out-of-distribution or malformed inputs.

How to measure: apply controlled perturbations to a fixed test set and measure the change in output or metric (invariance testing). Useful signals are the rate at which a correct answer flips to incorrect under perturbation, the variance of outputs across paraphrases, and the success rate of adversarial or injection attacks. Metamorphic testing is the standard technique: define a relation such as "paraphrasing the question must not change the answer," then flag every violation of that relation. Robustness testing must also account for the model's own non-determinism — the same prompt can yield different outputs on repeat calls — so a tester distinguishes acceptable natural variation from a genuine instability triggered by the perturbation, often by sampling several runs per input.
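The flip-rate signal and the repeated-sampling idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference harness: `toy_model` is a hypothetical, deliberately brittle stand-in for a real model call, `swap_middle_chars` is one crude typo perturbation, and "correct" is assumed to mean an exact match after lowercasing and trimming.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def majority_answer(model: Callable[[str], str], prompt: str, runs: int = 5) -> str:
    """Sample the model several times and return the most common normalized answer.

    Repeating the call helps separate ordinary run-to-run variation from a
    flip that is actually triggered by the perturbation.
    """
    answers = [model(prompt).strip().lower() for _ in range(runs)]
    return Counter(answers).most_common(1)[0][0]


def flip_rate(
    model: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
    perturb: Callable[[str], str],
    runs: int = 5,
) -> float:
    """Fraction of initially correct cases that become incorrect after perturbation."""
    correct_before = 0
    flipped = 0
    for prompt, expected in cases:
        target = expected.strip().lower()
        if majority_answer(model, prompt, runs) != target:
            continue  # only cases the model gets right unperturbed can "flip"
        correct_before += 1
        # Metamorphic relation: a meaning-preserving change must not change the answer.
        if majority_answer(model, perturb(prompt), runs) != target:
            flipped += 1
    return flipped / correct_before if correct_before else 0.0


def swap_middle_chars(text: str) -> str:
    """Crude typo: swap two adjacent characters near the middle of the string."""
    if len(text) < 4:
        return text
    i = len(text) // 2
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; brittle on purpose."""
    return "Paris" if "capital of France" in prompt else "unknown"


CASES = [
    ("What is the capital of France?", "Paris"),
    ("Name the capital of France.", "Paris"),
]

if __name__ == "__main__":
    print("typo flip rate:", flip_rate(toy_model, CASES, swap_middle_chars))
    print("whitespace flip rate:", flip_rate(toy_model, CASES, lambda p: p + "  "))
```

With a real model, `runs` would be chosen large enough that the majority answer is stable on unperturbed input; otherwise natural variation is miscounted as a flip.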

Privacy

Privacy covers protection of personal and sensitive data across the system. Two GenAI-specific concerns dominate: (1) PII leakage — the model emitting names, emails, identifiers, or health data in its output, whether echoed from the prompt or produced from memory; and (2) training-data memorization / extraction — an attacker prompting the model to regurgitate verbatim secrets or copyrighted text it saw in training. Data-protection obligations such as GDPR also govern how prompts and logs are stored and retained.

How to measure: scan inputs and outputs with PII detectors or named-entity recognition and count leaked entities; run extraction attacks that prompt for memorized data and measure verbatim reproduction; and verify that redaction, anonymization, and data-retention controls actually work. A canary test, which inserts a unique marker into the training data and then checks whether it can be extracted, quantifies memorization directly; membership inference is a related technique that asks whether a specific record can be identified as having been in the training set.
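A rough sketch of the counting side of these checks is shown below. The two regular expressions are illustrative assumptions only (a production check would rely on a dedicated PII detector or NER model), and the canary string is a made-up marker assumed to have been planted in training data beforehand.

```python
import re
from typing import Dict, Iterable

# Illustrative patterns only; they will miss many real-world formats.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def count_pii(text: str) -> Dict[str, int]:
    """Count pattern matches per PII type in a single model output."""
    return {name: len(pattern.findall(text)) for name, pattern in PII_PATTERNS.items()}


def extraction_rate(outputs: Iterable[str], canary: str) -> float:
    """Fraction of outputs that reproduce the planted canary verbatim."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(canary in output for output in outputs) / len(outputs)


if __name__ == "__main__":
    sample = "Reach Jane at jane.doe@example.com or +1 555 010 9999."
    print(count_pii(sample))
    # Hypothetical canary and attack outputs, for illustration.
    print(extraction_rate(["... CANARY-7f3a ...", "nothing here"], "CANARY-7f3a"))
```

The same `count_pii` call can be applied to redacted output to confirm that a redaction step actually brings the counts to zero.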

Latency

Latency is how quickly the system responds — a direct driver of user experience. For streaming LLMs it has internal structure: time-to-first-token (TTFT), the delay before the first token appears; inter-token latency / tokens-per-second, the streaming speed once generation starts; and total (end-to-end) latency for the full response. Throughput is the system-level view: requests or tokens served per second under concurrency.
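To make the three timings concrete, the sketch below times any iterator of tokens. It is an assumption-laden illustration: `fake_stream` is a hypothetical stand-in for a streaming API response, and in a real client the clock should start at the moment the request is sent.

```python
import time
from typing import Dict, Iterable, Iterator, Optional


def time_stream(tokens: Iterable[str]) -> Dict[str, Optional[float]]:
    """Consume a token stream and record TTFT, total latency, and streaming speed."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start  # delay before the first token
        count += 1
    total = time.perf_counter() - start
    # Streaming speed is measured after the first token has arrived.
    decode_time = total - ttft if ttft is not None else 0.0
    tokens_per_s = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"ttft_s": ttft, "total_s": total, "tokens": count, "tokens_per_s": tokens_per_s}


def fake_stream(n_tokens: int = 3, first_delay: float = 0.02, gap: float = 0.005) -> Iterator[str]:
    """Hypothetical stream: a startup delay, then tokens at a fixed interval."""
    time.sleep(first_delay)
    for i in range(n_tokens):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


if __name__ == "__main__":
    print(time_stream(fake_stream()))
```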

How to measure: benchmark under representative load and report percentiles (p50, p95, p99) rather than averages, because tail latency drives perceived slowness. Measure TTFT and total latency separately, and test across different prompt and response lengths and concurrency levels, since latency grows with token count and with contention.
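The difference between an average and a tail percentile is easy to see on a small sample. The sketch below uses the nearest-rank percentile definition (one of several common definitions, chosen here for simplicity) and a set of made-up latency measurements.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize latencies the way a test report would: mean plus p50/p95/p99."""
    return {
        "mean": mean(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


# Made-up end-to-end latencies in seconds; one slow outlier.
LATENCIES = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.0, 6.0]

if __name__ == "__main__":
    print(latency_report(LATENCIES))
```

On this sample the mean is about 1.7 s and the median is 1.2 s, while p95 is 6.0 s: the single slow request is invisible in the average but dominates the tail. The same report would be produced separately for TTFT and for each prompt-length and concurrency bucket.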

Cost

Cost is a genuine testing and operational constraint for GenAI because hosted inference is typically billed per token. Most APIs price input and output tokens separately (output usually costs more), so long prompts, large retrieved contexts, and verbose answers all raise the bill. Cost-per-request and cost-per-user-session therefore become non-functional requirements that sit alongside latency.

How to measure: track tokens per request (input + output) and multiply by the model's per-token price to get cost-per-request; aggregate to cost-per-session and to a projected monthly cost at expected volume. Then compare configurations — smaller model, shorter context, prompt caching, tighter output limits — against quality to find the quality-per-dollar trade-off.
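The arithmetic can be written out as a short sketch. All prices, token counts, and volumes below are hypothetical placeholders, not any provider's actual rates.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens, in an arbitrary currency unit (hypothetical values)."""
    input_per_million: float
    output_per_million: float


def cost_per_request(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Input and output tokens are priced separately, then summed."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def cost_per_session(requests: Iterable[Tuple[int, int]], pricing: Pricing) -> float:
    """Sum request costs over one session, given (input_tokens, output_tokens) pairs."""
    return sum(cost_per_request(i, o, pricing) for i, o in requests)


def projected_monthly_cost(session_cost: float, sessions_per_month: int) -> float:
    return session_cost * sessions_per_month


if __name__ == "__main__":
    pricing = Pricing(input_per_million=3.0, output_per_million=15.0)  # assumed prices
    session = [(2000, 500)] * 4  # four requests of 2,000 input and 500 output tokens
    per_session = cost_per_session(session, pricing)
    print(per_session, projected_monthly_cost(per_session, 100_000))
```

Under these assumed numbers a request costs 0.0135, a four-request session 0.054, and 100,000 sessions a month about 5,400. Rerunning the same calculation for a smaller model or a shorter context gives the cost side of the quality-per-dollar comparison.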

Non-functional attribute table

Attribute | Definition | Key metrics / how to measure
Robustness | Stable quality under input perturbation / adversarial input | Flip rate under perturbation, output variance across paraphrases, injection-attack success; metamorphic/invariance tests
Privacy | No PII leakage or training-data extraction; data protected | PII-detector counts, extraction/canary attacks, redaction & retention checks
Latency | Speed of response | TTFT, inter-token / tokens-per-sec, end-to-end latency, throughput; report p50/p95/p99 under load
Cost | Per-token inference expense as a constraint | Tokens/request × price = cost/request; cost/session; quality-per-dollar comparison

Trade-offs and SLAs

These four attributes rarely move independently. A larger model may improve factuality but worsen latency and cost; aggressive output truncation cuts cost but can hurt coherence; heavy input sanitization improves privacy but adds latency. Testers therefore express them as an SLA-style budget — for example "p95 latency < 3 s, cost-per-session < a set ceiling, zero PII in output" — and check that a change which improves one attribute does not silently break another.
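An SLA-style budget of this kind can be checked mechanically on every change. The sketch below is one possible shape for such a gate; the cost ceiling and the measured values are invented for illustration, and the dictionary keys are assumed names rather than any standard schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Budget:
    max_p95_latency_s: float = 3.0      # "p95 latency < 3 s"
    max_cost_per_session: float = 0.10  # hypothetical ceiling
    max_pii_entities: int = 0           # "zero PII in output"


def check_budget(measured: Dict[str, float], budget: Budget) -> List[str]:
    """Return one message per violated threshold; an empty list means the budget holds."""
    violations = []
    if measured["p95_latency_s"] >= budget.max_p95_latency_s:
        violations.append(
            f"p95 latency {measured['p95_latency_s']} s is not below {budget.max_p95_latency_s} s"
        )
    if measured["cost_per_session"] >= budget.max_cost_per_session:
        violations.append(
            f"cost per session {measured['cost_per_session']} is not below {budget.max_cost_per_session}"
        )
    if measured["pii_entities"] > budget.max_pii_entities:
        violations.append(f"{measured['pii_entities']} PII entities found in output")
    return violations


if __name__ == "__main__":
    # Invented measurements: the larger model is assumed to be slower and more expensive.
    baseline = {"p95_latency_s": 2.4, "cost_per_session": 0.06, "pii_entities": 0}
    larger_model = {"p95_latency_s": 3.6, "cost_per_session": 0.14, "pii_entities": 0}
    for name, run in [("baseline", baseline), ("larger model", larger_model)]:
        print(name, check_budget(run, Budget()) or "within budget")
```

Checking all thresholds together, rather than one per test suite, is what exposes a change that improves one attribute while breaking another.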

Exam tips: report latency as percentiles under load, never as a single average; remember that input and output tokens are priced separately; and note that robustness is about stability, so it is tested by changing the input and checking whether the output holds.

Test Your Knowledge

"Time-to-first-token" (TTFT) is a sub-metric of which quality attribute?


Which statement about cost as a GenAI quality attribute is correct?


A model returns the correct answer, but simply paraphrasing the question or adding a typo flips it to an incorrect answer. Which attribute is weak, and how is it best tested?
