4.4 Training, testing & release readiness

Key Takeaways

Release readiness is multi-dimensional: performance metrics (accuracy, precision, recall, F1, AUC) must be reported disaggregated by subgroup, not just as a headline aggregate.
Common fairness definitions (demographic parity, equal opportunity, equalized odds, predictive parity) cannot in general all be satisfied at once, so choosing one is a documented governance decision tied to which error causes the least harm.
Disciplined train/validation/test splits are essential; a large gap between training and test performance signals overfitting, and data leakage produces the same false confidence.
Robustness testing and red-teaming (structured adversarial probing) are expected for high-capability and generative models; bias mitigation can be pre-processing, in-processing, or post-processing and must be re-measured after application.
Model cards document intended use, data, disaggregated metrics, and limitations; a formal sign-off gate (go / go-with-conditions / no-go) precedes deployment and hands off to post-deployment monitoring against the recorded baseline.

Last updated: July 2026

From Trained Model to Release Decision

The final stage of governed development turns a trained model into a defensible release decision. It combines rigorous evaluation — accuracy, fairness, and robustness — with documentation and a formal sign-off gate. AIGP candidates should understand that a model posting a strong headline accuracy number is not automatically release-ready; readiness is a multi-dimensional judgment that is recorded and approved by an accountable owner.

Measuring Accuracy and Fairness

Aggregate accuracy hides the failures governance cares about, so evaluation must be disaggregated. Standard performance metrics (accuracy, precision, recall, F1, and AUC) should be reported both overall and per subgroup, because a model can post 95% aggregate accuracy while failing badly for a minority group. Fairness metrics then quantify the disparity between groups. Candidates should know that the common definitions cannot, in general, all be satisfied at once, so the choice among them is a documented governance decision tied to the use case:

Demographic (statistical) parity: the rate of positive predictions is equal across groups.
Equal opportunity: equal true-positive rates (equivalently, equal false-negative rates) across groups.
Equalized odds: equal true-positive and false-positive rates across groups.
Predictive parity: equal precision (positive predictive value) across groups.
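The four definitions above can all be read off a per-group confusion matrix. The following is a minimal sketch, assuming binary labels and predictions and a single group attribute; the function name `group_rates` and the toy data are illustrative and not part of any particular library.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Return per-group selection rate, TPR, FPR, and PPV for binary labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not; "p"/"n" = predicted positive or negative
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    report = {}
    for g, c in counts.items():
        n = sum(c.values())
        report[g] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),  # compared for demographic parity
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),       # compared for equal opportunity
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),       # with TPR, compared for equalized odds
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),       # compared for predictive parity
        }
    return report

# Toy data (illustrative only): four examples in each of two groups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A"] * 4 + ["B"] * 4
report = group_rates(y_true, y_pred, groups)
```

On this toy data, group A has a selection rate of 0.75 and a true-positive rate of 1.0, while group B has 0.25 and 0.5, so the model fails both demographic parity and equal opportunity even though its overall accuracy is 75%.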

Because these criteria cannot all hold at once when groups differ in base rates and the model is imperfect (the "impossibility" results in fairness research), governance selects the metric whose error profile causes the least harm in context. For example, where a missed positive is the graver harm, the team prioritizes equal false-negative rates across groups.
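One of these tensions can be checked with a few lines of arithmetic. By Bayes' rule, precision is determined by the true-positive rate, the false-positive rate, and the group's base rate, so two groups with identical error rates but different base rates end up with different precision. The sketch below uses made-up rates purely for illustration.

```python
def ppv(tpr, fpr, base_rate):
    """Precision implied by TPR, FPR, and the share of true positives in the group."""
    true_pos = tpr * base_rate            # P(predicted positive and actually positive)
    false_pos = fpr * (1.0 - base_rate)   # P(predicted positive and actually negative)
    return true_pos / (true_pos + false_pos)

# Same TPR and FPR in both groups, so equalized odds holds by construction.
ppv_a = ppv(tpr=0.8, fpr=0.1, base_rate=0.5)  # about 0.889
ppv_b = ppv(tpr=0.8, fpr=0.1, base_rate=0.2)  # about 0.667
```

Equalizing precision between the two groups would require changing one group's error rates, which breaks equalized odds. That is why the choice of metric has to be made and documented rather than avoided.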

Validation, Testing, and Overfitting

Sound evaluation depends on disciplined data splits. The training set fits the model, the validation set tunes hyperparameters and guides model selection, and a held-out test set — untouched until the end — estimates real-world performance. Confusing these roles produces optimistic, misleading results. Overfitting, where a model memorizes training data and generalizes poorly, is detected when training performance far exceeds validation and test performance; remedies include regularization, more (or more representative) data, and simpler models. Data leakage — test information contaminating training — produces the same false confidence and must be actively guarded against.
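A minimal sketch of these checks, assuming each record has a unique identifier: the split fractions, the 0.05 gap threshold, and the helper names are illustrative assumptions, not recommended values.

```python
import random

def split_ids(ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then carve out disjoint test, validation, and training sets."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

def leaked_ids(train, test):
    """Identifiers that appear in both sets; any overlap is a leak."""
    return set(train) & set(test)

def overfit_gap(train_score, test_score, max_gap=0.05):
    """Return the train-minus-test gap and whether it exceeds the (assumed) threshold."""
    gap = train_score - test_score
    return gap, gap > max_gap

train, val, test = split_ids(range(100))
overlap = leaked_ids(train, test)          # empty set for a clean split
gap, flagged = overfit_gap(0.98, 0.71)     # large gap, flagged as likely overfitting
```

An identifier overlap check catches only the simplest form of leakage. It does not detect near-duplicate records under different identifiers, temporal leakage, or features derived from the target, which need their own checks.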

Robustness, Red-Teaming, and Bias Mitigation

Beyond average-case accuracy, release readiness requires stress testing. Robustness and adversarial testing probe how the model behaves under distribution shift, noisy inputs, and deliberate attack (adversarial examples, jailbreaks, and prompt injection for generative systems). Red-teaming — structured adversarial probing by people trying to make the system fail or produce harmful output — has become a central expectation for high-capability and generative models, and is referenced in the EU AI Act's obligations for general-purpose AI models with systemic risk. Where testing surfaces unacceptable bias, teams apply mitigation at one of three stages:

Stage	Technique	Example
Pre-processing	Fix the data before training	Reweighting or resampling under-represented groups
In-processing	Constrain the learning objective	Fairness constraints or adversarial debiasing during training
Post-processing	Adjust outputs after training	Group-specific decision thresholds
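As an illustration of the pre-processing row, one simple reweighting scheme gives each example a weight inversely proportional to the size of its group, so every group contributes the same total weight during training. This is a sketch of one possible scheme, not the only one; other approaches weight by group and label jointly.

```python
from collections import Counter

def balancing_weights(groups):
    """Per-example weights such that each group's weights sum to the same total."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    # weight = n / (k * group size); weights sum to n overall
    return [n / (k * counts[g]) for g in groups]

groups = ["A"] * 8 + ["B"] * 2
weights = balancing_weights(groups)   # 0.625 for each A example, 2.5 for each B example
```

Here both groups end up with a total weight of 5.0. The resulting list would be passed to a training routine that accepts per-example weights.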

Mitigation is iterative: after applying a technique, the team re-measures both fairness and accuracy to confirm the trade-off is acceptable and no new harm was introduced. A common failure is to apply a mitigation once, observe an improved fairness number, and stop — without checking whether accuracy collapsed for another group or whether the gain merely shifted harm elsewhere.
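The re-measurement step can be sketched as a comparison of per-group metrics before and after mitigation. The metric values, the 0.02 tolerance, and the function name below are hypothetical; the point is that every group and every metric is checked, not just the fairness number that motivated the change.

```python
def regressions(before, after, tolerance=0.02):
    """List (group, metric, old, new) where a higher-is-better metric dropped by more than tolerance."""
    found = []
    for group, metrics in before.items():
        for name, old in metrics.items():
            new = after[group][name]
            if old - new > tolerance:
                found.append((group, name, old, new))
    return found

# Hypothetical measurements: the TPR gap shrinks from 0.20 to 0.01 ...
before = {"A": {"accuracy": 0.92, "tpr": 0.90}, "B": {"accuracy": 0.85, "tpr": 0.70}}
after = {"A": {"accuracy": 0.86, "tpr": 0.84}, "B": {"accuracy": 0.86, "tpr": 0.83}}

# ... but group A's accuracy and TPR both fall by 0.06, which the check surfaces.
found = regressions(before, after)
```

Whether a regression of that size is an acceptable price for the narrower gap is a governance judgment; the code only makes sure the trade-off is visible.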

Model Documentation and Release Sign-off

Documentation is the connective tissue that makes a release auditable. The reference artifact is the model card (Mitchell et al.), a short document reporting the model's intended use, out-of-scope uses, training and evaluation data, disaggregated performance and fairness metrics, ethical considerations, and known limitations. Model cards pair with the datasheets from the data-governance stage to document the full pipeline and directly support the EU AI Act's technical-documentation requirements for high-risk systems.

Release readiness is then formalized as a sign-off gate — a governance checkpoint at which an accountable owner (often supported by a review board) decides go, go-with-conditions, or no-go. A release-readiness checklist typically confirms that:

The system meets its accuracy, fairness, and robustness acceptance criteria.
Red-teaming and security testing found no unresolved critical issues.
Human-oversight mechanisms are implemented and tested.
Model cards, datasheets, and the risk assessment are complete.
Legal and compliance obligations (for example, conformity assessment and transparency notices) are satisfied.
Monitoring, incident-response, and rollback plans exist for post-deployment.
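The checklist above can be recorded as structured data so that the gate decision is reproducible. The sketch below rests on a loudly stated assumption: the mapping from failed checks to go, go-with-conditions, or no-go, and the `blocking` flag itself, are hypothetical. Each organization defines its own rule, and the accountable owner still makes the final call.

```python
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    passed: bool
    blocking: bool = True  # hypothetical flag: a failed blocking check forces no-go

def release_decision(checks):
    """Return (decision, names of failed checks) under the assumed rule."""
    failed = [c for c in checks if not c.passed]
    names = [c.name for c in failed]
    if any(c.blocking for c in failed):
        return "no-go", names
    if failed:
        return "go-with-conditions", names
    return "go", []

checks = [
    Check("acceptance criteria met", True),
    Check("no unresolved critical red-team or security findings", True),
    Check("human oversight implemented and tested", True),
    Check("model card, datasheets, risk assessment complete", False, blocking=False),
    Check("legal and compliance obligations satisfied", True),
    Check("monitoring, incident-response, rollback plans exist", True),
]
decision, open_items = release_decision(checks)
```

With one non-blocking item open, this example returns go-with-conditions along with the item that must be closed. Treating documentation as non-blocking here is purely for illustration.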

A useful discipline is to separate two questions the sign-off must answer. Verification asks "did we build the system right?": does it meet the technical acceptance criteria and specifications? Validation asks "did we build the right system?": does it actually serve the intended purpose safely for real users in the deployment context? A model can pass verification (high test accuracy) yet fail validation (unfit for the population it will face), so release readiness requires both.

Sign-off should also weigh residual risk. No system is risk-free, and the accountable owner is explicitly accepting the risks that remain after mitigation, which is why documenting known limitations in the model card is not optional. For EU high-risk systems, several release artifacts are legally load-bearing (the technical documentation, the declaration of conformity, CE marking, and registration in the EU database), so the sign-off gate is also the point at which legal compliance is confirmed rather than assumed.

Crucially, sign-off is not the end of governance but a hand-off to deployment-stage monitoring: models drift, data changes, and new attacks emerge, so the release record establishes the baseline against which post-deployment performance will be measured and re-approved. This closes the development loop — from use-case definition, through design, data, and testing — with a documented, accountable decision to deploy.

Test Your Knowledge

A model reports 98% accuracy on the data it was trained on but only 71% on a held-out test set. This gap most strongly indicates which problem?

Demographic parity has been achieved

The training data was poisoned by an attacker

A prompt-injection vulnerability in the inference API

Overfitting

Test Your Knowledge

During evaluation, a team finds it cannot simultaneously satisfy demographic parity, equalized odds, and predictive parity. What is the correct governance response?

Select the fairness metric whose error profile causes the least harm in context and document the rationale

Abandon fairness testing altogether because the metrics conflict

Report only aggregate accuracy to stakeholders and move on

Delete the protected attribute so the conflict disappears

