4.4 Training, testing & release readiness

Key Takeaways

  • Release readiness is multi-dimensional: performance metrics (accuracy, precision, recall, F1, AUC) must be reported disaggregated by subgroup, not just as a headline aggregate.
  • Common fairness definitions (demographic parity, equal opportunity, equalized odds, predictive parity) generally cannot all be satisfied at once when base rates differ across groups, so choosing one is a documented governance decision about which errors cause the most harm in context.
  • Disciplined train/validation/test splits are essential; a large gap between training and test performance signals overfitting, and data leakage produces the same false confidence.
  • Robustness testing and red-teaming (structured adversarial probing) are expected for high-capability and generative models; bias mitigation can be pre-processing, in-processing, or post-processing and must be re-measured after application.
  • Model cards document intended use, data, disaggregated metrics, and limitations; a formal sign-off gate (go / go-with-conditions / no-go) precedes deployment and hands off to post-deployment monitoring against the recorded baseline.
Last updated: July 2026

From Trained Model to Release Decision

The final stage of governed development turns a trained model into a defensible release decision. It combines rigorous evaluation — accuracy, fairness, and robustness — with documentation and a formal sign-off gate. AIGP candidates should understand that a model posting a strong headline accuracy number is not automatically release-ready; readiness is a multi-dimensional judgment that is recorded and approved by an accountable owner.

Measuring Accuracy and Fairness

Aggregate accuracy hides the failures governance cares about, so evaluation must be disaggregated. Standard performance metrics — accuracy, precision, recall, F1, and AUC — should be reported both overall and per subgroup, because a model can post 95% aggregate accuracy while failing badly for a minority group. Fairness metrics then quantify disparity, and candidates should know that the common definitions are mutually incompatible — you generally cannot satisfy all at once, so the choice is a documented governance decision tied to the use case:

  • Demographic (statistical) parity — positive outcomes are distributed equally across groups.
  • Equal opportunity — equal true-positive rates across groups.
  • Equalized odds — equal true-positive and false-positive rates across groups.
  • Predictive parity — equal precision (positive predictive value) across groups.

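Because these criteria conflict (the "impossibility" results in fairness research show that, when base rates differ between groups, an imperfect classifier cannot equalize both error rates and precision), governance selects the metric whose error profile causes the least harm in context. For example, where a missed positive is the graver harm, the priority is equal false-negative rates across groups, which is the same requirement as equal opportunity.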
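The four definitions above can be computed side by side from a confusion matrix per group. The following is a minimal sketch on made-up toy data, not a reference implementation; the function names `group_metrics` and `make_group` and the group labels "A" and "B" are illustrative assumptions. The toy data is built so that equalized odds holds while demographic parity and predictive parity do not, because the two groups have different base rates.

```python
from collections import defaultdict

def group_metrics(y_true, y_pred, groups):
    """Return per-group rates used by the common fairness definitions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    report = {}
    for g, c in counts.items():
        n = sum(c.values())
        report[g] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),   # demographic parity
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),        # equal opportunity
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),        # with TPR: equalized odds
            "precision": ratio(c["tp"], c["tp"] + c["fp"]),  # predictive parity
            "accuracy": ratio(c["tp"] + c["tn"], n),
        }
    return report

def make_group(name, tp, fn, fp, tn):
    """Build toy labels and predictions with the given confusion counts."""
    y_true = [1] * (tp + fn) + [0] * (fp + tn)
    y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn
    return y_true, y_pred, [name] * len(y_true)

# Illustrative data only: group A has a higher base rate than group B.
ta, pa, ga = make_group("A", tp=6, fn=2, fp=1, tn=3)
tb, pb, gb = make_group("B", tp=3, fn=1, fp=2, tn=6)
report = group_metrics(ta + tb, pa + pb, ga + gb)

for name, metrics in sorted(report.items()):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

In this toy case both groups have a true-positive rate of 0.75, a false-positive rate of 0.25, and an accuracy of 0.75, yet precision is about 0.857 for group A and 0.6 for group B, and the selection rates differ as well. An aggregate report would show none of that.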

Validation, Testing, and Overfitting

Sound evaluation depends on disciplined data splits. The training set fits the model, the validation set tunes hyperparameters and guides model selection, and a held-out test set — untouched until the end — estimates real-world performance. Confusing these roles produces optimistic, misleading results. Overfitting, where a model memorizes training data and generalizes poorly, is detected when training performance far exceeds validation and test performance; remedies include regularization, more (or more representative) data, and simpler models. Data leakage — test information contaminating training — produces the same false confidence and must be actively guarded against.
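The split discipline and the two warning signs described above can be sketched in a few lines. This is a hedged illustration using only the standard library; the function names, the 70/15/15 proportions, and the 0.05 gap threshold are assumptions chosen for the example, not recommended values.

```python
import random

def three_way_split(record_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle unique record IDs and cut them into train/validation/test."""
    ids = sorted(set(record_ids))  # de-duplicate so one record cannot land in two splits
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_frac))
    n_val = int(round(len(ids) * val_frac))
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

def leaked_ids(train, val, test):
    """IDs that appear in more than one split (should be empty)."""
    train_s, val_s, test_s = set(train), set(val), set(test)
    return (train_s & val_s) | (train_s & test_s) | (val_s & test_s)

def overfitting_gap(train_score, heldout_score, max_gap=0.05):
    """Return the train-minus-held-out gap and whether it exceeds max_gap."""
    gap = train_score - heldout_score
    return gap, gap > max_gap

train, val, test = three_way_split(range(100))
overlap = leaked_ids(train, val, test)
gap, flagged = overfitting_gap(train_score=0.98, heldout_score=0.71)
print(len(train), len(val), len(test), overlap, round(gap, 2), flagged)
```

An ID-overlap check only catches the simplest form of leakage. Leakage through features derived from the target, or through near-duplicate records with different IDs, needs separate review.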

Robustness, Red-Teaming, and Bias Mitigation

Beyond average-case accuracy, release readiness requires stress testing. Robustness and adversarial testing probe how the model behaves under distribution shift, noisy inputs, and deliberate attack (adversarial examples, jailbreaks, and prompt injection for generative systems). Red-teaming (structured adversarial probing by people trying to make the system fail or produce harmful output) has become a central expectation for high-capability and generative models, and adversarial testing is among the EU AI Act's obligations for providers of general-purpose AI models with systemic risk. Where testing surfaces unacceptable bias, teams apply mitigation at one of three stages:

Stage           | Technique                        | Example
----------------|----------------------------------|--------------------------------------------------------------
Pre-processing  | Fix the data before training     | Reweighting or resampling under-represented groups
In-processing   | Constrain the learning objective | Fairness constraints or adversarial debiasing during training
Post-processing | Adjust outputs after training    | Group-specific decision thresholds
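As an illustration of the pre-processing stage, one common reweighting scheme gives each training example the weight P(group) × P(label) / P(group, label), so that group membership and label look statistically independent in the weighted data. The sketch below assumes that scheme; the function name `reweighing_weights` and the toy data are hypothetical.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight each example by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

# Toy data: group A is mostly labelled positive, group B mostly negative.
groups = ["A"] * 6 + ["B"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]
weights = reweighing_weights(groups, labels)

def weighted_positive_rate(group):
    """Share of a group's total weight that sits on positive labels."""
    total = sum(w for w, g in zip(weights, groups) if g == group)
    positive = sum(w for w, g, y in zip(weights, groups, labels) if g == group and y == 1)
    return positive / total

print(round(weighted_positive_rate("A"), 3), round(weighted_positive_rate("B"), 3))
```

Before weighting, the positive rate is 4/6 for group A and 1/4 for group B; after weighting, both come out at 0.5. The weights would then be passed to a training routine that accepts per-sample weights, and the resulting model still has to be re-evaluated.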

Mitigation is iterative: after applying a technique, the team re-measures both fairness and accuracy to confirm the trade-off is acceptable and no new harm was introduced. A common failure is to apply a mitigation once, observe an improved fairness number, and stop — without checking whether accuracy collapsed for another group or whether the gain merely shifted harm elsewhere.
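The re-measurement loop can be sketched for a post-processing mitigation. The example below applies a group-specific threshold and then compares fairness and accuracy before and after. The scores, thresholds, helper names, and the 0.02 accuracy tolerance are all illustrative assumptions, and the sketch assumes every group contains at least one positive example.

```python
def evaluate(scores, y_true, groups, thresholds):
    """Per-group TPR and accuracy for the given group-specific thresholds."""
    stats = {}
    for s, t, g in zip(scores, y_true, groups):
        pred = 1 if s >= thresholds[g] else 0
        st = stats.setdefault(g, {"tp": 0, "pos": 0, "correct": 0, "n": 0})
        st["n"] += 1
        st["correct"] += int(pred == t)
        st["pos"] += t
        st["tp"] += int(pred == 1 and t == 1)
    return {
        g: {"tpr": st["tp"] / st["pos"], "accuracy": st["correct"] / st["n"]}
        for g, st in stats.items()
    }

def tpr_gap(report):
    """Largest difference in true-positive rate between any two groups."""
    rates = [m["tpr"] for m in report.values()]
    return max(rates) - min(rates)

def accuracy_drops(before, after, max_drop=0.02):
    """Groups whose accuracy fell by more than max_drop after mitigation."""
    return {g for g in before if before[g]["accuracy"] - after[g]["accuracy"] > max_drop}

# Toy scores: the model systematically under-scores positives in group B.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1,
          0.7, 0.45, 0.4, 0.35, 0.42, 0.3, 0.2, 0.1]
y_true = [1, 1, 1, 1, 0, 0, 0, 0] * 2
groups = ["A"] * 8 + ["B"] * 8

before = evaluate(scores, y_true, groups, {"A": 0.5, "B": 0.5})
after = evaluate(scores, y_true, groups, {"A": 0.5, "B": 0.35})
print("TPR gap:", tpr_gap(before), "->", tpr_gap(after))
print("Groups with accuracy drop:", accuracy_drops(before, after))
```

Here the true-positive-rate gap closes from 0.75 to 0 and no group loses accuracy, but lowering the threshold for group B also turns one of its negatives into a false positive. That kind of side effect is exactly what the second measurement is meant to surface.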

Model Documentation and Release Sign-off

Documentation is the connective tissue that makes a release auditable. The reference artifact is the model card (Mitchell et al.), a short document reporting the model's intended use, out-of-scope uses, training and evaluation data, disaggregated performance and fairness metrics, ethical considerations, and known limitations. Model cards pair with the datasheets from the data-governance stage to document the full pipeline and directly support the EU AI Act's technical-documentation requirements for high-risk systems.
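The sections of a model card listed above lend themselves to a structured record that can be checked for completeness before review. The following is a minimal sketch, assuming a hypothetical `ModelCard` dataclass whose field names mirror those sections; it is not a standard schema, and the model name is invented.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ModelCard:
    # Field names are illustrative; they follow the sections named in the text.
    model_name: str
    intended_use: str = ""
    out_of_scope_uses: list = field(default_factory=list)
    training_data: str = ""
    evaluation_data: str = ""
    metrics_by_group: dict = field(default_factory=dict)
    ethical_considerations: str = ""
    known_limitations: list = field(default_factory=list)

def missing_sections(card):
    """Names of sections that are still empty."""
    return [name for name, value in asdict(card).items() if not value]

card = ModelCard(
    model_name="example-classifier",  # hypothetical model
    intended_use="Illustrative entry only",
    metrics_by_group={"A": {"tpr": 0.75}, "B": {"tpr": 0.75}},
)
gaps = missing_sections(card)
print(gaps)
```

A completeness check of this kind only confirms that each section has content. Whether the content is accurate and sufficient remains a review task.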

Release readiness is then formalized as a sign-off gate — a governance checkpoint at which an accountable owner (often supported by a review board) decides go, go-with-conditions, or no-go. A release-readiness checklist typically confirms that:

  • The system meets its accuracy, fairness, and robustness acceptance criteria.
  • Red-teaming and security testing found no unresolved critical issues.
  • Human-oversight mechanisms are implemented and tested.
  • Model cards, datasheets, and the risk assessment are complete.
  • Legal and compliance obligations (for example, conformity assessment and transparency notices) are satisfied.
  • Monitoring, incident-response, and rollback plans exist for post-deployment.
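A checklist like this can be recorded as data so that the gate produces a traceable recommendation. The sketch below is one possible encoding, not a prescribed process: the split into blocking and non-blocking items and the mapping to go / go-with-conditions / no-go are assumptions made for illustration, and the final decision still rests with the accountable owner.

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    GO = "go"
    GO_WITH_CONDITIONS = "go-with-conditions"
    NO_GO = "no-go"

@dataclass
class ChecklistItem:
    name: str
    passed: bool
    blocking: bool = True  # hypothetical flag: a failed blocking item forces no-go
    condition: str = ""    # remediation attached if a non-blocking item fails

def recommend(items):
    """Return a recommended decision plus the reasons or conditions behind it."""
    failed = [i for i in items if not i.passed]
    if any(i.blocking for i in failed):
        return Decision.NO_GO, [i.name for i in failed if i.blocking]
    if failed:
        return Decision.GO_WITH_CONDITIONS, [i.condition or i.name for i in failed]
    return Decision.GO, []

checklist = [
    ChecklistItem("Acceptance criteria met", passed=True),
    ChecklistItem("No unresolved critical red-team findings", passed=True),
    ChecklistItem("Human oversight implemented and tested", passed=True),
    ChecklistItem("Documentation complete", passed=False, blocking=False,
                  condition="Finish datasheet before launch"),
]
decision, notes = recommend(checklist)
print(decision.value, notes)
```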

A useful discipline is to separate two questions the sign-off must answer. Verification asks "did we build the system right?", that is, does it meet the technical acceptance criteria and specifications? Validation asks "did we build the right system?", that is, does it actually serve the intended purpose safely for real users in the deployment context? A model can pass verification (high test accuracy) yet fail validation (unfit for the population it will face), so release readiness requires both.

Sign-off should also weigh residual risk: no system is risk-free, and the accountable owner is explicitly accepting the risks that remain after mitigation, which is why documenting known limitations in the model card is not optional.

For EU high-risk systems, several release artifacts are legally load-bearing (the technical documentation, the declaration of conformity, CE marking, and registration in the EU database), so the sign-off gate is also the point at which legal compliance is confirmed rather than assumed.

Crucially, sign-off is not the end of governance but a hand-off to deployment-stage monitoring: models drift, data changes, and new attacks emerge, so the release record establishes the baseline against which post-deployment performance will be measured and re-approved. This closes the development loop — from use-case definition, through design, data, and testing — with a documented, accountable decision to deploy.
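Measuring against the recorded baseline can be sketched as a simple comparison. The example assumes higher-is-better metrics, a hypothetical tolerance of 0.03, and invented metric names; real monitoring would also need statistical tests and alert routing that are out of scope here.

```python
def drift_alerts(baseline, current, tolerance=0.03):
    """Flag metrics that fell more than `tolerance` below the release baseline."""
    alerts = {}
    for metric, base_value in baseline.items():
        if metric not in current:
            alerts[metric] = "missing from current measurement"
            continue
        drop = base_value - current[metric]
        if drop > tolerance:
            alerts[metric] = f"dropped {drop:.3f} below baseline"
    return alerts

# Illustrative numbers only.
baseline = {"accuracy": 0.91, "tpr_group_b": 0.88}
current = {"accuracy": 0.90, "tpr_group_b": 0.80}
alerts = drift_alerts(baseline, current)
print(alerts)
```

In this toy run overall accuracy stays within tolerance while the true-positive rate for one subgroup does not, which is why the baseline should record disaggregated metrics and not just the headline figure.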

Test Your Knowledge

A model reports 98% accuracy on the data it was trained on but only 71% on a held-out test set. This gap most strongly indicates which problem?

During evaluation, a team finds it cannot simultaneously satisfy demographic parity, equalized odds, and predictive parity. What is the correct governance response?
