4.3 Data governance: quality, bias & provenance

Key Takeaways

  • Data governance for AI is continuous across collection, preparation, use, and retention; representativeness is the quality dimension most tied to fairness, since under-represented subgroups get worse accuracy.
  • Bias has distinct sources — historical, sampling/selection, labeling/measurement, proxy-variable, and aggregation — and mitigation depends on naming the source precisely.
  • Removing a protected attribute does not remove discrimination when proxy variables (ZIP code, name, purchase history) reconstruct it, so bias must be tested in outcomes, not just screened out of inputs.
  • Provenance and lineage plus a datasheet for datasets (origin, collection method, composition, appropriate uses) make a dataset auditable and support the EU AI Act's data-governance and documentation duties.
  • Personal training data requires a GDPR lawful basis, purpose limitation, transparency, and operable data-subject rights; data minimization trades off against model performance, and that tension must be resolved with a documented necessity-and-proportionality judgment.
Last updated: July 2026

Data as the Root of AI Risk

Because machine-learning systems learn their behavior from data, most fairness, accuracy, and privacy failures trace back to the data pipeline. Data governance for AI therefore sits at the center of development-stage controls. It covers the quality and representativeness of training data, the many sources of bias that data can encode, the provenance and documentation that make a dataset auditable, and the legal basis for using personal data to train models. AIGP candidates should treat data governance as continuous — spanning collection, preparation, use, and retention — rather than a one-time cleanup before training.

Data Quality and Representativeness

Quality is multi-dimensional. A dataset can be accurate yet unrepresentative, complete yet stale, or large yet biased. The governance-relevant dimensions include:

  • Accuracy — values correctly reflect reality; labels are correct.
  • Completeness — required fields are populated; missing data is understood, not silently imputed.
  • Consistency — formats, units, and definitions agree across sources.
  • Timeliness — data reflects the current world, not an outdated distribution.
  • Representativeness — the sample matches the population the model will serve, across all relevant subgroups.
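
Two of these dimensions, completeness and timeliness, lend themselves to simple automated checks. The sketch below is illustrative only: the record layout, the field names (`income`, `region`, `updated`), and the two-year staleness threshold are assumptions made for the example, not values prescribed by any framework.

```python
from datetime import date

def completeness(records, required_fields):
    """Share of records holding a non-missing value for each required field."""
    total = len(records)
    rates = {}
    for field in required_fields:
        # None and the empty string both count as missing in this sketch.
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        rates[field] = present / total if total else 0.0
    return rates

def stale_share(records, date_field, as_of, max_age_days):
    """Share of dated records older than max_age_days, measured at as_of."""
    dated = [r[date_field] for r in records if r.get(date_field) is not None]
    if not dated:
        return 0.0
    stale = sum(1 for d in dated if (as_of - d).days > max_age_days)
    return stale / len(dated)

# Hypothetical records, invented for illustration.
records = [
    {"income": 52000, "region": "north", "updated": date(2025, 11, 1)},
    {"income": None, "region": "south", "updated": date(2021, 3, 15)},
    {"income": 48000, "region": "", "updated": date(2025, 6, 30)},
    {"income": 61000, "region": "north", "updated": date(2019, 1, 10)},
]

completeness_rates = completeness(records, ["income", "region"])
stale = stale_share(records, "updated", as_of=date(2026, 1, 1), max_age_days=730)
print(completeness_rates)
print(stale)
```

Reporting missingness per field, rather than imputing silently, keeps the gap visible to whoever signs off on the dataset.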

Representativeness is the dimension most tied to fairness. A model trained on data that under-represents a subgroup will typically perform worse for that subgroup, producing disparate accuracy even when no protected attribute appears in the features. Governance responses include stratified sampling, targeted data collection for under-represented groups, and per-subgroup performance testing rather than reliance on a single aggregate score.
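
Per-subgroup testing can be sketched in a few lines. The example below assumes hypothetical label, prediction, and group lists; it shows how a respectable aggregate accuracy can hide much weaker performance for a small subgroup.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return accuracy computed separately for each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical evaluation data: group B is under-represented (3 of 10 rows).
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
groups = ["A", "A", "A", "A", "A", "A", "A", "B", "B", "B"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_group = accuracy_by_group(y_true, y_pred, groups)

print(f"overall accuracy: {overall:.2f}")
for group, acc in sorted(by_group.items()):
    print(f"group {group}: {acc:.2f}")
```

In this invented data the aggregate score is 0.80, yet group A is classified perfectly while group B is correct only one time in three, which is the pattern a single aggregate score conceals.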

Sources of Bias

Bias enters through many doors, and mitigation depends on naming the source precisely:

Bias source | How it arises | Example
Historical bias | The world the data records is itself unequal | Past hiring favored one group, so labels encode it
Sampling / selection bias | The sample does not match the population | Training only on urban customers
Labeling / measurement bias | Human labels or proxies are subjective or flawed | Inconsistent "risk" labels across annotators
Proxy-variable bias | A neutral feature correlates with a protected trait | ZIP code standing in for race
Aggregation bias | One model is forced onto distinct subgroups | Same model for populations with different base rates

Proxy variables deserve emphasis: removing a protected attribute such as race or gender does not remove discrimination if correlated features (ZIP code, name, purchase history) let the model reconstruct it. This is why "fairness through unawareness" is insufficient, and why bias must be tested for in outcomes, not just screened out of inputs.
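
Testing outcomes can start with something as simple as comparing favorable-decision rates across groups. The sketch below is one possible check, not a complete fairness audit: the decisions and group labels are invented, and the group attribute is assumed to be available for testing purposes even though the model never sees it. What ratio counts as acceptable is a policy and legal judgment that the code does not settle.

```python
def selection_rates(decisions, groups):
    """Share of favorable decisions (1) within each group."""
    totals, positives = {}, {}
    for decision, group in zip(decisions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(decision == 1)
    return {g: positives[g] / totals[g] for g in totals}

def rate_ratio(rates):
    """Lowest group selection rate divided by the highest."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else float("nan")

# Hypothetical loan decisions: 1 = approved, 0 = declined.
decisions = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
groups = ["X"] * 6 + ["Y"] * 6

rates = selection_rates(decisions, groups)
ratio = rate_ratio(rates)
print(rates)
print(f"ratio of lowest to highest rate: {ratio:.2f}")
```

Here group X is approved five times in six and group Y once in six, a gap that would be invisible to a review that only confirmed the protected attribute was absent from the inputs.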

Provenance, Lineage, and Documentation

Data provenance (where data came from) and lineage (how it was transformed) make a dataset trustworthy and auditable. Governance requires recording the source of every dataset, the consent or legal basis under which it was obtained, the transformations applied, and the versions used to train each model. The reference documentation artifact is the datasheet for datasets (Gebru et al.), a standardized record answering why a dataset was created, how it was collected, who is represented, what preprocessing occurred, and what uses are appropriate or discouraged. Datasheets pair with model cards (covered in the next section) to give a complete lineage from data to deployed model, and they directly support the EU AI Act's data-governance and technical-documentation obligations for high-risk systems.
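
A datasheet can be kept as a structured record alongside the data it describes. The sketch below is a simplified illustration: the field names loosely follow the questions listed above but are assumptions for this example, not the published datasheet template, and the dataset, source, and legal-basis values are invented. A content hash ties the record to one exact version of the data.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class Datasheet:
    """Simplified datasheet record; field names are illustrative."""
    name: str
    motivation: str            # why the dataset was created
    source: str                # where the data came from
    legal_basis: str           # consent or other basis for obtaining it
    collection_method: str     # how it was collected
    composition: str           # who or what is represented
    preprocessing: list        # transformations applied, in order
    appropriate_uses: list
    discouraged_uses: list
    content_sha256: str = ""   # fingerprint of the exact rows described

def fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of the rows.

    Dictionary keys are sorted, but row order still matters, so the same
    rows in a different order produce a different fingerprint.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical dataset rows and metadata.
rows = [
    {"customer": "c-001", "tenure_months": 14, "churned": 0},
    {"customer": "c-002", "tenure_months": 3, "churned": 1},
]

sheet = Datasheet(
    name="churn-training-v1",
    motivation="Train a customer churn classifier",
    source="Internal CRM export",
    legal_basis="Legitimate interests (documented separately)",
    collection_method="Monthly batch export",
    composition="Active retail customers",
    preprocessing=["dropped free-text notes", "bucketed tenure"],
    appropriate_uses=["churn prediction"],
    discouraged_uses=["credit decisions"],
    content_sha256=fingerprint(rows),
)

print(json.dumps(asdict(sheet), indent=2))
```

Storing the fingerprint with each trained model's metadata is one way to record which dataset version trained which model.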

Lawful Basis and the Minimization Trade-off

Training data that contains personal information triggers data-protection law. Under the GDPR, controllers must identify a lawful basis (consent, legitimate interests, contract, and so on) for using personal data to train a model, honor purpose limitation (data collected for one purpose cannot be freely repurposed for AI training), and respect special-category protections for sensitive data. Transparency obligations require informing individuals that their data may be used for training, and data-subject rights (access, erasure, objection) must remain operable even once data is embedded in a pipeline.

Data minimization (collecting and retaining only what is necessary) sits in direct tension with the machine-learning intuition that more data yields better models. Good governance resolves this deliberately rather than defaulting to "collect everything." Techniques include using only features with demonstrated predictive value; applying privacy-enhancing technologies such as anonymization, pseudonymization, synthetic data, and differential privacy; and setting retention limits with defined deletion schedules. The organization should document the necessity-and-proportionality judgment: what data is used, why each element is needed, what performance is gained, and what privacy risk is accepted. That record connects data governance back to the impact assessment and forward to release readiness.
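
Two of the techniques above, feature minimization and pseudonymization, can be combined in a small preprocessing step. The sketch below is a minimal illustration under stated assumptions: the record fields and the hard-coded key are hypothetical, and a real key would be stored separately from the data. Keyed hashing is only one way to pseudonymize, and pseudonymized data is still personal data under the GDPR because it can be re-linked by anyone holding the key.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace a direct identifier with a keyed HMAC-SHA256 token."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def minimize(record: dict, needed_fields: list, id_field: str, key: bytes) -> dict:
    """Keep only the fields judged necessary, plus a pseudonymous token."""
    reduced = {f: record[f] for f in needed_fields if f in record}
    reduced["subject_token"] = pseudonymize(record[id_field], key)
    return reduced

# Hypothetical key and record, for illustration only.
key = b"example-key-kept-outside-the-dataset"
record = {
    "email": "pat@example.com",
    "full_name": "Pat Example",
    "tenure_months": 14,
    "monthly_spend": 42.5,
    "browsing_history": ["..."],
}

training_row = minimize(
    record,
    needed_fields=["tenure_months", "monthly_spend"],
    id_field="email",
    key=key,
)
print(training_row)
```

The stable token lets an erasure or objection request be matched to training rows without the pipeline holding the email address itself, which helps keep data-subject rights operable.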

Data labeling deserves its own governance attention because labels define the target the model learns to predict, and flawed labels silently cap the ceiling on fairness and accuracy. Governance covers annotator selection and training, clear labeling guidelines, inter-annotator agreement measurement, adjudication of disputed labels, and awareness that a chosen proxy label (for example, "arrested" standing in for "committed a crime," or "clicked" standing in for "found useful") may not measure the true construct of interest, a subtle but consequential source of measurement bias.

Two related governance duties round out the stage. First, third-party and web-scraped data carry provenance and licensing risk: teams must confirm they have the right to use the data for training and that it does not embed others' copyrighted or personal content unlawfully. Second, data versioning (snapshotting exactly which data trained which model) is what makes an incident later reproducible and a regulator's questions answerable. Together these practices ensure the data foundation is not just usable but defensible.
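
Returning to the labeling controls above, inter-annotator agreement can be quantified with Cohen's kappa, which corrects the raw agreement rate between two annotators for the agreement expected by chance. Kappa is one common choice among several agreement statistics; the sketch below assumes two annotators who labeled the same items, and the labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical "risk" labels from two annotators on eight items.
annotator_1 = ["high", "high", "low", "low", "high", "low", "low", "high"]
annotator_2 = ["high", "low", "low", "low", "high", "low", "high", "high"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa: {kappa:.2f}")
```

The annotators agree on six of eight items (0.75), but because half of that agreement would be expected by chance, kappa is 0.50. A low value is a cue to tighten the labeling guidelines or send disputed items to adjudication.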

Test Your Knowledge

A lending model excludes race from its input features, yet still approves loans at systematically lower rates for a protected group. What is the most likely cause?

What is the primary purpose of a 'datasheet for datasets'?

An AI team wants to collect every available field about customers because 'more data means a better model.' Which data-protection principle most directly challenges this approach?
