A lending model excludes race from its input features, yet still approves loans at systematically lower rates for a protected group. What is the most likely cause?

Proxy variables such as ZIP code that correlate with the protected attribute. This is proxy-variable bias. Even after a protected attribute is removed, correlated features like ZIP code, name, or purchase history let the model reconstruct it — which is why 'fairness through unawareness' fails and bias must be measured in outcomes, not just excluded from inputs.

What is the primary purpose of a 'datasheet for datasets'?

To document a dataset's origin, collection method, composition, and appropriate or discouraged uses. A datasheet for datasets standardizes documentation of why and how a dataset was created, who is represented, what preprocessing occurred, and what uses are appropriate. Reporting a deployed model's performance is the role of a model card, not a datasheet.

An AI team wants to collect every available field about customers because 'more data means a better model.' Which data-protection principle most directly challenges this approach?

Data minimization. Data minimization requires collecting and retaining only what is necessary, which sits in direct tension with 'collect everything.' Governance resolves the trade-off deliberately — using only predictive features, applying privacy-enhancing techniques, and documenting a necessity-and-proportionality judgment.

Data governance: quality, bias & provenance | Free Guide 2026

Data as the Root of AI Risk

Because machine-learning systems learn their behavior from data, most fairness, accuracy, and privacy failures trace back to the data pipeline. Data governance for AI therefore sits at the center of development-stage controls. It covers the quality and representativeness of training data, the many sources of bias that data can encode, the provenance and documentation that make a dataset auditable, and the legal basis for using personal data to train models. AIGP candidates should treat data governance as continuous — spanning collection, preparation, use, and retention — rather than a one-time cleanup before training.

Data Quality and Representativeness

Quality is multi-dimensional. A dataset can be accurate yet unrepresentative, complete yet stale, or large yet biased. The governance-relevant dimensions include:

Accuracy — values correctly reflect reality; labels are correct.
Completeness — required fields are populated; missing data is understood, not silently imputed.
Consistency — formats, units, and definitions agree across sources.
Timeliness — data reflects the current world, not an outdated distribution.
Representativeness — the sample matches the population the model will serve, across all relevant subgroups.

Representativeness is the dimension most tied to fairness. A model trained on data that under-represents a subgroup will typically perform worse for that subgroup, producing disparate accuracy even when no protected attribute appears in the features. Governance responses include stratified sampling, targeted data collection for under-represented groups, and per-subgroup performance testing rather than reliance on a single aggregate score.
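The per-subgroup testing described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function name `accuracy_by_group` and the toy labels, predictions, and group assignments are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy per subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Hypothetical toy data: group "B" is under-represented in the sample.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "A", "A", "A", "A", "B", "B"]

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")
```

In this toy case the aggregate score looks acceptable while every prediction for group B is wrong, which is the pattern a single aggregate metric hides.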

Sources of Bias

Bias enters through many doors, and mitigation depends on naming the source precisely:

Bias source	How it arises	Example
Historical bias	The world the data records is itself unequal	Past hiring favored one group, so labels encode it
Sampling / selection bias	The sample does not match the population	Training only on urban customers
Labeling / measurement bias	Human labels or proxies are subjective or flawed	Inconsistent "risk" labels across annotators
Proxy-variable bias	A neutral feature correlates with a protected trait	ZIP code standing in for race
Aggregation bias	One model is forced onto distinct subgroups	Same model for populations with different base rates

Proxy variables deserve emphasis: removing a protected attribute such as race or gender does not remove discrimination if correlated features (ZIP code, name, purchase history) let the model reconstruct it. This is why "fairness through unawareness" is insufficient, and why bias must be tested for in outcomes, not just screened out of inputs.
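Testing for bias in outcomes can be as simple as comparing approval rates across groups. The sketch below is one possible check, not a complete fairness audit: the helper names `approval_rates` and `rate_ratio` and the decision data are hypothetical, and the acceptable ratio is a policy choice the code does not make.

```python
from collections import Counter


def approval_rates(decisions, groups):
    """Approval rate per group; decisions are 1 (approved) or 0 (denied)."""
    approved = Counter()
    total = Counter()
    for decision, group in zip(decisions, groups):
        total[group] += 1
        approved[group] += decision
    return {g: approved[g] / total[g] for g in total}


def rate_ratio(rates):
    """Lowest group approval rate divided by the highest."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else float("nan")


# Hypothetical decisions from a model that never saw the protected attribute.
# Group membership is used only for the audit, not as a model input.
decisions = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]
groups = ["X", "X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y"]

rates = approval_rates(decisions, groups)
print(rates)
print(f"ratio of lowest to highest approval rate: {rate_ratio(rates):.2f}")
```

Note the design implication: the audit needs group membership to be available for testing even though the model itself excludes it from its features.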

Provenance, Lineage, and Documentation

Data provenance (where data came from) and lineage (how it was transformed) make a dataset trustworthy and auditable. Governance requires recording the source of every dataset, the consent or legal basis under which it was obtained, the transformations applied, and the versions used to train each model. The reference documentation artifact is the datasheet for datasets (Gebru et al.), a standardized record answering why a dataset was created, how it was collected, who is represented, what preprocessing occurred, and what uses are appropriate or discouraged. Datasheets pair with model cards (covered in the next section) to give a complete lineage from data to deployed model, and they directly support the EU AI Act's data-governance and technical-documentation obligations for high-risk systems.
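One way to make such a record concrete is a small structured object stored alongside each dataset version. The following is a sketch under stated assumptions: `DatasetRecord`, its field names, and the sample values are hypothetical and do not reproduce the datasheet template itself, and the content hash is just one simple way to tie a record to an exact snapshot.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical provenance record; field names are illustrative only."""
    name: str
    source: str
    legal_basis: str
    transformations: list = field(default_factory=list)
    intended_uses: list = field(default_factory=list)
    discouraged_uses: list = field(default_factory=list)
    content_sha256: str = ""


def fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of the rows (row order matters)."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical two-row dataset and its record.
rows = [{"id": 1, "income": 52000}, {"id": 2, "income": 61000}]
record = DatasetRecord(
    name="loan_applications_v1",
    source="internal CRM export",
    legal_basis="legitimate interests",
    transformations=["dropped free-text notes", "normalized currency"],
    intended_uses=["credit-risk model training"],
    discouraged_uses=["marketing segmentation"],
    content_sha256=fingerprint(rows),
)
print(json.dumps(asdict(record), indent=2))
```

Storing the fingerprint with each trained model's metadata lets a reviewer later confirm which snapshot was used, since any change to the rows produces a different hash.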

Lawful Basis and the Minimization Trade-off

Training data that contains personal information triggers data-protection law. Under the GDPR, controllers must identify a lawful basis (consent, legitimate interests, contract, and so on) for using personal data to train a model, honor purpose limitation (data collected for one purpose cannot be freely repurposed for AI training), and respect special-category protections for sensitive data. Transparency obligations require informing individuals that their data may be used for training, and data-subject rights (access, erasure, objection) must remain operable even once data is embedded in a pipeline.

Data minimization, meaning collecting and retaining only what is necessary, sits in direct tension with the machine-learning intuition that more data yields better models. Good governance resolves this deliberately rather than defaulting to "collect everything." Techniques include using only features with demonstrated predictive value; applying privacy-enhancing technologies such as anonymization, pseudonymization, synthetic data, and differential privacy; and setting retention limits with defined deletion schedules. The organization should document the necessity-and-proportionality judgment: what data is used, why each element is needed, what performance is gained, and what privacy risk is accepted. That record connects data governance back to the impact assessment and forward to release readiness.
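Two of these techniques, feature minimization and pseudonymization, can be illustrated together. The sketch below assumes a keyed hash (HMAC-SHA256) as the pseudonymization method, which is one common choice among several. The function names, field names, and key are hypothetical. Pseudonymized data can generally still be re-linked by whoever holds the key, so this reduces exposure without anonymizing the data.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record, allowed_fields, id_field, secret_key):
    """Keep only approved fields and swap the identifier for a pseudonym."""
    kept = {k: v for k, v in record.items() if k in allowed_fields}
    kept["subject_pseudonym"] = pseudonymize(str(record[id_field]), secret_key)
    return kept


# Hypothetical key and allow-list; a real key belongs in a secrets manager.
SECRET_KEY = b"example-key-for-illustration-only"
ALLOWED_FIELDS = {"income", "loan_amount"}

raw = {
    "customer_id": "C-1001",
    "name": "Alex Example",
    "email": "alex@example.com",
    "income": 52000,
    "loan_amount": 15000,
}
training_row = minimize(raw, ALLOWED_FIELDS, "customer_id", SECRET_KEY)
print(sorted(training_row))
```

The allow-list makes the necessity judgment explicit in code: a field reaches the training set only if someone has added it to `ALLOWED_FIELDS`.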

Data labeling deserves its own governance attention because labels define the target the model learns to predict, and flawed labels silently cap the fairness and accuracy a model can reach. Governance covers annotator selection and training, clear labeling guidelines, inter-annotator agreement measurement, and adjudication of disputed labels. It also requires awareness that a chosen proxy label (for example, "arrested" standing in for "committed a crime," or "clicked" standing in for "found useful") may not measure the true construct of interest, which is a subtle but consequential source of measurement bias.

Two related governance duties round out the stage. First, third-party and web-scraped data carry provenance and licensing risk: teams must confirm that they have the right to use the data for training and that it does not unlawfully embed others' copyrighted or personal content. Second, data versioning (snapshotting exactly which data trained which model) is what makes a later incident reproducible and a regulator's questions answerable. Together these practices ensure the data foundation is not just usable but defensible.
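Inter-annotator agreement, mentioned above, is often summarized with a chance-corrected statistic. The sketch below uses Cohen's kappa for two annotators as one such measure; the choice of statistic is an assumption for illustration, and the "risk" labels are made-up data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items both annotators labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with the same
    # label frequencies they actually used.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical "risk" labels from two annotators on the same eight items.
annotator_1 = ["high", "high", "low", "low", "high", "low", "low", "high"]
annotator_2 = ["high", "low", "low", "low", "high", "low", "high", "high"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on six of eight items, but half of that agreement would be expected by chance, so the corrected score is noticeably lower than the raw agreement rate. What counts as acceptable agreement is a judgment the labeling guidelines should state in advance.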

IAPP Artificial Intelligence Governance Professional

AIGP

4.3 Data governance: quality, bias & provenance
