3.5 Data Quality Remediation

Key Takeaways

Deduplicate on the business key (not the whole row) and keep the most recent or authoritative record when duplicates exist.
Handle nulls deliberately: impute a default, carry forward, derive, or reject the row — never silently aggregate over unexplained nulls.
Standardize before you join: fix data types, trim/case-normalize text, and unify formats (dates, codes, units) so keys actually match.
Detect schema drift and outliers early; route bad records to a quarantine/error output rather than failing the whole load.
Validation should run as enforceable rules (constraints, expectations, or check queries) so quality is measured, not assumed.

Last updated: June 2026

Why Data Quality Is Scored Heavily

Within the 45-50% Prepare data domain, several items describe dirty source data and ask for the correct remediation. Wrong answers are usually destructive (drop everything) or naive (ignore nulls); right answers are deliberate, preserving the authoritative record while isolating bad data. The unifying principle the exam rewards: isolate and preserve, do not destroy.

The Core Remediation Patterns

1. Deduplication

Duplicates almost always mean the same business entity arrived more than once, not byte-identical rows. Remediate by:

Identifying the business key (e.g., CustomerId, OrderNumber).
Ranking duplicates by a recency or authority column (load timestamp, source priority).
Keeping one row per key (the latest/authoritative) and discarding or quarantining the rest.

Deduplicating on the entire row misses near-duplicates that differ only in a load timestamp or a trailing space — a common distractor. In Spark this is a Window.partitionBy(businessKey).orderBy(loadTs.desc()) with row_number() = 1; in Power Query it is a group-and-keep-max.
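The ranking logic above can be sketched in plain Python so it runs anywhere. This is a minimal illustration, not the Spark or Power Query implementation: the helper name dedupe_latest and the sample rows are hypothetical, and ties on the timestamp keep the first row seen.

```python
from datetime import datetime

def dedupe_latest(rows, key, order_by):
    """Keep one row per business key: the one with the greatest order_by value."""
    latest = {}
    for row in rows:
        k = row[key]
        # Replace the kept row only when this one is strictly newer.
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())

# Hypothetical feed: CustomerId 1 was resent with a later load timestamp.
rows = [
    {"CustomerId": 1, "Name": "Ada", "LoadTimestamp": datetime(2026, 1, 1)},
    {"CustomerId": 1, "Name": "Ada L.", "LoadTimestamp": datetime(2026, 1, 2)},
    {"CustomerId": 2, "Name": "Bob", "LoadTimestamp": datetime(2026, 1, 1)},
]
deduped = dedupe_latest(rows, key="CustomerId", order_by="LoadTimestamp")
```

The two rows for CustomerId 1 are not byte-identical, so a whole-row dedup would keep both; keying on CustomerId keeps only the later one.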

2. Null and Missing Values

Nulls are a design decision, not an accident to ignore:

Strategy	When to use
Impute a default/sentinel	Missing categorical (e.g., `Unknown`)
Forward-fill / interpolate	Time-series gaps where continuity is valid
Derive from other columns	Value is computable
Reject / quarantine the row	Null in a required key or grain column

Never average or sum over unexplained nulls. Most engines skip nulls in SUM and AVG, so the result quietly reflects only the populated rows and distorts the measures the semantic model will surface later. A null in a grain or key column means the row cannot be trusted and should be quarantined, not defaulted.
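One way to picture the per-column strategies in the table is a single pass that applies a different rule to each column. The sketch below is illustrative only: the column names (ProductKey, Category, Reading, Quantity, UnitPrice, LineTotal) are hypothetical, and it assumes the rows are already in time order so that forward-fill is meaningful.

```python
def resolve_nulls(rows):
    """Apply one null rule per column; return (clean, quarantined)."""
    clean, quarantined = [], []
    last_reading = None
    for row in rows:
        row = dict(row)  # work on a copy so the input stays untouched
        if row["ProductKey"] is None:
            # Required key column: the row cannot be trusted, so isolate it.
            quarantined.append(row)
            continue
        if row["Category"] is None:
            row["Category"] = "Unknown"  # descriptive column: impute a default
        if row["Reading"] is None:
            row["Reading"] = last_reading  # forward-fill (stays None if no prior value)
        last_reading = row["Reading"]
        if row["LineTotal"] is None and None not in (row["Quantity"], row["UnitPrice"]):
            row["LineTotal"] = row["Quantity"] * row["UnitPrice"]  # derive from other columns
        clean.append(row)
    return clean, quarantined

rows = [
    {"ProductKey": 10, "Category": None, "Reading": 5.0,
     "Quantity": 2, "UnitPrice": 3.0, "LineTotal": None},
    {"ProductKey": None, "Category": "Toys", "Reading": 6.0,
     "Quantity": 1, "UnitPrice": 1.0, "LineTotal": 1.0},
    {"ProductKey": 11, "Category": "Toys", "Reading": None,
     "Quantity": 1, "UnitPrice": 4.0, "LineTotal": 4.0},
]
clean, quarantined = resolve_nulls(rows)
```

Note that the null ProductKey row is never defaulted; it leaves the main flow before any other rule is applied.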

3. Type and Standardization

Before any join or aggregation:

Type conversion — text-typed numbers/dates converted to proper types; failed conversions routed to an error path rather than coerced to null.
Standardization — trim whitespace, normalize case, unify date formats, and canonicalize codes/units (USA/US/United States -> one value).

Keys only match after standardization; an inner join on inconsistently cased or padded strings silently drops rows and undercounts your facts.
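A rough sketch of both steps, type conversion with an error path and text standardization, might look like the following. The country mapping, the accepted date formats, and the column names are assumptions made for the example, and every incoming field is assumed to be a string.

```python
from datetime import datetime

# Hypothetical canonical mapping and accepted date formats.
COUNTRY_CODES = {"USA": "US", "US": "US", "UNITED STATES": "US"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text):
    """Try each accepted format; raise ValueError if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def standardize(rows):
    """Return (clean, errors); failed conversions are routed, not nulled."""
    clean, errors = [], []
    for row in rows:
        try:
            country = row["Country"].strip().upper()
            clean.append({
                "Sku": row["Sku"].strip().upper(),
                "Country": COUNTRY_CODES[country],  # KeyError if unmapped
                "Quantity": int(row["Quantity"]),   # ValueError if not numeric
                "OrderDate": parse_date(row["OrderDate"]),
            })
        except (KeyError, ValueError) as exc:
            errors.append({**row, "Error": str(exc)})
    return clean, errors

rows = [
    {"Sku": " a100 ", "Country": "usa", "Quantity": "3", "OrderDate": "2026-01-05"},
    {"Sku": "B200", "Country": "United States", "Quantity": "three", "OrderDate": "05/01/2026"},
]
clean, errors = standardize(rows)
```

The second row keeps its original values plus an Error column, so the failed conversion can be investigated rather than disappearing as a null.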

Schema Drift, Outliers, and Late Data

Schema drift — the source added, removed, or renamed columns. Detect and handle it (map explicitly or fail loudly) rather than letting silent column mismatches corrupt the load.
Outliers — values outside plausible ranges (negative quantities, dates in the year 2999). Flag or quarantine; do not auto-delete unless the business rule is certain.
Late-arriving data — events that arrive after their period closed. Use watermarks/incremental logic so late records are reprocessed correctly instead of lost or double-counted.
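The three checks above can be sketched as small functions. This is one possible interpretation under stated assumptions: the expected column set, the plausible quantity range, and the function names are hypothetical, and late data is handled by filtering on load time and upserting by key, which is only one of several valid watermark designs.

```python
from datetime import datetime

EXPECTED_COLUMNS = {"OrderNumber", "Quantity", "LoadTimestamp"}  # hypothetical contract

def detect_drift(actual_columns):
    """Compare incoming columns with the expected contract."""
    actual = set(actual_columns)
    return {
        "missing": sorted(EXPECTED_COLUMNS - actual),
        "unexpected": sorted(actual - EXPECTED_COLUMNS),
    }

def split_outliers(rows, column, low, high):
    """Flag values outside [low, high] instead of deleting them."""
    plausible, flagged = [], []
    for row in rows:
        (plausible if low <= row[column] <= high else flagged).append(row)
    return plausible, flagged

def apply_increment(target, batch, watermark):
    """Upsert rows loaded after the watermark; return the new watermark.

    Filtering on load time (not event time) picks up late events, and
    upserting by key means a replayed row overwrites instead of double counting.
    """
    new_rows = sorted(
        (r for r in batch if r["LoadTimestamp"] > watermark),
        key=lambda r: r["LoadTimestamp"],
    )
    for row in new_rows:
        target[row["OrderNumber"]] = row
    return max((r["LoadTimestamp"] for r in new_rows), default=watermark)

# A renamed column (Quantity -> Qty) shows up as one missing and one unexpected.
drift = detect_drift(["OrderNumber", "Qty", "LoadTimestamp"])

batch = [
    {"OrderNumber": "O1", "Quantity": 3, "LoadTimestamp": datetime(2026, 6, 2)},
    {"OrderNumber": "O2", "Quantity": -4, "LoadTimestamp": datetime(2026, 6, 2)},
    {"OrderNumber": "O1", "Quantity": 5, "LoadTimestamp": datetime(2026, 6, 3)},
]
plausible, flagged = split_outliers(batch, "Quantity", low=0, high=10_000)

target = {}
watermark = apply_increment(target, plausible, datetime(2026, 6, 1))
# Replaying the same batch finds nothing newer than the watermark.
replayed = apply_increment(target, plausible, watermark)
```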

Validation as Enforceable Rules

Quality must be measured, not assumed. Implement validation as explicit rules:

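Constraints in the warehouse: NOT NULL is enforced, but PRIMARY KEY and UNIQUE constraints in a Fabric warehouse must be declared NOT ENFORCED, so uniqueness still has to be verified with a check query.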
Expectation or check queries in notebooks/pipelines (row counts, key uniqueness, referential integrity).
Quarantine / error output so failing records land in a separate table for review while good records flow through — the exam strongly prefers this over failing the entire pipeline.
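As a rough sketch of rule-based validation with a quarantine output, the snippet below runs named checks against each row and records which rules failed. The rule names, the known_products set standing in for a product dimension, and the column names are all hypothetical.

```python
def validate(rows, rules):
    """Run (name, predicate) rules per row; return (passed, quarantined, failures)."""
    passed, quarantined = [], []
    failures = {name: 0 for name, _ in rules}
    for row in rows:
        failed = [name for name, check in rules if not check(row)]
        for name in failed:
            failures[name] += 1
        if failed:
            # Keep the row and the reasons so it can be reviewed later.
            quarantined.append({**row, "FailedRules": failed})
        else:
            passed.append(row)
    return passed, quarantined, failures

known_products = {10, 11}  # stand-in for keys present in the product dimension
rules = [
    ("product_key_not_null", lambda r: r["ProductKey"] is not None),
    # Referential integrity; null keys are reported by the rule above instead.
    ("product_key_in_dimension",
     lambda r: r["ProductKey"] is None or r["ProductKey"] in known_products),
    ("quantity_positive", lambda r: r["Quantity"] > 0),
]
rows = [
    {"OrderNumber": "O1", "ProductKey": 10, "Quantity": 2},
    {"OrderNumber": "O2", "ProductKey": None, "Quantity": 1},
    {"OrderNumber": "O3", "ProductKey": 99, "Quantity": -5},
]
passed, quarantined, failures = validate(rows, rules)
```

The failures dictionary is the measured part: it gives a count per rule that can be logged on every run, while the good row still loads.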

Quality Rules vs. Anti-Patterns

Good (exam-preferred)	Anti-pattern (distractor)
Quarantine failing rows, load the rest	Fail the whole pipeline on any bad row
Dedup on business key, keep authoritative	Delete only byte-identical rows
Default/derive nulls per column rule	Aggregate silently over nulls
Map/handle schema drift explicitly	Let column mismatches load silently

Remediation Decision List

1. Profile first: row counts, distinct keys, null rates, value ranges.
2. Standardize types and formats.
3. Deduplicate on the business key, keeping the authoritative row.
4. Resolve nulls per column strategy.
5. Validate with rules; route failures to quarantine.
6. Load clean data; report quality metrics.

The consistent exam principle: quarantining bad rows keeps the pipeline running and keeps an audit trail, which beats both "drop the rows" and "fail the load."
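The "profile first" step in the list can be approximated with a few lines of plain Python. The function name profile and the sample columns are hypothetical; in Fabric the same figures would more likely come from Power Query's profiling views or a Spark aggregation.

```python
def profile(rows, key, numeric_column):
    """Summarize row count, distinct keys, null rates, and a value range."""
    total = len(rows)
    columns = sorted({column for row in rows for column in row})
    null_rates = {
        column: (sum(1 for row in rows if row.get(column) is None) / total if total else 0.0)
        for column in columns
    }
    values = [row[numeric_column] for row in rows if row.get(numeric_column) is not None]
    return {
        "row_count": total,
        "distinct_keys": len({row[key] for row in rows if row.get(key) is not None}),
        "null_rates": null_rates,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }

rows = [
    {"Sku": "A100", "Price": 10.0},
    {"Sku": "A100", "Price": None},
    {"Sku": "B200", "Price": -1.0},
    {"Sku": None, "Price": 5.0},
]
report = profile(rows, key="Sku", numeric_column="Price")
```

Here four rows but only two distinct keys hints at duplicates, the null rates point at columns needing a null strategy, and the negative minimum suggests an outlier rule.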

Where Each Remediation Lives in Fabric

DP-600 also tests where you implement these patterns. Match the cleaning step to the engine:

Dataflow Gen2 (Power Query): interactive profiling (column quality, column distribution, and column profile), Trim, Clean, type changes, replace-values, and remove-duplicates for low-code analysts.
Spark notebook: dropDuplicates, window-function dedup, na.fill/na.drop, regex standardization, and writing rejected rows to a separate quarantine Delta table at scale.
Warehouse T-SQL: NOT NULL constraints (enforced), PRIMARY KEY and UNIQUE constraints (declared NOT ENFORCED in a Fabric warehouse, so they document intent rather than block duplicates), MERGE for upserts, and check queries grouped by the key (SELECT COUNT(*) ... HAVING COUNT(*) > 1) for key uniqueness.
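The shape of a key-uniqueness check query can be tried locally with Python's built-in sqlite3 module. SQLite only stands in for a warehouse here and behaves differently in many respects; the table name product_stage and its columns are hypothetical. The GROUP BY ... HAVING query itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_stage (Sku TEXT NOT NULL, Category TEXT)")
conn.executemany(
    "INSERT INTO product_stage VALUES (?, ?)",
    [("A100", "Tools"), ("A100", "Tools"), ("B200", "Toys")],
)

# Check query: any key that appears more than once violates uniqueness.
duplicates = conn.execute(
    "SELECT Sku, COUNT(*) AS n FROM product_stage "
    "GROUP BY Sku HAVING COUNT(*) > 1"
).fetchall()

# The NOT NULL constraint rejects a null key at insert time.
try:
    conn.execute("INSERT INTO product_stage VALUES (?, ?)", (None, "Toys"))
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True
```

An empty result from the check query is the pass condition; any returned key is a candidate for dedup or quarantine.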

A Worked Remediation Scenario

A daily product feed arrives with mixed casing (usa vs USA), padded SKUs (" A100 "), some null Category values, and occasional resends. The correct pipeline: trim and upper-case the country and SKU so keys match; deduplicate on SKU, keeping the latest LoadTimestamp; default null Category to Unknown because it is descriptive, not a grain column; then validate SKU uniqueness and route any row still failing to a quarantine table. Only after those steps do clean rows merge into the product dimension, and the count of quarantined records is logged as a quality metric.

This sequence — standardize, dedup, resolve nulls, validate, quarantine, load — is the pattern the exam expects you to recognize and reproduce.
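Put together, the sequence might be sketched as a single function over the product feed. This is a simplified illustration in plain Python rather than a Fabric implementation: the function name remediate_products and the sample feed are hypothetical, and rows with a missing SKU are quarantined before dedup because a null key cannot be ranked against other rows.

```python
from collections import Counter
from datetime import datetime

def remediate_products(feed):
    """Return (clean_rows, quarantined_rows, metrics) for a product feed."""
    quarantined = []

    # 1. Standardize: trim and upper-case SKU and country so keys match.
    standardized = []
    for row in feed:
        row = dict(row)
        row["Sku"] = (row["Sku"] or "").strip().upper()
        row["Country"] = (row["Country"] or "").strip().upper()
        if not row["Sku"]:
            quarantined.append({**row, "Reason": "missing SKU"})
        else:
            standardized.append(row)

    # 2. Deduplicate on SKU, keeping the latest LoadTimestamp.
    latest = {}
    for row in standardized:
        kept = latest.get(row["Sku"])
        if kept is None or row["LoadTimestamp"] > kept["LoadTimestamp"]:
            latest[row["Sku"]] = row

    # 3. Resolve nulls: Category is descriptive, so default it.
    clean = []
    for row in latest.values():
        if row["Category"] is None:
            row["Category"] = "Unknown"
        clean.append(row)

    # 4. Validate SKU uniqueness; anything still duplicated is quarantined.
    counts = Counter(row["Sku"] for row in clean)
    quarantined += [{**r, "Reason": "duplicate SKU"} for r in clean if counts[r["Sku"]] > 1]
    clean = [r for r in clean if counts[r["Sku"]] == 1]

    metrics = {"input_rows": len(feed), "clean_rows": len(clean),
               "quarantined_rows": len(quarantined)}
    return clean, quarantined, metrics

feed = [
    {"Sku": " a100 ", "Country": "usa", "Category": None, "LoadTimestamp": datetime(2026, 6, 1)},
    {"Sku": "A100", "Country": "USA", "Category": "Tools", "LoadTimestamp": datetime(2026, 6, 2)},
    {"Sku": "b200", "Country": "USA", "Category": None, "LoadTimestamp": datetime(2026, 6, 1)},
    {"Sku": None, "Country": "usa", "Category": "Toys", "LoadTimestamp": datetime(2026, 6, 1)},
]
clean, quarantined, metrics = remediate_products(feed)
```

The clean rows are what would merge into the product dimension, and metrics carries the quarantined row count that the scenario logs as a quality metric.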

Test Your Knowledge

A daily customer feed contains multiple rows for some customers because the source resends records. Each row has a LoadTimestamp. Which remediation is most appropriate before loading the customer dimension?

Delete every row that is not perfectly identical to another row

Deduplicate on the CustomerId business key, keeping the row with the latest LoadTimestamp

Keep all rows so no data is lost and aggregate them later

Fail the pipeline whenever any duplicate CustomerId is detected

Test Your Knowledge

A pipeline loads a warehouse fact table. Occasionally a few source rows have a null in the mandatory ProductKey grain column. The team wants the load to keep running and to retain bad rows for later investigation. What is the best approach?

Default the null ProductKey to 0 and aggregate the rows into the fact table

Abort the entire pipeline run whenever any null ProductKey appears

Route rows with null ProductKey to a quarantine/error table and load the valid rows

Drop the rows silently so no error is reported
