3.5 Data Quality Remediation
Key Takeaways
- Deduplicate on the business key (not the whole row) and keep the most recent or authoritative record when duplicates exist.
- Handle nulls deliberately: impute a default, carry forward, derive, or reject the row — never silently aggregate over unexplained nulls.
- Standardize before you join: fix data types, trim/case-normalize text, and unify formats (dates, codes, units) so keys actually match.
- Detect schema drift and outliers early; route bad records to a quarantine/error output rather than failing the whole load.
- Validation should run as enforceable rules (constraints, expectations, or check queries) so quality is measured, not assumed.
Why Data Quality Is Scored Heavily
Within the 45-50% Prepare data domain, several items describe dirty source data and ask for the correct remediation. Wrong answers are usually destructive (drop everything) or naive (ignore nulls); right answers are deliberate and preserve the authoritative record while isolating bad data.
The Core Remediation Patterns
1. Deduplication
Duplicates almost always mean the same business entity arrived more than once, not identical rows. Remediate by:
- Identifying the business key (e.g.,
CustomerId,OrderNumber). - Ranking duplicates by a recency or authority column (load timestamp, source priority).
- Keeping one row per key (the latest/authoritative) and discarding or quarantining the rest.
Deduplicating on the entire row misses near-duplicates that differ only in a load timestamp — a common distractor.
2. Null and Missing Values
Nulls are a design decision, not an accident to ignore:
| Strategy | When to use |
|---|---|
| Impute a default/sentinel | Missing categorical (e.g., Unknown) |
| Forward-fill / interpolate | Time-series gaps where continuity is valid |
| Derive from other columns | Value is computable |
| Reject / quarantine the row | Null in a required key or grain column |
Never average or sum over unexplained nulls — it silently distorts measures the semantic model will surface later.
3. Type and Standardization
Before any join or aggregation:
- Type conversion — text-typed numbers/dates converted to proper types; failed conversions routed to an error path.
- Standardization — trim whitespace, normalize case, unify date formats, canonicalize codes/units (
USA/US/United States→ one value).
Keys only match after standardization; a join on inconsistent casing or padded strings silently drops rows.
Schema Drift, Outliers, and Late Data
- Schema drift — the source added, removed, or renamed columns. Detect and handle it (map or fail explicitly) rather than letting silent column mismatches corrupt the load.
- Outliers — values outside plausible ranges (negative quantities, impossible dates). Flag or quarantine; do not auto-delete unless the rule is certain.
- Late-arriving data — events that arrive after their period closed. Use watermarks/incremental logic so late records are reprocessed correctly instead of lost.
Validation as Enforceable Rules
Quality must be measured, not assumed. Implement validation as explicit rules:
- Constraints / not-null / uniqueness checks in the warehouse.
- Expectation or check queries in notebooks/pipelines (row counts, key uniqueness, referential integrity).
- Quarantine / error output so failing records land in a separate table for review while good records flow through — the exam strongly prefers this over failing the entire pipeline.
Remediation Decision List
- Profile first — counts, distinct keys, null rates, value ranges.
- Standardize types and formats.
- Deduplicate on the business key, keeping the authoritative row.
- Resolve nulls per column strategy.
- Validate with rules; route failures to quarantine.
- Load clean data; report quality metrics.
The consistent exam principle: isolate and preserve, do not destroy. Quarantining bad rows keeps the pipeline running and keeps an audit trail, which beats both "drop the rows" and "fail the load."
A daily customer feed contains multiple rows for some customers because the source resends records. Each row has a LoadTimestamp. Which remediation is most appropriate before loading the customer dimension?