5.6 Data Quality Best Practices
Key Takeaways
- Data quality should be enforced at every layer of the medallion architecture using different strategies appropriate to each layer.
- Bronze layer quality focuses on completeness (all data arrives) and adds metadata for traceability.
- Silver layer quality enforces schema, removes duplicates, validates data types, and applies business rules.
- Gold layer quality validates aggregation correctness, referential integrity, and business metric consistency.
- Combine Lakeflow expectations, SQL assertions, and custom monitoring for a comprehensive data quality strategy.
Data Quality Best Practices
Quick Answer: Enforce data quality at every medallion layer: completeness at bronze, validation and cleansing at silver, and business rule correctness at gold. Use Lakeflow expectations for pipeline-level enforcement, SQL constraints for table-level rules, and SQL Alerts for monitoring.
Quality by Medallion Layer
Bronze Layer Quality
| Quality Dimension | Strategy |
|---|---|
| Completeness | Verify all expected source files arrived |
| Traceability | Add ingestion timestamp, source file path, batch ID |
| Preservation | Store raw data as-is (no transformations that could lose data) |
| Rescued data | Use Auto Loader _rescued_data column for unparseable records |
```sql
-- Bronze: expectations for data arrival
CREATE OR REFRESH STREAMING TABLE bronze_orders (
  -- Warn if rescued data exists (indicates source schema changes)
  CONSTRAINT no_rescued_data EXPECT (_rescued_data IS NULL)
)
AS SELECT * FROM STREAM read_files('/data/raw/orders/');
```
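The completeness check from the table above (verify all expected source files arrived) can be sketched as a simple set comparison. This is an illustrative standalone Python sketch, not a Databricks API; the function and file names are hypothetical.

```python
# Illustrative bronze-layer completeness check: compare the source files
# expected for a batch against what actually landed in the raw path.
# File names here are hypothetical examples.
def missing_files(expected_files, arrived_files):
    """Return expected source files that never arrived, sorted for reporting."""
    return sorted(set(expected_files) - set(arrived_files))

expected = ["orders_2024-06-01.json", "orders_2024-06-02.json", "orders_2024-06-03.json"]
arrived = ["orders_2024-06-01.json", "orders_2024-06-03.json"]

gaps = missing_files(expected, arrived)  # ["orders_2024-06-02.json"]
```

In a real pipeline the same comparison would typically run against a file manifest or the `cloud_files` ingestion log, with the result feeding an alert when the gap list is non-empty.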
Silver Layer Quality
| Quality Dimension | Strategy |
|---|---|
| Validity | Data types are correct, values are in expected ranges |
| Uniqueness | No duplicate records for a given key |
| Consistency | Referential integrity with other tables |
| Completeness | Required fields are not null |
| Timeliness | Data arrives within expected time windows |
```sql
-- Silver: comprehensive quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
  -- Required fields
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW,
  -- Range validation
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_quantity EXPECT (quantity BETWEEN 1 AND 10000) ON VIOLATION DROP ROW,
  -- Business rules
  CONSTRAINT valid_status EXPECT (order_status IN ('pending', 'confirmed', 'shipped', 'delivered', 'cancelled')) ON VIOLATION DROP ROW,
  -- Date validation
  CONSTRAINT valid_date EXPECT (order_date >= '2020-01-01' AND order_date <= current_date()) ON VIOLATION DROP ROW
)
AS SELECT ... FROM STREAM(LIVE.bronze_orders);
```
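Note that expectations validate individual rows but do not enforce the uniqueness dimension from the table above; deduplication needs its own transformation (for example `dropDuplicates` or a window function keeping the latest record per key). The keep-latest logic can be sketched in plain Python; the record shape and field names here are hypothetical.

```python
# Illustrative silver-layer deduplication: keep the most recent record per
# order_id, mirroring what a window function ranked by ingestion timestamp
# would do in the pipeline. Field names are hypothetical.
def dedupe_latest(records, key="order_id", ts="ingest_ts"):
    """Return one record per key, keeping the newest by timestamp."""
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[ts] > latest[k][ts]:
            latest[k] = rec
    return list(latest.values())

rows = [
    {"order_id": 1, "ingest_ts": 10, "amount": 50.0},
    {"order_id": 1, "ingest_ts": 20, "amount": 55.0},  # newer duplicate wins
    {"order_id": 2, "ingest_ts": 15, "amount": 30.0},
]
deduped = dedupe_latest(rows)  # two records; order_id 1 keeps amount 55.0
```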
Gold Layer Quality
| Quality Dimension | Strategy |
|---|---|
| Correctness | Aggregations produce expected results |
| Consistency | Metrics match across related tables |
| Freshness | Data is updated within SLA |
| Coverage | All expected dimensions are represented |
```sql
-- Gold: validation queries (run as post-processing checks)
-- Check: daily total matches sum of individual orders
SELECT
  g.order_date,
  ABS(gold_total - silver_total) AS discrepancy
FROM (
  SELECT order_date, total_revenue AS gold_total FROM gold.daily_sales
) g
JOIN (
  SELECT order_date, SUM(amount) AS silver_total FROM silver.orders GROUP BY order_date
) s ON g.order_date = s.order_date
WHERE ABS(gold_total - silver_total) > 0.01;
-- Expected: 0 rows (no discrepancies)
```
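The reconciliation logic behind that query is simple enough to sketch outside SQL: compare per-day totals from the two layers and report any difference beyond a small tolerance. This Python sketch uses hypothetical sample data; the 0.01 tolerance matches the query above.

```python
# Illustrative gold-vs-silver reconciliation: report dates where the
# aggregated gold revenue disagrees with the recomputed silver total
# beyond a rounding tolerance. Sample data is hypothetical.
def reconcile_daily_totals(gold_totals, silver_totals, tolerance=0.01):
    """Return {date: discrepancy} for dates where totals disagree."""
    discrepancies = {}
    for day, gold in gold_totals.items():
        silver = silver_totals.get(day, 0.0)
        if abs(gold - silver) > tolerance:
            discrepancies[day] = round(gold - silver, 2)
    return discrepancies

gold = {"2024-06-01": 1000.00, "2024-06-02": 2500.00}
silver = {"2024-06-01": 1000.00, "2024-06-02": 2400.00}
issues = reconcile_daily_totals(gold, silver)  # {"2024-06-02": 100.0}
```

An empty result corresponds to the query's expected 0 rows; anything else is a signal that the gold aggregation and its silver source have drifted apart.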
Data Quality Monitoring Strategy
Layered Approach
| Layer | Tool | Enforcement |
|---|---|---|
| Pipeline level | Lakeflow expectations | Warn, drop, or fail on each record |
| Table level | Delta Lake constraints | CHECK and NOT NULL constraints |
| Job level | SQL assertion tasks | Post-processing validation queries |
| Platform level | SQL Alerts | Continuous monitoring with notifications |
Delta Lake Table Constraints
```sql
-- Add NOT NULL constraint
ALTER TABLE silver.orders ALTER COLUMN order_id SET NOT NULL;

-- Add CHECK constraint
ALTER TABLE silver.orders ADD CONSTRAINT positive_amount CHECK (amount > 0);

-- These constraints are enforced on every write — violations fail the write
```
SQL Assertion Tasks (Job-Level Checks)
```sql
-- Add as a task in Databricks Workflows
-- The task fails if the query returns any rows (indicating quality issues)
-- The outer filter matters: without it, the UNION ALL would return one row
-- per check even when every count is zero
SELECT issue, issue_count
FROM (
  SELECT 'DUPLICATE_ORDERS' AS issue, COUNT(*) AS issue_count
  FROM (
    SELECT order_id FROM silver.orders GROUP BY order_id HAVING COUNT(*) > 1
  )
  UNION ALL
  SELECT 'NULL_AMOUNTS' AS issue, COUNT(*) AS issue_count
  FROM silver.orders WHERE amount IS NULL
  UNION ALL
  SELECT 'FUTURE_DATES' AS issue, COUNT(*) AS issue_count
  FROM silver.orders WHERE order_date > current_date()
)
WHERE issue_count > 0;
```
Common Data Quality Metrics
| Metric | Calculation | Target |
|---|---|---|
| Null rate | COUNT(nulls) / COUNT(*) per column | < 1% for required fields |
| Duplicate rate | (COUNT(*) - COUNT(DISTINCT key)) / COUNT(*) | 0 |
| Freshness | current_timestamp() - MAX(event_time) | Within SLA |
| Expectation pass rate | passed / (passed + failed) | > 99% |
| Volume variance | Today's count vs. 7-day average | Within 2 std devs |
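The metrics in this table are straightforward to compute; the sketch below shows three of them (null rate, duplicate rate, and volume variance against a trailing window) in plain Python. Function names and sample data are illustrative, not a Databricks API.

```python
import statistics

def null_rate(values):
    """Fraction of null (None) entries in a column."""
    return sum(v is None for v in values) / len(values)

def duplicate_rate(keys):
    """Fraction of rows whose key duplicates an earlier row."""
    return (len(keys) - len(set(keys))) / len(keys)

def volume_anomaly(today_count, history, max_sigmas=2.0):
    """True if today's row count deviates from the trailing window
    by more than max_sigmas standard deviations."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(today_count - mean) > max_sigmas * sigma

null_rate([1, None, 3, 4])                               # 0.25
duplicate_rate([1, 1, 2, 3])                             # 0.25
volume_anomaly(500, [100, 110, 90, 105, 95, 100, 102])   # True (volume spike)
```

In practice these would be computed per table via scheduled SQL and surfaced through SQL Alerts, with the thresholds from the Target column driving the notification rules.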
On the Exam: Understand the layered quality approach — expectations in pipelines, constraints on tables, assertions in jobs, and alerts for monitoring. Know which layer (bronze/silver/gold) is appropriate for each type of quality check.
- At which medallion layer should deduplication of records be applied?
- A data engineer adds a CHECK constraint to a Delta table: ALTER TABLE orders ADD CONSTRAINT positive_amount CHECK (amount > 0). What happens when a write attempts to insert a row with amount = -5?
- Which combination of tools provides the most comprehensive data quality strategy in Databricks?