5.6 Data Quality Best Practices

Key Takeaways

  • Data quality should be enforced at every layer of the medallion architecture using different strategies appropriate to each layer.
  • Bronze layer quality focuses on completeness (all data arrives) and adds metadata for traceability.
  • Silver layer quality enforces schema, removes duplicates, validates data types, and applies business rules.
  • Gold layer quality validates aggregation correctness, referential integrity, and business metric consistency.
  • Combine Lakeflow expectations, SQL assertions, and custom monitoring for a comprehensive data quality strategy.
Last updated: March 2026

Quick Answer: Enforce data quality at every medallion layer: completeness at bronze, validation and cleansing at silver, and business rule correctness at gold. Use Lakeflow expectations for pipeline-level enforcement, SQL constraints for table-level rules, and SQL Alerts for monitoring.

Quality by Medallion Layer

Bronze Layer Quality

Quality Dimension | Strategy
Completeness | Verify all expected source files arrived
Traceability | Add ingestion timestamp, source file path, batch ID
Preservation | Store raw data as-is (no transformations that could lose data)
Rescued data | Use the Auto Loader _rescued_data column for unparseable records

-- Bronze: expectations for data arrival
CREATE OR REFRESH STREAMING TABLE bronze_orders (
    -- Warn if rescued data exists (indicates source schema changes);
    -- with no ON VIOLATION clause, violating rows are kept and the
    -- violation is only recorded in pipeline metrics
    CONSTRAINT no_rescued_data EXPECT (_rescued_data IS NULL)
)
AS SELECT * FROM STREAM read_files('/data/raw/orders/');
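
The traceability strategy above can be implemented directly at ingestion. A minimal sketch, assuming the built-in _metadata file-metadata column that read_files exposes; the table name and added column names are illustrative, and a true batch ID would normally be supplied by the orchestrator rather than computed in the query:

-- Bronze: add traceability metadata at ingestion (illustrative names)
CREATE OR REFRESH STREAMING TABLE bronze_orders_traced
AS SELECT
    *,
    current_timestamp() AS ingest_ts,                 -- ingestion timestamp
    _metadata.file_path AS source_file,               -- source file path
    _metadata.file_modification_time AS source_mtime  -- source file modified time
FROM STREAM read_files('/data/raw/orders/');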

Silver Layer Quality

Quality Dimension | Strategy
Validity | Data types are correct, values are in expected ranges
Uniqueness | No duplicate records for a given key
Consistency | Referential integrity with other tables
Completeness | Required fields are not null
Timeliness | Data arrives within expected time windows

-- Silver: comprehensive quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
    -- Required fields
    CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
    CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW,

    -- Range validation
    CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
    CONSTRAINT valid_quantity EXPECT (quantity BETWEEN 1 AND 10000) ON VIOLATION DROP ROW,

    -- Business rules
    CONSTRAINT valid_status EXPECT (order_status IN ('pending', 'confirmed', 'shipped', 'delivered', 'cancelled')) ON VIOLATION DROP ROW,

    -- Date validation
    CONSTRAINT valid_date EXPECT (order_date >= '2020-01-01' AND order_date <= current_date()) ON VIOLATION DROP ROW
)
AS SELECT ... FROM STREAM(LIVE.bronze_orders);
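
The expectations above handle validity, completeness, and business rules; the uniqueness dimension is usually handled with a dedicated deduplication step. A sketch using a window function, assuming an ingest_ts column like the one added at bronze; the view and column names are illustrative:

-- Silver: keep only the most recent record per order_id
CREATE OR REFRESH MATERIALIZED VIEW silver_orders_dedup AS
SELECT * EXCEPT (rn)
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ingest_ts DESC) AS rn
    FROM LIVE.silver_orders
)
WHERE rn = 1;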

Gold Layer Quality

Quality Dimension | Strategy
Correctness | Aggregations produce expected results
Consistency | Metrics match across related tables
Freshness | Data is updated within SLA
Coverage | All expected dimensions are represented

-- Gold: validation queries (run as post-processing checks)

-- Check: daily total matches sum of individual orders
SELECT
    g.order_date,
    ABS(gold_total - silver_total) AS discrepancy
FROM (
    SELECT order_date, total_revenue AS gold_total FROM gold.daily_sales
) g
JOIN (
    SELECT order_date, SUM(amount) AS silver_total FROM silver.orders GROUP BY order_date
) s ON g.order_date = s.order_date
WHERE ABS(gold_total - silver_total) > 0.01;
-- Expected: 0 rows (no discrepancies)
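
The same post-processing pattern covers the freshness dimension. A short sketch, assuming an illustrative 1-day SLA:

-- Check: gold table is fresh (illustrative 1-day SLA)
SELECT MAX(order_date) AS latest_date
FROM gold.daily_sales
HAVING MAX(order_date) < date_sub(current_date(), 1);
-- Expected: 0 rows (data is within SLA)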

Data Quality Monitoring Strategy

Layered Approach

Layer | Tool | Enforcement
Pipeline level | Lakeflow expectations | Warn, drop, or fail on each record
Table level | Delta Lake constraints | CHECK and NOT NULL constraints
Job level | SQL assertion tasks | Post-processing validation queries
Platform level | SQL Alerts | Continuous monitoring with notifications
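
As a sketch of the platform level: a SQL Alert runs a scheduled query and sends a notification when its condition is met. The query below implements the volume-variance idea from the metrics table further down; the table, 7-day window, and thresholds are illustrative:

-- Illustrative query behind a SQL Alert: daily volume variance
-- Configure the alert to notify when pct_of_7day_avg leaves an
-- acceptable band (e.g., below 50% or above 200%)
SELECT
    rows_today,
    avg_rows_7d,
    round(100 * try_divide(rows_today, avg_rows_7d), 1) AS pct_of_7day_avg
FROM (
    SELECT
        COUNT(*) FILTER (WHERE order_date = current_date()) AS rows_today,
        COUNT(*) FILTER (WHERE order_date >= date_sub(current_date(), 7)
                           AND order_date < current_date()) / 7.0 AS avg_rows_7d
    FROM silver.orders
);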

Delta Lake Table Constraints

-- Add NOT NULL constraint
ALTER TABLE silver.orders ALTER COLUMN order_id SET NOT NULL;

-- Add CHECK constraint
ALTER TABLE silver.orders ADD CONSTRAINT positive_amount CHECK (amount > 0);

-- These constraints are enforced on every write — violations fail the write
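
Note that ADD CONSTRAINT validates existing data: the statement itself fails if any current row already violates the rule. To inspect or remove a constraint later (names match the example above):

-- List constraints (CHECK constraints appear as delta.constraints.* properties)
SHOW TBLPROPERTIES silver.orders;

-- Remove a CHECK constraint
ALTER TABLE silver.orders DROP CONSTRAINT positive_amount;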

SQL Assertion Tasks (Job-Level Checks)

-- Add as a task in Databricks Workflows
-- The task fails if the query returns any rows (indicating quality issues);
-- each branch emits a row only when its check actually finds violations

SELECT 'DUPLICATE_ORDERS' AS issue, COUNT(*) AS issue_count
FROM (
    SELECT order_id FROM silver.orders GROUP BY order_id HAVING COUNT(*) > 1
) dup
HAVING COUNT(*) > 0
UNION ALL
SELECT 'NULL_AMOUNTS' AS issue, COUNT(*) AS issue_count
FROM silver.orders WHERE amount IS NULL
HAVING COUNT(*) > 0
UNION ALL
SELECT 'FUTURE_DATES' AS issue, COUNT(*) AS issue_count
FROM silver.orders WHERE order_date > current_date()
HAVING COUNT(*) > 0;

Common Data Quality Metrics

Metric | Calculation | Target
Null rate | COUNT(nulls) / COUNT(*) per column | < 1% for required fields
Duplicate rate | COUNT(*) - COUNT(DISTINCT key) | 0
Freshness | current_timestamp() - MAX(event_time) | Within SLA
Expectation pass rate | passed / (passed + failed) | > 99%
Volume variance | Today's count vs. 7-day average | Within 2 std devs
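
A sketch of how several of these metrics might be computed in one pass over the silver.orders example; column names follow the earlier snippets, and order_date stands in for event_time in the freshness metric:

-- Illustrative quality-metrics query for silver.orders
SELECT
    AVG(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_rate_customer_id,
    COUNT(*) - COUNT(DISTINCT order_id)                  AS duplicate_count,
    MAX(order_date)                                      AS latest_order_date
FROM silver.orders;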

On the Exam: Understand the layered quality approach — expectations in pipelines, constraints on tables, assertions in jobs, and alerts for monitoring. Know which layer (bronze/silver/gold) is appropriate for each type of quality check.

Test Your Knowledge

At which medallion layer should deduplication of records be applied?

Test Your Knowledge

A data engineer adds a CHECK constraint to a Delta table: ALTER TABLE orders ADD CONSTRAINT positive_amount CHECK (amount > 0). What happens when a write attempts to insert a row with amount = -5?

Test Your Knowledge

Which combination of tools provides the most comprehensive data quality strategy in Databricks?
