3.1 The Medallion Architecture (Bronze, Silver, Gold)

Key Takeaways

  • The medallion architecture organizes data into three progressive layers: bronze (raw), silver (cleansed), and gold (business-ready).
  • Bronze tables store raw data exactly as ingested from source systems, preserving the original format for reprocessing and auditing.
  • Silver tables contain validated, cleansed, deduplicated, and enriched data that provides an enterprise-wide view of business entities.
  • Gold tables contain highly aggregated and curated data optimized for specific business use cases, dashboards, and ML features.
  • Each layer incrementally improves data quality, and the architecture supports both batch and streaming processing patterns.
Last updated: March 2026


Quick Answer: The medallion architecture organizes Lakehouse data into three layers: Bronze (raw ingestion), Silver (cleansed and enriched), and Gold (aggregated business-ready). Data quality improves progressively through each layer, with each hop adding validation, transformation, and business logic.

Architecture Overview

The medallion architecture (also called multi-hop architecture) is the recommended data organization pattern for the Databricks Lakehouse:

| Layer | Purpose | Data Quality | Typical Users |
| --- | --- | --- | --- |
| Bronze | Raw data ingestion | Lowest: as-is from source | Data engineers |
| Silver | Cleansed, validated, enriched | Medium: enterprise-ready | Data engineers, analysts |
| Gold | Aggregated, business-ready | Highest: curated for use case | Analysts, executives, ML engineers |

Bronze Layer (Raw)

The bronze layer is the landing zone for all raw data:

Characteristics

  • Data is stored exactly as received from source systems
  • Minimal or no transformations applied
  • Preserves the full fidelity of source data
  • Serves as the single source of truth for reprocessing
  • Metadata columns added: ingestion timestamp, source file, batch ID

Implementation Pattern

-- Bronze table: raw data with metadata
CREATE TABLE catalog.bronze.raw_orders (
    -- Raw data columns (schema from source)
    order_data STRING,  -- Raw JSON string

    -- Metadata columns
    _ingest_timestamp TIMESTAMP,
    _source_file STRING,
    _rescued_data STRING
);

# Bronze ingestion with Auto Loader (PySpark Structured Streaming)
from pyspark.sql.functions import current_timestamp, input_file_name

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/bronze/orders/schema")
    .load("/data/raw/orders/")
    .withColumn("_ingest_timestamp", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .writeStream
    .option("checkpointLocation", "/checkpoints/bronze/orders")
    .trigger(availableNow=True)
    .toTable("catalog.bronze.raw_orders")
)

What Goes in Bronze?

  • Raw files from cloud storage (S3, ADLS, GCS)
  • CDC (Change Data Capture) events from databases
  • Streaming events from Kafka, Event Hubs, Kinesis
  • API responses stored as JSON strings
  • Log files and telemetry data
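Conceptually, a bronze row is just the raw payload plus ingestion metadata. A minimal sketch in plain Python (not PySpark; the function name and field names are illustrative, matching the metadata columns shown above):

```python
import json
from datetime import datetime, timezone

def to_bronze_record(raw_json: str, source_file: str) -> dict:
    """Wrap a raw API response as a bronze row: payload untouched, metadata added."""
    return {
        "order_data": raw_json,  # raw JSON kept verbatim; parsing is deferred to silver
        "_ingest_timestamp": datetime.now(timezone.utc).isoformat(),
        "_source_file": source_file,
    }

record = to_bronze_record('{"order_id": 1, "quantity": 2}', "orders/2026-03-01.json")
```

Note that the payload is stored as an unparsed string: if the source schema drifts, ingestion still succeeds, and reprocessing from bronze remains possible.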

Silver Layer (Cleansed)

The silver layer provides a validated, enriched enterprise view of the data:

Transformations Applied

| Transformation | Description | Example |
| --- | --- | --- |
| Schema enforcement | Apply a defined schema to raw data | Parse JSON into typed columns |
| Data cleansing | Fix or remove invalid records | Remove nulls, fix data types |
| Deduplication | Remove duplicate records | MERGE with dedup logic |
| Normalization | Standardize field values | Uppercase state codes, trim whitespace |
| Enrichment | Join with reference data | Add customer name from customer table |
| Filtering | Remove irrelevant records | Drop test/internal records |
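The first four transformations can be sketched together in plain Python (not PySpark; the field names follow the order examples in this section and are illustrative):

```python
import json

def to_silver(bronze_rows):
    """Sketch of bronze-to-silver: parse, cleanse, normalize, deduplicate."""
    latest = {}
    for row in bronze_rows:
        rec = json.loads(row["order_data"])           # schema enforcement: parse JSON
        if rec.get("order_id") is None:               # cleansing: drop invalid records
            continue
        rec["order_status"] = rec["order_status"].strip().upper()  # normalization
        key = rec["order_id"]
        ts = row["_ingest_timestamp"]
        if key not in latest or ts > latest[key][0]:  # deduplication: keep latest per key
            latest[key] = (ts, rec)
    return [rec for _, rec in latest.values()]

bronze = [
    {"order_data": '{"order_id": 1, "order_status": " shipped "}',
     "_ingest_timestamp": "2026-03-01T00:00:00"},
    {"order_data": '{"order_id": 1, "order_status": "delivered"}',
     "_ingest_timestamp": "2026-03-02T00:00:00"},
    {"order_data": '{"order_id": null, "order_status": "bad"}',
     "_ingest_timestamp": "2026-03-02T00:00:00"},
]
silver = to_silver(bronze)
```

The three bronze rows reduce to one silver record: the null-keyed row is dropped, and the later version of order 1 wins, which is exactly what the SQL MERGE pattern below expresses at scale.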

Implementation Pattern

-- Silver table: cleansed and typed
CREATE TABLE catalog.silver.orders (
    order_id BIGINT,
    customer_id BIGINT,
    product_id BIGINT,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2),
    order_date DATE,
    order_status STRING,
    updated_at TIMESTAMP
);

-- Upsert from bronze to silver with deduplication
-- (parse the raw JSON into typed columns, then keep the latest record per order_id)
MERGE INTO catalog.silver.orders AS target
USING (
    SELECT * EXCEPT (rn, _ingest_timestamp) FROM (
        SELECT parsed.*,
            _ingest_timestamp,
            ROW_NUMBER() OVER (PARTITION BY parsed.order_id ORDER BY _ingest_timestamp DESC) AS rn
        FROM (
            SELECT
                from_json(order_data,
                    'order_id BIGINT, customer_id BIGINT, product_id BIGINT, quantity INT,
                     unit_price DECIMAL(10,2), total_amount DECIMAL(10,2), order_date DATE,
                     order_status STRING, updated_at TIMESTAMP') AS parsed,
                _ingest_timestamp
            FROM catalog.bronze.raw_orders
        )
    ) WHERE rn = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Gold Layer (Business-Ready)

The gold layer contains curated, aggregated datasets for specific business use cases:

Characteristics

  • Data is organized by business domain or use case (not by source system)
  • Contains pre-computed aggregations, KPIs, and metrics
  • Optimized for query performance (indexed, partitioned, or clustered)
  • Directly consumed by BI dashboards, reports, and ML models

Implementation Pattern

-- Gold table: daily sales summary
CREATE TABLE catalog.gold.daily_sales_summary AS
SELECT
    o.order_date,
    p.product_category,
    c.customer_segment,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(o.total_amount) AS total_revenue,
    AVG(o.total_amount) AS avg_order_value,
    COUNT(DISTINCT o.customer_id) AS unique_customers
FROM catalog.silver.orders o
JOIN catalog.silver.products p ON o.product_id = p.product_id
JOIN catalog.silver.customers c ON o.customer_id = c.customer_id
GROUP BY o.order_date, p.product_category, c.customer_segment;
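The GROUP BY above is the essence of a gold table: pre-computing metrics so consumers never re-aggregate. A stripped-down sketch of the same aggregation in plain Python (field names are illustrative; the joins are assumed already done):

```python
from collections import defaultdict

def daily_sales_summary(orders):
    """Sketch of a gold aggregation: order count and revenue per (date, category)."""
    agg = defaultdict(lambda: {"total_orders": 0, "total_revenue": 0.0})
    for o in orders:
        row = agg[(o["order_date"], o["product_category"])]
        row["total_orders"] += 1
        row["total_revenue"] += o["total_amount"]
    return dict(agg)

summary = daily_sales_summary([
    {"order_date": "2026-03-01", "product_category": "books", "total_amount": 10.0},
    {"order_date": "2026-03-01", "product_category": "books", "total_amount": 15.0},
    {"order_date": "2026-03-01", "product_category": "toys",  "total_amount": 7.5},
])
```

A dashboard reading this gold table scans a handful of pre-aggregated rows instead of every silver order, which is where the query-performance benefit comes from.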

Design Principles

ELT, Not ETL

In the Lakehouse, the approach is ELT (Extract, Load, Transform):

  1. Extract data from sources and Load it into Bronze (minimal transformation)
  2. Transform progressively from Bronze → Silver → Gold
  3. Complex business logic is applied after data is in the Lakehouse, not before

Why Multi-Hop?

  • Reprocessing: If a transformation bug is found, fix the Silver logic and rerun from Bronze
  • Traceability: Full lineage from raw source to business metric
  • Decoupled concerns: Ingestion team manages Bronze; analytics team manages Gold
  • Incremental processing: Each hop can process only new or changed data
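The incremental-processing point can be made concrete with a high-water-mark sketch in plain Python (Auto Loader and Structured Streaming checkpoints do this bookkeeping for you; the function and field names here are illustrative):

```python
def incremental_batch(rows, watermark):
    """Sketch of incremental processing: take only rows newer than the watermark."""
    new_rows = [r for r in rows if r["_ingest_timestamp"] > watermark]
    # Advance the watermark so the next run skips everything processed here
    new_watermark = max((r["_ingest_timestamp"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [
    {"_ingest_timestamp": "2026-03-01T00:00:00"},
    {"_ingest_timestamp": "2026-03-02T00:00:00"},
]
batch, wm = incremental_batch(rows, "2026-03-01T00:00:00")
```

Each hop keeps its own watermark (or checkpoint), so bronze-to-silver and silver-to-gold can run on independent schedules and each touch only new or changed data.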

On the Exam: The medallion architecture is a foundational concept. Expect questions about which layer is appropriate for a given task, what transformations happen at each layer, and why the architecture uses progressive quality improvement.

[Diagram: Medallion Architecture Data Flow]
Test Your Knowledge

A data engineer needs to store raw API responses as JSON strings with ingestion metadata. Which medallion layer should this data be stored in?

Test Your Knowledge

Which transformations are typically applied when moving data from bronze to silver?

Test Your Knowledge

Why does the Lakehouse follow an ELT approach instead of ETL?
