3.1 The Medallion Architecture (Bronze, Silver, Gold)
Key Takeaways
- The medallion architecture organizes data into three progressive layers: bronze (raw), silver (cleansed), and gold (business-ready).
- Bronze tables store raw data exactly as ingested from source systems, preserving the original format for reprocessing and auditing.
- Silver tables contain validated, cleansed, deduplicated, and enriched data that provides an enterprise-wide view of business entities.
- Gold tables contain highly aggregated and curated data optimized for specific business use cases, dashboards, and ML features.
- Each layer incrementally improves data quality, and the architecture supports both batch and streaming processing patterns.
Quick Answer: The medallion architecture organizes Lakehouse data into three layers: Bronze (raw ingestion), Silver (cleansed and enriched), and Gold (aggregated, business-ready). Data quality improves progressively through each layer, with each hop adding validation, transformation, and business logic.
Architecture Overview
The medallion architecture (also called multi-hop architecture) is the recommended data organization pattern for the Databricks Lakehouse:
| Layer | Purpose | Data Quality | Typical Users |
|---|---|---|---|
| Bronze | Raw data ingestion | Lowest — as-is from source | Data engineers |
| Silver | Cleansed, validated, enriched | Medium — enterprise-ready | Data engineers, analysts |
| Gold | Aggregated, business-ready | Highest — curated for use case | Analysts, executives, ML engineers |
Bronze Layer (Raw)
The bronze layer is the landing zone for all raw data:
Characteristics
- Data is stored exactly as received from source systems
- Minimal or no transformations applied
- Preserves the full fidelity of source data
- Serves as the single source of truth for reprocessing
- Metadata columns added: ingestion timestamp, source file, batch ID
Implementation Pattern
-- Bronze table: raw data with metadata
CREATE TABLE catalog.bronze.raw_orders (
-- Raw data columns (schema from source)
order_data STRING, -- Raw JSON string
-- Metadata columns
_ingest_timestamp TIMESTAMP,
_source_file STRING,
_rescued_data STRING
);
-- Ingest with Auto Loader
-- (typically done via PySpark streaming)
# Bronze ingestion with Auto Loader
from pyspark.sql.functions import current_timestamp, input_file_name
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/bronze/orders/schema")
.load("/data/raw/orders/")
.withColumn("_ingest_timestamp", current_timestamp())
.withColumn("_source_file", input_file_name())
.writeStream
.option("checkpointLocation", "/checkpoints/bronze/orders")
.trigger(availableNow=True)
.toTable("catalog.bronze.raw_orders")
)
What Goes in Bronze?
- Raw files from cloud storage (S3, ADLS, GCS)
- CDC (Change Data Capture) events from databases
- Streaming events from Kafka, Event Hubs, Kinesis
- API responses stored as JSON strings
- Log files and telemetry data
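Because bronze stores payloads verbatim, a bronze row is essentially the raw payload plus a few metadata columns. A minimal plain-Python sketch of that shape (the `to_bronze_record` helper and the bucket path are hypothetical, for illustration only):

```python
import json
from datetime import datetime, timezone

def to_bronze_record(raw_payload: str, source_file: str) -> dict:
    """Wrap a raw source payload with ingestion metadata, leaving the payload untouched."""
    return {
        "order_data": raw_payload,  # raw JSON kept verbatim for reprocessing/auditing
        "_ingest_timestamp": datetime.now(timezone.utc).isoformat(),
        "_source_file": source_file,
    }

# Example: an API response lands in bronze exactly as received
api_response = '{"order_id": 1001, "quantity": 2, "status": "NEW"}'
record = to_bronze_record(api_response, "s3://landing/orders/2024-06-01.json")

# Nothing was parsed or cleaned; the original string is fully recoverable
assert record["order_data"] == api_response
assert json.loads(record["order_data"])["order_id"] == 1001
```

Keeping the payload as an unparsed string is the design choice that makes bronze the single source of truth: any parsing bug downstream can be fixed and replayed.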
Silver Layer (Cleansed)
The silver layer provides a validated, enriched enterprise view of the data:
Transformations Applied
| Transformation | Description | Example |
|---|---|---|
| Schema enforcement | Apply a defined schema to raw data | Parse JSON into typed columns |
| Data cleansing | Fix or remove invalid records | Remove nulls, fix data types |
| Deduplication | Remove duplicate records | MERGE with dedup logic |
| Normalization | Standardize field values | Uppercase state codes, trim whitespace |
| Enrichment | Join with reference data | Add customer name from customer table |
| Filtering | Remove irrelevant records | Drop test/internal records |
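The deduplication, cleansing, and normalization rows in the table above can be sketched in plain Python (no Spark required; `latest_per_key` and `cleanse` are hypothetical helpers that mirror the ROW_NUMBER-based dedup used in the SQL implementation pattern):

```python
def latest_per_key(records, key, order_by):
    """Deduplicate: keep only the most recent record per key (like ROW_NUMBER ... WHERE rn = 1)."""
    best = {}
    for r in records:
        k = r[key]
        if k not in best or r[order_by] > best[k][order_by]:
            best[k] = r
    return list(best.values())

def cleanse(record):
    """Cleanse and normalize one record; return None to drop invalid rows."""
    if record.get("order_id") is None:
        return None  # data cleansing: drop records with missing keys
    return {
        "order_id": int(record["order_id"]),                      # schema enforcement
        "order_status": record["order_status"].strip().upper(),   # normalization
    }

bronze = [
    {"order_id": 1,    "order_status": " new ",   "_ingest_timestamp": "2024-06-01T10:00:00"},
    {"order_id": 1,    "order_status": "shipped", "_ingest_timestamp": "2024-06-02T09:00:00"},
    {"order_id": None, "order_status": "test",    "_ingest_timestamp": "2024-06-02T09:05:00"},
]

silver = [c for c in (cleanse(r) for r in latest_per_key(bronze, "order_id", "_ingest_timestamp")) if c]
assert silver == [{"order_id": 1, "order_status": "SHIPPED"}]
```

Three bronze rows become one silver row: the duplicate keeps only its latest version, and the invalid record is dropped.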
Implementation Pattern
-- Silver table: cleansed and typed
CREATE TABLE catalog.silver.orders (
order_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(10,2),
order_date DATE,
order_status STRING,
updated_at TIMESTAMP
);
-- Upsert from bronze to silver: parse the raw JSON, deduplicate,
-- and exclude bronze metadata columns from the final column list
MERGE INTO catalog.silver.orders AS target
USING (
  SELECT order_id, customer_id, product_id, quantity, unit_price,
         total_amount, order_date, order_status, updated_at
  FROM (
    -- Deduplicate: keep the latest record per order_id
    SELECT parsed.*,
           ROW_NUMBER() OVER (PARTITION BY parsed.order_id
                              ORDER BY _ingest_timestamp DESC) AS rn
    FROM (
      SELECT from_json(order_data,
        'order_id BIGINT, customer_id BIGINT, product_id BIGINT, quantity INT,
         unit_price DECIMAL(10,2), total_amount DECIMAL(10,2), order_date DATE,
         order_status STRING, updated_at TIMESTAMP') AS parsed,
        _ingest_timestamp
      FROM catalog.bronze.raw_orders
    )
  )
  WHERE rn = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Gold Layer (Business-Ready)
The gold layer contains curated, aggregated datasets for specific business use cases:
Characteristics
- Data is organized by business domain or use case (not by source system)
- Contains pre-computed aggregations, KPIs, and metrics
- Optimized for query performance (partitioned, Z-ordered, or liquid-clustered)
- Directly consumed by BI dashboards, reports, and ML models
Implementation Pattern
-- Gold table: daily sales summary (recomputed on a schedule)
CREATE OR REPLACE TABLE catalog.gold.daily_sales_summary AS
SELECT
o.order_date,
p.product_category,
c.customer_segment,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(o.total_amount) AS total_revenue,
AVG(o.total_amount) AS avg_order_value,
COUNT(DISTINCT o.customer_id) AS unique_customers
FROM catalog.silver.orders o
JOIN catalog.silver.products p ON o.product_id = p.product_id
JOIN catalog.silver.customers c ON o.customer_id = c.customer_id
GROUP BY o.order_date, p.product_category, c.customer_segment;
Design Principles
ELT, Not ETL
In the Lakehouse, the approach is ELT (Extract, Load, Transform):
- Extract data from sources and Load it into Bronze (minimal transformation)
- Transform progressively from Bronze → Silver → Gold
- Complex business logic is applied after data is in the Lakehouse, not before
Why Multi-Hop?
- Reprocessing: If a transformation bug is found, fix the Silver logic and rerun from Bronze
- Traceability: Full lineage from raw source to business metric
- Decoupled concerns: Ingestion team manages Bronze; analytics team manages Gold
- Incremental processing: Each hop can process only new or changed data
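The reprocessing benefit above can be made concrete with a minimal sketch: because bronze keeps the raw payloads, a buggy silver transformation can be fixed and replayed without touching the source system (`silver_v1` and `silver_v2` are hypothetical stand-ins for two versions of the hop logic):

```python
import json

# Bronze retains the raw inputs, so they can be replayed at any time
bronze = ['{"order_id": 1, "amount": "10.50"}', '{"order_id": 2, "amount": "7.25"}']

def silver_v1(raw):
    # Buggy first version: truncates amounts to whole units
    r = json.loads(raw)
    return {"order_id": r["order_id"], "amount": int(float(r["amount"]))}

def silver_v2(raw):
    # Fixed version: keeps full precision
    r = json.loads(raw)
    return {"order_id": r["order_id"], "amount": float(r["amount"])}

assert [silver_v1(r)["amount"] for r in bronze] == [10, 7]
# After the fix, simply rerun the hop from bronze, with no re-ingestion from the source
assert [silver_v2(r)["amount"] for r in bronze] == [10.5, 7.25]
```

With an ETL pipeline that transformed data before loading, the truncated amounts would be unrecoverable; here the fix is just a rerun.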
On the Exam: The medallion architecture is a foundational concept. Expect questions about which layer is appropriate for a given task, what transformations happen at each layer, and why the architecture uses progressive quality improvement.
Practice Questions
- A data engineer needs to store raw API responses as JSON strings with ingestion metadata. Which medallion layer should this data be stored in?
- Which transformations are typically applied when moving data from bronze to silver?
- Why does the Lakehouse follow an ELT approach instead of ETL?