3.1 The Medallion Architecture (Bronze, Silver, Gold)
Key Takeaways
- The medallion architecture organizes data into three progressive quality layers: bronze (raw), silver (validated/cleansed), and gold (business-ready aggregates).
- Bronze tables capture source data exactly as ingested, typically append-only, preserving the raw record so the pipeline can be reprocessed or audited later.
- Silver tables enforce schema, deduplicate, cast types, join, and apply quality rules to produce a conformed enterprise view of business entities.
- Gold tables hold curated, highly aggregated data modeled for dashboards, reporting, and ML features — often one gold table per use case.
- Each hop incrementally raises data quality, and the same layered design supports both batch and streaming ingestion in the lakehouse.
What the Medallion Architecture Solves
The medallion architecture (also called the multi-hop architecture) is a data design pattern that incrementally improves the structure and quality of data as it flows through successive layers in the lakehouse. Each layer is a set of Delta Lake tables, and data moves "hop by hop" from raw to refined. The exam expects you to identify which layer a given table belongs to from its characteristics, not just to recite definitions.
The pattern exists because raw source data is messy: it has malformed records, duplicates, inconsistent types, and missing values. Rather than fixing everything in one giant transformation, you isolate concerns across three named layers. This makes pipelines easier to debug, lets you reprocess from a known-good intermediate state, and gives different consumers a table at the right quality level for their needs.
| Layer | Also called | Quality | Typical write mode | Primary consumer |
|---|---|---|---|---|
| Bronze | Raw / landing | Lowest — as ingested | Append-only | Data engineers, reprocessing |
| Silver | Cleansed / conformed | Validated, deduplicated | Merge/upsert | Analysts, data scientists |
| Gold | Curated / aggregated | Highest — business-ready | Overwrite / merge | BI dashboards, ML, executives |
Bronze Layer: Raw Ingestion
The bronze layer stores data exactly as it arrives from source systems — JSON, CSV, Parquet, CDC feeds, IoT events — with little or no transformation. Bronze tables are almost always append-only and frequently add metadata columns such as the ingestion timestamp (_ingest_time), source file name (_metadata.file_path from Auto Loader), and processing date. Preserving the unmodified raw record is the whole point: if a downstream bug is discovered, you can replay everything from bronze without re-pulling from the (possibly transient) source.
Because bronze is the system of record for raw data, you generally do not delete or update bronze rows. Schema is kept permissive — Auto Loader can even rescue unexpected columns into a _rescued_data column rather than failing the load.
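The bronze behavior described above can be sketched in plain Python. This is only an illustration of the idea, not Auto Loader or Delta Lake code: a Python list stands in for the bronze table, and the names `EXPECTED_COLUMNS`, `to_bronze_row`, `append_to_bronze`, and `_source_file` are hypothetical. Values are stored exactly as they arrive (strings stay strings), metadata columns are added, and unexpected fields are kept in `_rescued_data` instead of failing the load.

```python
import json
from datetime import datetime, timezone

# Hypothetical expected columns for an orders feed (an assumption for this sketch).
EXPECTED_COLUMNS = ("order_id", "amount", "order_ts")


def to_bronze_row(raw_line, source_file, ingest_time):
    """Wrap one raw JSON line as a bronze row without altering its values."""
    record = json.loads(raw_line)
    row = {col: record.get(col) for col in EXPECTED_COLUMNS}
    # Unexpected fields are kept rather than dropped, loosely mirroring
    # the idea of a rescued-data column.
    extras = {k: v for k, v in record.items() if k not in EXPECTED_COLUMNS}
    row["_rescued_data"] = json.dumps(extras, sort_keys=True) if extras else None
    row["_ingest_time"] = ingest_time
    row["_source_file"] = source_file
    return row


def append_to_bronze(bronze_table, raw_lines, source_file):
    """Append-only load: existing rows are never updated or deleted."""
    ingest_time = datetime.now(timezone.utc).isoformat()
    bronze_table.extend(
        to_bronze_row(line, source_file, ingest_time) for line in raw_lines
    )
    return bronze_table


bronze_orders = []
append_to_bronze(
    bronze_orders,
    ['{"order_id": "1", "amount": "19.99", "order_ts": "2024-01-05T10:00:00"}'],
    "orders_001.json",
)
# The same order arrives again with an extra field; bronze keeps both copies.
append_to_bronze(
    bronze_orders,
    ['{"order_id": "1", "amount": "19.99", "order_ts": "2024-01-05T10:00:00", "coupon": "X1"}'],
    "orders_002.json",
)
```

Note that the duplicate is deliberately not removed here. Deduplication belongs to the next layer, so bronze stays a faithful record of what the source sent.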
Silver Layer: Cleansed and Conformed
The silver layer is where data becomes trustworthy. Transformations applied here typically include:
- Schema enforcement and explicit type casting (string timestamps to real TIMESTAMP, etc.).
- Deduplication of records that arrived multiple times.
- Null handling, filtering of invalid rows, and applying data-quality expectations.
- Joins and enrichment that combine multiple bronze sources into conformed business entities (a single customers or orders view across systems).
Silver gives the organization a clean, queryable, enterprise-wide view. It is detailed (row-level) rather than aggregated, so analysts and data scientists can slice it however they need. Silver tables are commonly written with MERGE INTO (upsert) so that updates and late-arriving changes are reflected correctly.
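A minimal sketch of the bronze-to-silver hop, again in plain Python rather than Spark SQL: a dictionary keyed by `order_id` stands in for the silver table, and `upsert_silver` imitates the effect of an upsert. The function names, the non-negative-amount rule, and the "latest `order_ts` wins" policy are assumptions chosen for illustration, not rules taken from the text.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def clean_order(bronze_row):
    """Cast one raw row to typed values; return None if it fails a quality rule."""
    try:
        order_id = int(bronze_row["order_id"])
        amount = Decimal(bronze_row["amount"])
        order_ts = datetime.fromisoformat(bronze_row["order_ts"])
        # Assumed quality expectation: amounts must not be negative.
        if amount < 0:
            return None
    except (KeyError, TypeError, ValueError, InvalidOperation):
        return None
    return {"order_id": order_id, "amount": amount, "order_ts": order_ts}


def upsert_silver(silver_table, bronze_rows):
    """Merge cleaned rows by key; the row with the latest order_ts wins."""
    for raw in bronze_rows:
        row = clean_order(raw)
        if row is None:
            continue  # invalid rows are filtered out of silver
        current = silver_table.get(row["order_id"])
        if current is None or row["order_ts"] >= current["order_ts"]:
            silver_table[row["order_id"]] = row
    return silver_table


bronze_rows = [
    {"order_id": "1", "amount": "19.99", "order_ts": "2024-01-05T10:00:00"},
    {"order_id": "1", "amount": "19.99", "order_ts": "2024-01-05T10:00:00"},  # duplicate
    {"order_id": "2", "amount": "not-a-number", "order_ts": "2024-01-05T11:00:00"},  # invalid
    {"order_id": "1", "amount": "24.99", "order_ts": "2024-01-06T09:00:00"},  # later change
]
silver_orders = upsert_silver({}, bronze_rows)
```

The result is still row-level: one typed row per order, with the duplicate collapsed, the malformed record excluded, and the later change applied.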
Gold Layer: Business-Ready Aggregates
The gold layer contains highly refined, aggregated data modeled for a specific business purpose. Where silver is general-purpose, gold is purpose-built: a daily-revenue-by-region table, a churn-feature table for an ML model, or a star-schema fact/dimension set for a BI tool. Gold tables apply the heavy GROUP BY, window functions, and joins that power dashboards, so the BI layer reads pre-computed results instead of scanning raw data.
A single silver table often feeds many gold tables, each shaped for one consumer. Because gold is derived, it can be rebuilt from silver at any time, which is why materialized views are a natural fit for gold in declarative pipelines.
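The fan-out from one silver table to several gold tables can be sketched as follows. This is plain Python standing in for the GROUP BY queries a real pipeline would run. The sample rows, the `region` and `order_date` columns, and the `build_*` function names are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Hypothetical row-level silver data (typed and deduplicated upstream).
silver_orders = [
    {"order_id": 1, "region": "EMEA", "order_date": date(2024, 1, 5), "amount": Decimal("19.99")},
    {"order_id": 2, "region": "EMEA", "order_date": date(2024, 1, 5), "amount": Decimal("5.01")},
    {"order_id": 3, "region": "APAC", "order_date": date(2024, 1, 5), "amount": Decimal("40.00")},
    {"order_id": 4, "region": "EMEA", "order_date": date(2024, 1, 6), "amount": Decimal("10.00")},
]


def build_daily_revenue(rows):
    """Gold table 1: total revenue per (order_date, region)."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        totals[(row["order_date"], row["region"])] += row["amount"]
    return dict(totals)


def build_region_order_counts(rows):
    """Gold table 2: number of orders per region."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["region"]] += 1
    return dict(counts)


# Both gold tables are derived purely from silver, so they can be
# dropped and rebuilt at any time with the same result.
gold_daily_revenue = build_daily_revenue(silver_orders)
gold_region_order_counts = build_region_order_counts(silver_orders)
```

Because each builder is a pure function of the silver rows, rerunning it reproduces the gold table exactly, which is the property that makes a recomputed view a reasonable fit for this layer.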
Why the layering matters on the exam
- Quality increases left to right — never the reverse. You don't write "cleaned" data back into bronze.
- Each hop is reproducible — you can truncate and rebuild silver/gold from the upstream layer.
- Batch and streaming both fit — a bronze table can be a streaming table fed by Auto Loader while gold is a batch-refreshed materialized view.
- Don't skip layers. Answer choices that propose dashboards reading directly from bronze violate the pattern's intent, because they bypass the cleansing and aggregation that silver and gold exist to provide.
How the Layers Interact in Practice
A realistic lakehouse rarely has just three tables; it has many bronze ingests, several conformed silver entities, and dozens of gold marts. The medallion pattern keeps this manageable because each table has a single, well-understood quality contract. Consider a retail example:
- Bronze holds bronze_orders, bronze_inventory, and bronze_clickstream, each an append-only landing table fed by Auto Loader from a different source.
- Silver produces silver_orders (typed, deduplicated, with currency normalized) and silver_customers (a conformed view joining CRM and billing sources). Silver is the layer most users actually query for ad-hoc analysis.
- Gold builds gold_daily_revenue, gold_customer_360, and a gold_churn_features table for the ML team, each derived from silver.
Common exam misconceptions
- Gold is not always the smallest. It can be large if it stores fine-grained features; "gold" denotes purpose-built and curated, not small.
- Silver is row-level, not aggregated. Aggregation is gold's job; a silver table that is already grouped is a design smell.
- Bronze is not throwaway. It is the durable raw record that makes the whole pipeline replayable, so it is retained, not discarded after silver loads.
- The medallion pattern is a convention, not a Databricks-enforced feature — Delta tables are just Delta tables; the layering is how you organize them. The exam still expects you to apply the convention correctly.
A pipeline ingests raw JSON click events directly from cloud storage and writes them, untransformed, to a Delta table with an added ingestion-timestamp column. Which medallion layer is this table?
Which transformation is MOST characteristic of moving data from the silver layer to the gold layer?
Why is the bronze layer typically kept append-only and minimally transformed?