3.1 The Medallion Architecture (Bronze, Silver, Gold)
Key Takeaways
- The medallion architecture organizes data into three progressive layers: bronze (raw), silver (cleansed), and gold (business-ready).
- Bronze tables store raw data exactly as ingested from source systems, preserving the original format for reprocessing and auditing.
- Silver tables contain validated, cleansed, deduplicated, and enriched data that provides an enterprise-wide view of business entities.
- Gold tables contain highly aggregated and curated data optimized for specific business use cases, dashboards, and ML features.
- Each layer incrementally improves data quality, and the architecture supports both batch and streaming processing patterns.
Quick Answer: The medallion architecture organizes Lakehouse data into three layers: Bronze (raw ingestion), Silver (cleansed and enriched), and Gold (aggregated, business-ready). Data quality improves progressively through each layer, with each hop adding validation, transformation, and business logic.
Architecture Overview
The medallion architecture (also called multi-hop architecture) is the recommended data organization pattern for the Databricks Lakehouse:
| Layer | Purpose | Data Quality | Typical Users |
|---|---|---|---|
| Bronze | Raw data ingestion | Lowest — as-is from source | Data engineers |
| Silver | Cleansed, validated, enriched | Medium — enterprise-ready | Data engineers, analysts |
| Gold | Aggregated, business-ready | Highest — curated for use case | Analysts, executives, ML engineers |
Bronze Layer (Raw)
The bronze layer is the landing zone for all raw data:
Characteristics
- Data is stored exactly as received from source systems
- Minimal or no transformations applied
- Preserves the full fidelity of source data
- Serves as the single source of truth for reprocessing
- Metadata columns added: ingestion timestamp, source file, batch ID
Implementation Pattern
-- Bronze table: raw data with metadata
CREATE TABLE catalog.bronze.raw_orders (
-- Raw data columns (schema from source)
order_data STRING, -- Raw JSON string
-- Metadata columns
_ingest_timestamp TIMESTAMP,
_source_file STRING,
_rescued_data STRING
);
-- Ingest with Auto Loader
-- (typically done via PySpark streaming)
# Bronze ingestion with Auto Loader
from pyspark.sql.functions import current_timestamp, input_file_name
(spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/bronze/orders/schema")
.load("/data/raw/orders/")
.withColumn("_ingest_timestamp", current_timestamp())
.withColumn("_source_file", input_file_name())
.writeStream
.option("checkpointLocation", "/checkpoints/bronze/orders")
.trigger(availableNow=True)
.toTable("catalog.bronze.raw_orders")
)
What Goes in Bronze?
- Raw files from cloud storage (S3, ADLS, GCS)
- CDC (Change Data Capture) events from databases
- Streaming events from Kafka, Event Hubs, Kinesis
- API responses stored as JSON strings
- Log files and telemetry data
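Because bronze stores payloads verbatim, a bronze row is essentially the raw payload plus a few metadata columns. A minimal plain-Python sketch of that shape (the `to_bronze_record` helper and the bucket path are hypothetical, for illustration only):

```python
import json
from datetime import datetime, timezone

def to_bronze_record(raw_payload: str, source_file: str) -> dict:
    """Wrap a raw source payload with ingestion metadata, leaving the payload untouched."""
    return {
        "order_data": raw_payload,  # raw JSON kept verbatim for reprocessing/auditing
        "_ingest_timestamp": datetime.now(timezone.utc).isoformat(),
        "_source_file": source_file,
    }

# Example: an API response lands in bronze exactly as received
api_response = '{"order_id": 1001, "quantity": 2, "status": "NEW"}'
record = to_bronze_record(api_response, "s3://landing/orders/2024-06-01.json")

# Nothing was parsed or cleaned; the original string is fully recoverable
assert record["order_data"] == api_response
assert json.loads(record["order_data"])["order_id"] == 1001
```

Keeping the payload as an unparsed string is the design choice that makes bronze the single source of truth: any parsing bug downstream can be fixed and replayed.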
Silver Layer (Cleansed)
The silver layer provides a validated, enriched enterprise view of the data:
Transformations Applied
| Transformation | Description | Example |
|---|---|---|
| Schema enforcement | Apply a defined schema to raw data | Parse JSON into typed columns |
| Data cleansing | Fix or remove invalid records | Remove nulls, fix data types |
| Deduplication | Remove duplicate records | MERGE with dedup logic |
| Normalization | Standardize field values | Uppercase state codes, trim whitespace |
| Enrichment | Join with reference data | Add customer name from customer table |
| Filtering | Remove irrelevant records | Drop test/internal records |
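The deduplication, cleansing, and normalization rows in the table above can be sketched in plain Python (no Spark required; `latest_per_key` and `cleanse` are hypothetical helpers that mirror the ROW_NUMBER-based dedup used in the SQL implementation pattern):

```python
def latest_per_key(records, key, order_by):
    """Deduplicate: keep only the most recent record per key (like ROW_NUMBER ... WHERE rn = 1)."""
    best = {}
    for r in records:
        k = r[key]
        if k not in best or r[order_by] > best[k][order_by]:
            best[k] = r
    return list(best.values())

def cleanse(record):
    """Cleanse and normalize one record; return None to drop invalid rows."""
    if record.get("order_id") is None:
        return None  # data cleansing: drop records with missing keys
    return {
        "order_id": int(record["order_id"]),                      # schema enforcement
        "order_status": record["order_status"].strip().upper(),   # normalization
    }

bronze = [
    {"order_id": 1,    "order_status": " new ",   "_ingest_timestamp": "2024-06-01T10:00:00"},
    {"order_id": 1,    "order_status": "shipped", "_ingest_timestamp": "2024-06-02T09:00:00"},
    {"order_id": None, "order_status": "test",    "_ingest_timestamp": "2024-06-02T09:05:00"},
]

silver = [c for c in (cleanse(r) for r in latest_per_key(bronze, "order_id", "_ingest_timestamp")) if c]
assert silver == [{"order_id": 1, "order_status": "SHIPPED"}]
```

Three bronze rows become one silver row: the duplicate keeps only its latest version, and the invalid record is dropped.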
Implementation Pattern
-- Silver table: cleansed and typed
CREATE TABLE catalog.silver.orders (
order_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(10,2),
order_date DATE,
order_status STRING,
updated_at TIMESTAMP
);
-- Upsert from bronze to silver: parse the raw JSON, deduplicate,
-- and exclude bronze metadata columns from the final column list
MERGE INTO catalog.silver.orders AS target
USING (
  SELECT order_id, customer_id, product_id, quantity, unit_price,
         total_amount, order_date, order_status, updated_at
  FROM (
    -- Deduplicate: keep the latest record per order_id
    SELECT parsed.*,
           ROW_NUMBER() OVER (PARTITION BY parsed.order_id
                              ORDER BY _ingest_timestamp DESC) AS rn
    FROM (
      SELECT from_json(order_data,
        'order_id BIGINT, customer_id BIGINT, product_id BIGINT, quantity INT,
         unit_price DECIMAL(10,2), total_amount DECIMAL(10,2), order_date DATE,
         order_status STRING, updated_at TIMESTAMP') AS parsed,
        _ingest_timestamp
      FROM catalog.bronze.raw_orders
    )
  )
  WHERE rn = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Gold Layer (Business-Ready)
The gold layer contains curated, aggregated datasets for specific business use cases:
Characteristics
- Data is organized by business domain or use case (not by source system)
- Contains pre-computed aggregations, KPIs, and metrics
- Optimized for query performance (partitioned, Z-ordered, or liquid-clustered)
- Directly consumed by BI dashboards, reports, and ML models
Implementation Pattern
-- Gold table: daily sales summary (recomputed on a schedule)
CREATE OR REPLACE TABLE catalog.gold.daily_sales_summary AS
SELECT
o.order_date,
p.product_category,
c.customer_segment,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(o.total_amount) AS total_revenue,
AVG(o.total_amount) AS avg_order_value,
COUNT(DISTINCT o.customer_id) AS unique_customers
FROM catalog.silver.orders o
JOIN catalog.silver.products p ON o.product_id = p.product_id
JOIN catalog.silver.customers c ON o.customer_id = c.customer_id
GROUP BY o.order_date, p.product_category, c.customer_segment;
Design Principles
ELT, Not ETL
In the Lakehouse, the approach is ELT (Extract, Load, Transform):
- Extract data from sources and Load it into Bronze (minimal transformation)
- Transform progressively from Bronze → Silver → Gold
- Complex business logic is applied after data is in the Lakehouse, not before
Why Multi-Hop?
- Reprocessing: If a transformation bug is found, fix the Silver logic and rerun from Bronze
- Traceability: Full lineage from raw source to business metric
- Decoupled concerns: Ingestion team manages Bronze; analytics team manages Gold
- Incremental processing: Each hop can process only new or changed data
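The reprocessing benefit above can be made concrete with a minimal sketch: because bronze keeps the raw payloads, a buggy silver transformation can be fixed and replayed without touching the source system (`silver_v1` and `silver_v2` are hypothetical stand-ins for two versions of the hop logic):

```python
import json

# Bronze retains the raw inputs, so they can be replayed at any time
bronze = ['{"order_id": 1, "amount": "10.50"}', '{"order_id": 2, "amount": "7.25"}']

def silver_v1(raw):
    # Buggy first version: truncates amounts to whole units
    r = json.loads(raw)
    return {"order_id": r["order_id"], "amount": int(float(r["amount"]))}

def silver_v2(raw):
    # Fixed version: keeps full precision
    r = json.loads(raw)
    return {"order_id": r["order_id"], "amount": float(r["amount"])}

assert [silver_v1(r)["amount"] for r in bronze] == [10, 7]
# After the fix, simply rerun the hop from bronze, with no re-ingestion from the source
assert [silver_v2(r)["amount"] for r in bronze] == [10.5, 7.25]
```

With an ETL pipeline that transformed data before loading, the truncated amounts would be unrecoverable; here the fix is just a rerun.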
On the Exam: The medallion architecture is a foundational concept. Expect questions about which layer is appropriate for a given task, what transformations happen at each layer, and why the architecture uses progressive quality improvement.
Practice Questions
- A data engineer needs to store raw API responses as JSON strings with ingestion metadata. Which medallion layer should this data be stored in?
- Which transformations are typically applied when moving data from bronze to silver?
- Why does the Lakehouse follow an ELT approach instead of ETL?