3.6 Implementing a Multi-Hop Pipeline
Key Takeaways
- A multi-hop pipeline implements the medallion architecture as one declarative pipeline of bronze, silver, and gold datasets with automatic dependency resolution.
- Bronze ingests raw files with Auto Loader via read_files (SQL) or cloudFiles (Python) into an append-only streaming table.
- Silver applies expectations, type casting, deduplication, and STREAM() reads to produce a cleansed streaming table.
- Gold builds business aggregations and KPIs as materialized views using JOINs and GROUP BY over silver.
- The whole bronze-to-gold flow is declared once and executed as a single managed unit, with quality metrics in the event log.
Putting Medallion + Declarative Pipelines Together
A multi-hop pipeline is the medallion architecture expressed as a single Lakeflow Declarative Pipeline. Instead of three separate jobs, you declare bronze, silver, and gold datasets in one pipeline and let the framework resolve their order and run them as a managed unit. This is the most-tested end-to-end scenario in this domain, so know each hop's responsibilities.
Bronze: Raw Ingestion with Auto Loader
The bronze hop ingests raw files incrementally. In SQL you use read_files (Auto Loader under the hood); in Python you use cloudFiles. Bronze is an append-only streaming table that keeps the raw record plus ingestion metadata.
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT *, current_timestamp() AS ingest_ts, _metadata.file_path AS src_file
FROM STREAM read_files('/landing/orders', format => 'json');
Because Auto Loader checkpoints which files it has consumed, each run picks up only newly arrived files — true incremental ingestion with no manual file bookkeeping.
Silver: Cleanse, Enforce, Deduplicate
The silver hop reads from bronze using STREAM(bronze_orders) and applies quality and shaping logic: expectations to drop or fail on bad rows, explicit type casts, null handling, and deduplication.
CREATE OR REFRESH STREAMING TABLE silver_orders (
CONSTRAINT valid_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT valid_amt EXPECT (amount > 0)
)
AS SELECT
order_id,
CAST(amount AS DECIMAL(10,2)) AS amount,
to_timestamp(order_ts) AS order_ts,
customer_id
FROM STREAM(bronze_orders);
Silver is where the data becomes analytics-grade: conformed, typed, validated, and deduplicated. Note the use of STREAM(...) to declare a streaming read from an upstream pipeline table — a frequent syntax point on the exam.
Gold: Business Aggregations
The gold hop produces curated metrics as materialized views, joining and aggregating silver into business KPIs.
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_revenue
AS SELECT
date(order_ts) AS day,
customer_region,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM silver_orders s
JOIN dim_customers c ON s.customer_id = c.id
GROUP BY date(order_ts), customer_region;
One Declared Unit, Automatically Orchestrated
You define all three datasets in the same pipeline and never specify run order. The framework reads the queries, sees silver_orders depends on bronze_orders and gold_daily_revenue depends on silver_orders, and constructs the DAG:
read_files (Auto Loader)
|
bronze_orders (streaming table, append-only)
|
silver_orders (streaming table, expectations + dedupe + casts)
|
gold_daily_revenue (materialized view, JOIN + GROUP BY)
| Hop | Dataset type | Key operations |
|---|---|---|
| Bronze | Streaming table | Auto Loader ingest, raw + metadata |
| Silver | Streaming table | Expectations, casting, dedupe, STREAM() read |
| Gold | Materialized view | JOIN, GROUP BY, KPIs |
Because it is one managed pipeline, you get unified observability: the event log records lineage, data-quality expectation metrics, and run history across all hops. Triggered mode refreshes the whole chain on a schedule; continuous mode keeps it updating in near real time. The same pattern scales by simply adding more silver or gold datasets that read from existing tables.
Operational Benefits and a Worked Walkthrough
Building the multi-hop flow as one declarative pipeline (rather than three hand-orchestrated jobs) delivers concrete advantages the exam emphasizes:
- Single source of truth for orchestration — no external scheduler wiring bronze to silver to gold; the DAG is derived from the SQL.
- Built-in data quality — expectations at the silver hop quarantine or reject bad rows and surface metrics in the event log.
- Automatic recovery — in production mode the framework retries transient failures and restarts compute.
- Lineage and observability — you can trace a gold metric back through silver to the exact bronze file that produced it.
End-to-end flow for an orders pipeline
- Files land in
/landing/ordersas JSON throughout the day. - Bronze (
bronze_orders, streaming table) usesread_filesto ingest only new files, addingingest_tsandsrc_file. - Silver (
silver_orders, streaming table) readsSTREAM(bronze_orders), drops rows failingorder_id IS NOT NULL, castsamountto DECIMAL, and deduplicates by keeping the latest event perorder_id. - Gold (
gold_daily_revenue, materialized view) joinssilver_ordersto a customer dimension and aggregates revenue by day and region.
Run the pipeline in triggered mode each morning and it refreshes the entire chain incrementally; switch to continuous for near-real-time dashboards. Adding a second gold mart later requires only one more CREATE OR REFRESH MATERIALIZED VIEW that reads silver — the framework slots it into the DAG with no other changes.
Design Decisions Within a Multi-Hop Pipeline
Several implementation choices recur in exam scenarios, and each maps to a specific layer:
| Decision | Where it belongs | Why |
|---|---|---|
| Ingest new files only | Bronze | Auto Loader / read_files track processed files |
| Reject or quarantine bad rows | Silver | Expectations with ON VIOLATION DROP/FAIL |
| Deduplicate by business key | Silver | ROW_NUMBER + STREAM read before downstream |
| Type casting and normalization | Silver | Produce conformed, analytics-ready schema |
| Aggregate to KPIs | Gold | Materialized view with GROUP BY |
| Join to dimensions | Gold (or silver enrich) | Combine conformed entities for reporting |
Streaming-table vs materialized-view placement
Within the same pipeline you mix dataset types deliberately. Bronze and silver are usually streaming tables because they process appends incrementally and cheaply. Gold is usually a materialized view because business metrics must reflect the current full state of silver — including any corrections that flowed through as updates. A common mistake is making a gold aggregation a streaming table; an append-only streaming aggregate cannot easily recompute totals when upstream rows change, whereas a materialized view simply re-derives the result.
Matching dataset type to the layer's refresh semantics is exactly what the multi-hop implementation objective tests.
In a SQL declarative multi-hop pipeline, which function is used in the BRONZE table to incrementally ingest raw files from cloud storage with Auto Loader?
Which dataset type and operation are MOST appropriate for the gold table in a multi-hop pipeline?
How are the bronze, silver, and gold tables in a single declarative multi-hop pipeline scheduled relative to one another?
Which operational benefit do you get from building the bronze-silver-gold flow as a single declarative pipeline rather than three separately scheduled jobs?