3.6 Implementing a Multi-Hop Pipeline
Key Takeaways
- A multi-hop pipeline implements the medallion architecture using Lakeflow Declarative Pipelines with bronze, silver, and gold streaming tables and materialized views.
- Bronze datasets use Auto Loader (read_files or cloudFiles) to ingest raw data from cloud storage.
- Silver datasets apply schema enforcement, data quality expectations, deduplication, and type casting.
- Gold datasets create business-level aggregations and KPIs using materialized views with JOINs and GROUP BY.
- The entire pipeline is defined declaratively and executed as a single unit with automatic dependency resolution.
Implementing a Multi-Hop Pipeline
Quick Answer: A multi-hop pipeline chains Lakeflow datasets through bronze (raw ingestion), silver (cleansed), and gold (aggregated) layers. Each layer is a streaming table or materialized view with appropriate expectations and transformations.
Complete Pipeline Example (SQL)
-- ===== BRONZE LAYER: Raw Ingestion =====
-- Ingest raw order events from cloud storage
CREATE OR REFRESH STREAMING TABLE bronze_orders
COMMENT 'Raw order events from cloud storage'
AS SELECT
*,
current_timestamp() AS _ingest_timestamp,
_metadata.file_path AS _source_file
FROM STREAM read_files(
'/data/raw/orders/',
format => 'json'
);
-- Ingest customer reference data
CREATE OR REFRESH STREAMING TABLE bronze_customers
AS SELECT * FROM STREAM read_files('/data/raw/customers/', format => 'csv', header => true);
-- ===== SILVER LAYER: Cleansed and Validated =====
-- Cleansed orders with data quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
CONSTRAINT valid_date EXPECT (order_date IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT 'Cleansed and validated order data'
AS SELECT
CAST(order_id AS BIGINT) AS order_id,
CAST(customer_id AS BIGINT) AS customer_id,
CAST(product_id AS BIGINT) AS product_id,
CAST(quantity AS INT) AS quantity,
CAST(amount AS DECIMAL(10,2)) AS amount,
CAST(order_date AS DATE) AS order_date,
trim(upper(status)) AS order_status,
_ingest_timestamp
FROM STREAM(LIVE.bronze_orders);
-- Cleansed customers with deduplication
CREATE OR REFRESH MATERIALIZED VIEW silver_customers (
CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * EXCEPT (rn) FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM LIVE.bronze_customers
) WHERE rn = 1;
-- ===== GOLD LAYER: Business-Ready Aggregations =====
-- Daily sales summary
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_sales
COMMENT 'Daily sales KPIs for executive dashboard'
AS SELECT
o.order_date,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.amount) AS total_revenue,
AVG(o.amount) AS avg_order_value,
SUM(o.quantity) AS total_items_sold
FROM LIVE.silver_orders o
GROUP BY o.order_date;
-- Customer lifetime value
CREATE OR REFRESH MATERIALIZED VIEW gold_customer_ltv
COMMENT 'Customer lifetime value for marketing'
AS SELECT
c.customer_id,
c.customer_name,
c.segment,
COUNT(o.order_id) AS total_orders,
SUM(o.amount) AS lifetime_value,
AVG(o.amount) AS avg_order_value,
MIN(o.order_date) AS first_order_date,
MAX(o.order_date) AS last_order_date,
datediff(MAX(o.order_date), MIN(o.order_date)) AS customer_tenure_days
FROM LIVE.silver_customers c
LEFT JOIN LIVE.silver_orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.segment;
Complete Pipeline Example (Python)
import dlt
from pyspark.sql.functions import *

# ===== BRONZE LAYER =====
@dlt.table(comment="Raw orders from cloud storage")
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/checkpoints/bronze/orders/schema")
        .load("/data/raw/orders/")
        .withColumn("_ingest_timestamp", current_timestamp())
        # Capture the source file via the _metadata column
        # (input_file_name() is not supported with Unity Catalog)
        .withColumn("_source_file", col("_metadata.file_path"))
    )
# ===== SILVER LAYER =====
@dlt.table(comment="Cleansed orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def silver_orders():
    return (dlt.read_stream("bronze_orders")
        .select(
            col("order_id").cast("bigint").alias("order_id"),
            col("customer_id").cast("bigint").alias("customer_id"),
            col("amount").cast("decimal(10,2)").alias("amount"),
            to_date("order_date").alias("order_date"),
            upper(trim("status")).alias("order_status"),
            "_ingest_timestamp"
        )
    )
# ===== GOLD LAYER =====
@dlt.table(comment="Daily sales KPIs")
def gold_daily_sales():
    return (dlt.read("silver_orders")
        .groupBy("order_date")
        .agg(
            countDistinct("order_id").alias("total_orders"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value")
        )
    )
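The Python example above stops at the daily sales table. As a rough sketch only, the remaining two SQL datasets, silver_customers and gold_customer_ltv, could be expressed with the Python API along the following lines; it assumes a bronze_customers table has also been defined in Python, analogous to the SQL version.
# Sketch: Python equivalents of silver_customers and gold_customer_ltv from the SQL example.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

@dlt.table(comment="Cleansed customers, deduplicated to the latest record per customer_id")
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
def silver_customers():
    # Keep only the most recent row per customer, mirroring the SQL ROW_NUMBER() pattern
    latest = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
    return (dlt.read("bronze_customers")  # assumes bronze_customers is defined as in the SQL example
        .withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .drop("rn")
    )

@dlt.table(comment="Customer lifetime value for marketing")
def gold_customer_ltv():
    return (dlt.read("silver_customers").alias("c")
        .join(dlt.read("silver_orders").alias("o"), "customer_id", "left")
        .groupBy("customer_id", "customer_name", "segment")
        .agg(
            F.count("o.order_id").alias("total_orders"),
            F.sum("o.amount").alias("lifetime_value"),
            F.avg("o.amount").alias("avg_order_value"),
            F.min("o.order_date").alias("first_order_date"),
            F.max("o.order_date").alias("last_order_date"),
            F.datediff(F.max("o.order_date"), F.min("o.order_date")).alias("customer_tenure_days")
        )
    )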
Pipeline Configuration
When creating a Lakeflow Declarative Pipeline in the Databricks UI:
| Setting | Purpose |
|---|---|
| Target catalog and schema | Where tables are created in Unity Catalog |
| Storage location | Where pipeline data and checkpoints are stored |
| Pipeline mode | Triggered (on-demand) or Continuous |
| Cluster mode | Fixed size or enhanced autoscaling |
| Photon | Enable for faster SQL execution |
| Channel | Current (stable) or Preview (latest features) |
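These UI settings correspond to fields in the pipeline's JSON specification. The sketch below is illustrative only: the pipeline name, catalog, schema, and notebook path are hypothetical, and the field names should be verified against the Pipelines API documentation for your workspace. It is shown as a Python dict for readability.
# Illustrative pipeline settings (hypothetical names and paths; verify field names in your workspace)
pipeline_settings = {
    "name": "orders_medallion_pipeline",
    "catalog": "main",                 # target Unity Catalog catalog
    "target": "sales",                 # target schema for pipeline tables
    "continuous": False,               # triggered (on-demand) mode
    "photon": True,
    "channel": "CURRENT",              # or "PREVIEW" for the latest features
    "libraries": [{"notebook": {"path": "/Pipelines/orders_multi_hop"}}],
    "clusters": [{
        "label": "default",
        "autoscale": {"min_workers": 1, "max_workers": 4, "mode": "ENHANCED"}
    }]
}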
On the Exam: Be able to trace data flow through a multi-hop pipeline. Know which transformations apply at each layer and why certain dataset types (streaming table vs. materialized view) are chosen for each layer.
In a multi-hop pipeline, which dataset type is most appropriate for the bronze layer that ingests raw files from cloud storage?
Why is a materialized view typically used for gold-layer aggregations instead of a streaming table?