3.6 Implementing a Multi-Hop Pipeline

Key Takeaways

  • A multi-hop pipeline implements the medallion architecture using Lakeflow Declarative Pipelines with bronze, silver, and gold streaming tables and materialized views.
  • Bronze datasets use Auto Loader (read_files or cloudFiles) to ingest raw data from cloud storage.
  • Silver datasets apply schema enforcement, data quality expectations, deduplication, and type casting.
  • Gold datasets create business-level aggregations and KPIs using materialized views with JOINs and GROUP BY.
  • The entire pipeline is defined declaratively and executed as a single unit with automatic dependency resolution.
Last updated: March 2026

Implementing a Multi-Hop Pipeline

Quick Answer: A multi-hop pipeline chains Lakeflow datasets through bronze (raw ingestion), silver (cleansed), and gold (aggregated) layers. Each layer is a streaming table or materialized view with appropriate expectations and transformations.

Complete Pipeline Example (SQL)

-- ===== BRONZE LAYER: Raw Ingestion =====

-- Ingest raw order events from cloud storage
CREATE OR REFRESH STREAMING TABLE bronze_orders
COMMENT 'Raw order events from cloud storage'
AS SELECT
    *,
    current_timestamp() AS _ingest_timestamp,
    _metadata.file_path AS _source_file
FROM STREAM read_files(
    '/data/raw/orders/',
    format => 'json'
);

-- Ingest customer reference data
CREATE OR REFRESH STREAMING TABLE bronze_customers
AS SELECT * FROM STREAM read_files('/data/raw/customers/', format => 'csv', header => true);

-- ===== SILVER LAYER: Cleansed and Validated =====

-- Cleansed orders with data quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
    CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
    CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
    CONSTRAINT valid_date EXPECT (order_date IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT 'Cleansed and validated order data'
AS SELECT
    CAST(order_id AS BIGINT) AS order_id,
    CAST(customer_id AS BIGINT) AS customer_id,
    CAST(product_id AS BIGINT) AS product_id,
    CAST(quantity AS INT) AS quantity,
    CAST(amount AS DECIMAL(10,2)) AS amount,
    CAST(order_date AS DATE) AS order_date,
    trim(upper(status)) AS order_status,
    _ingest_timestamp
FROM STREAM(LIVE.bronze_orders);

-- Cleansed customers with deduplication
CREATE OR REFRESH MATERIALIZED VIEW silver_customers (
    CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * EXCEPT (rn) FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
    FROM LIVE.bronze_customers
) WHERE rn = 1;
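
The `ROW_NUMBER ... WHERE rn = 1` pattern keeps only the most recent record per `customer_id`. The same keep-latest-row-per-key logic, sketched in plain Python on hypothetical sample rows (field names mirror the SQL above, the data is illustrative only):

```python
# Keep-latest-row-per-key deduplication, mirroring the SQL pattern:
# partition by customer_id, order by updated_at DESC, keep rn = 1.
rows = [  # hypothetical sample data
    {"customer_id": 1, "name": "Ada",    "updated_at": "2026-01-05"},
    {"customer_id": 1, "name": "Ada L.", "updated_at": "2026-02-01"},
    {"customer_id": 2, "name": "Grace",  "updated_at": "2026-01-10"},
]

latest = {}
for row in rows:
    key = row["customer_id"]
    # A later updated_at wins, just like ORDER BY updated_at DESC ... rn = 1
    if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
        latest[key] = row

deduped = sorted(latest.values(), key=lambda r: r["customer_id"])
```

Note that the SQL version runs this as a window function over the whole bronze table on each materialized-view refresh, which is why a materialized view (rather than a streaming table) is the natural fit here.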

-- ===== GOLD LAYER: Business-Ready Aggregations =====

-- Daily sales summary
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_sales
COMMENT 'Daily sales KPIs for executive dashboard'
AS SELECT
    o.order_date,
    COUNT(DISTINCT o.order_id) AS total_orders,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    SUM(o.amount) AS total_revenue,
    AVG(o.amount) AS avg_order_value,
    SUM(o.quantity) AS total_items_sold
FROM LIVE.silver_orders o
GROUP BY o.order_date;

-- Customer lifetime value
CREATE OR REFRESH MATERIALIZED VIEW gold_customer_ltv
COMMENT 'Customer lifetime value for marketing'
AS SELECT
    c.customer_id,
    c.customer_name,
    c.segment,
    COUNT(o.order_id) AS total_orders,
    SUM(o.amount) AS lifetime_value,
    AVG(o.amount) AS avg_order_value,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date,
    datediff(MAX(o.order_date), MIN(o.order_date)) AS customer_tenure_days
FROM LIVE.silver_customers c
LEFT JOIN LIVE.silver_orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.segment;
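
To make the gold-layer aggregation concrete, here is what `gold_daily_sales` computes per date, sketched in plain Python over a few hypothetical silver rows (illustrative data, not part of the pipeline):

```python
from collections import defaultdict

# Hypothetical silver_orders rows that passed the expectations
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 50.0, "quantity": 2, "order_date": "2026-03-01"},
    {"order_id": 2, "customer_id": 11, "amount": 30.0, "quantity": 1, "order_date": "2026-03-01"},
    {"order_id": 3, "customer_id": 10, "amount": 20.0, "quantity": 1, "order_date": "2026-03-02"},
]

# Accumulate per-date sets and sums, mirroring COUNT(DISTINCT ...), SUM, AVG
daily = defaultdict(lambda: {"order_ids": set(), "customers": set(), "revenue": 0.0, "items": 0})
for o in orders:
    d = daily[o["order_date"]]
    d["order_ids"].add(o["order_id"])
    d["customers"].add(o["customer_id"])
    d["revenue"] += o["amount"]
    d["items"] += o["quantity"]

summary = {
    date: {
        "total_orders": len(d["order_ids"]),
        "unique_customers": len(d["customers"]),
        "total_revenue": d["revenue"],
        "avg_order_value": d["revenue"] / len(d["order_ids"]),
        "total_items_sold": d["items"],
    }
    for date, d in daily.items()
}
```

Rows dropped by the silver expectations (e.g. `amount <= 0`) never reach this step, so gold KPIs are computed only over validated data.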

Complete Pipeline Example (Python)

import dlt
from pyspark.sql.functions import *

# ===== BRONZE LAYER =====

@dlt.table(comment="Raw orders from cloud storage")
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/checkpoints/bronze/orders/schema")
        .load("/data/raw/orders/")
        .withColumn("_ingest_timestamp", current_timestamp())
        .withColumn("_source_file", col("_metadata.file_path"))
    )

# ===== SILVER LAYER =====

@dlt.table(comment="Cleansed orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def silver_orders():
    return (dlt.read_stream("bronze_orders")
        .select(
            col("order_id").cast("bigint"),
            col("customer_id").cast("bigint"),
            col("amount").cast("decimal(10,2)"),
            to_date("order_date").alias("order_date"),
            upper(trim("status")).alias("order_status"),
            "_ingest_timestamp"
        )
    )

# ===== GOLD LAYER =====

@dlt.table(comment="Daily sales KPIs")
def gold_daily_sales():
    return (dlt.read("silver_orders")
        .groupBy("order_date")
        .agg(
            countDistinct("order_id").alias("total_orders"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value")
        )
    )

Pipeline Configuration

When creating a Lakeflow Declarative Pipeline in the Databricks UI:

| Setting | Purpose |
| --- | --- |
| Target catalog and schema | Where tables are created in Unity Catalog |
| Storage location | Where pipeline data and checkpoints are stored |
| Pipeline mode | Triggered (on-demand) or Continuous |
| Cluster mode | Fixed size or enhanced autoscaling |
| Photon | Enable for faster SQL execution |
| Channel | Current (stable) or Preview (latest features) |
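
These UI settings map to fields in the pipeline's JSON specification (as used by the Databricks Pipelines REST API). A minimal sketch for illustration; the name, catalog/schema, and notebook path are placeholders, and the exact set of supported fields varies by workspace and API version:

```json
{
  "name": "orders_medallion",
  "catalog": "main",
  "target": "sales",
  "continuous": false,
  "photon": true,
  "channel": "CURRENT",
  "libraries": [
    {"notebook": {"path": "/Pipelines/orders_medallion"}}
  ],
  "clusters": [
    {
      "label": "default",
      "autoscale": {"min_workers": 1, "max_workers": 4, "mode": "ENHANCED"}
    }
  ]
}
```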

On the Exam: Be able to trace data flow through a multi-hop pipeline. Know which transformations apply at each layer and why certain dataset types (streaming table vs. materialized view) are chosen for each layer.
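
Automatic dependency resolution means the engine builds a DAG from dataset references and runs each dataset only after its inputs. A simplified illustration of that ordering for the tables defined above; this is not the actual engine, just a standard topological sort (Python's stdlib `graphlib`) over the same dependency graph:

```python
from graphlib import TopologicalSorter

# Dataset -> the upstream datasets it reads from, as in the SQL example above
deps = {
    "bronze_orders": set(),
    "bronze_customers": set(),
    "silver_orders": {"bronze_orders"},
    "silver_customers": {"bronze_customers"},
    "gold_daily_sales": {"silver_orders"},
    "gold_customer_ltv": {"silver_customers", "silver_orders"},
}

# Any valid execution order places each dataset after all of its inputs
order = list(TopologicalSorter(deps).static_order())
```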

Test Your Knowledge

In a multi-hop pipeline, which dataset type is most appropriate for the bronze layer that ingests raw files from cloud storage?


Why is a materialized view typically used for gold-layer aggregations instead of a streaming table?
