3.6 Implementing a Multi-Hop Pipeline
Key Takeaways
- A multi-hop pipeline implements the medallion architecture using Lakeflow Declarative Pipelines with bronze, silver, and gold streaming tables and materialized views.
- Bronze datasets use Auto Loader (read_files or cloudFiles) to ingest raw data from cloud storage.
- Silver datasets apply schema enforcement, data quality expectations, deduplication, and type casting.
- Gold datasets create business-level aggregations and KPIs using materialized views with JOINs and GROUP BY.
- The entire pipeline is defined declaratively and executed as a single unit with automatic dependency resolution.
Implementing a Multi-Hop Pipeline
Quick Answer: A multi-hop pipeline chains Lakeflow datasets through bronze (raw ingestion), silver (cleansed), and gold (aggregated) layers. Each layer is a streaming table or materialized view with appropriate expectations and transformations.
Complete Pipeline Example (SQL)
-- ===== BRONZE LAYER: Raw Ingestion =====
-- Ingest raw order events from cloud storage
CREATE OR REFRESH STREAMING TABLE bronze_orders
COMMENT 'Raw order events from cloud storage'
AS SELECT
*,
current_timestamp() AS _ingest_timestamp,
_metadata.file_path AS _source_file
FROM STREAM read_files(
'/data/raw/orders/',
format => 'json'
);
-- Ingest customer reference data
CREATE OR REFRESH STREAMING TABLE bronze_customers
AS SELECT * FROM STREAM read_files('/data/raw/customers/', format => 'csv', header => true);
-- ===== SILVER LAYER: Cleansed and Validated =====
-- Cleansed orders with data quality expectations
CREATE OR REFRESH STREAMING TABLE silver_orders (
CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
CONSTRAINT valid_date EXPECT (order_date IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT 'Cleansed and validated order data'
AS SELECT
CAST(order_id AS BIGINT) AS order_id,
CAST(customer_id AS BIGINT) AS customer_id,
CAST(product_id AS BIGINT) AS product_id,
CAST(quantity AS INT) AS quantity,
CAST(amount AS DECIMAL(10,2)) AS amount,
CAST(order_date AS DATE) AS order_date,
trim(upper(status)) AS order_status,
_ingest_timestamp
FROM STREAM(LIVE.bronze_orders);
-- Cleansed customers with deduplication
CREATE OR REFRESH MATERIALIZED VIEW silver_customers (
CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * EXCEPT (rn) FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM LIVE.bronze_customers
) WHERE rn = 1;
-- ===== GOLD LAYER: Business-Ready Aggregations =====
-- Daily sales summary
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_sales
COMMENT 'Daily sales KPIs for executive dashboard'
AS SELECT
o.order_date,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.amount) AS total_revenue,
AVG(o.amount) AS avg_order_value,
SUM(o.quantity) AS total_items_sold
FROM LIVE.silver_orders o
GROUP BY o.order_date;
-- Customer lifetime value
CREATE OR REFRESH MATERIALIZED VIEW gold_customer_ltv
COMMENT 'Customer lifetime value for marketing'
AS SELECT
c.customer_id,
c.customer_name,
c.segment,
COUNT(o.order_id) AS total_orders,
SUM(o.amount) AS lifetime_value,
AVG(o.amount) AS avg_order_value,
MIN(o.order_date) AS first_order_date,
MAX(o.order_date) AS last_order_date,
datediff(MAX(o.order_date), MIN(o.order_date)) AS customer_tenure_days
FROM LIVE.silver_customers c
LEFT JOIN LIVE.silver_orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.segment;
Complete Pipeline Example (Python)
import dlt
from pyspark.sql.functions import *

# ===== BRONZE LAYER =====
@dlt.table(comment="Raw orders from cloud storage")
def bronze_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/checkpoints/bronze/orders/schema")
        .load("/data/raw/orders/")
        .withColumn("_ingest_timestamp", current_timestamp())
        # Capture the source file via the _metadata column
        # (input_file_name() is not supported with Unity Catalog)
        .withColumn("_source_file", col("_metadata.file_path"))
    )
# ===== SILVER LAYER =====
@dlt.table(comment="Cleansed orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def silver_orders():
    return (dlt.read_stream("bronze_orders")
        .select(
            col("order_id").cast("bigint").alias("order_id"),
            col("customer_id").cast("bigint").alias("customer_id"),
            col("amount").cast("decimal(10,2)").alias("amount"),
            to_date("order_date").alias("order_date"),
            upper(trim("status")).alias("order_status"),
            "_ingest_timestamp"
        )
    )
# ===== GOLD LAYER =====
@dlt.table(comment="Daily sales KPIs")
def gold_daily_sales():
    return (dlt.read("silver_orders")
        .groupBy("order_date")
        .agg(
            countDistinct("order_id").alias("total_orders"),
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value")
        )
    )
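The Python example above stops at the daily sales table. As a rough sketch only, the remaining two SQL datasets, silver_customers and gold_customer_ltv, could be expressed with the Python API along the following lines; it assumes a bronze_customers table has also been defined in Python, analogous to the SQL version.
# Sketch: Python equivalents of silver_customers and gold_customer_ltv from the SQL example.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

@dlt.table(comment="Cleansed customers, deduplicated to the latest record per customer_id")
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
def silver_customers():
    # Keep only the most recent row per customer, mirroring the SQL ROW_NUMBER() pattern
    latest = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
    return (dlt.read("bronze_customers")  # assumes bronze_customers is defined as in the SQL example
        .withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .drop("rn")
    )

@dlt.table(comment="Customer lifetime value for marketing")
def gold_customer_ltv():
    return (dlt.read("silver_customers").alias("c")
        .join(dlt.read("silver_orders").alias("o"), "customer_id", "left")
        .groupBy("customer_id", "customer_name", "segment")
        .agg(
            F.count("o.order_id").alias("total_orders"),
            F.sum("o.amount").alias("lifetime_value"),
            F.avg("o.amount").alias("avg_order_value"),
            F.min("o.order_date").alias("first_order_date"),
            F.max("o.order_date").alias("last_order_date"),
            F.datediff(F.max("o.order_date"), F.min("o.order_date")).alias("customer_tenure_days")
        )
    )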
Pipeline Configuration
When creating a Lakeflow Declarative Pipeline in the Databricks UI:
| Setting | Purpose |
|---|---|
| Target catalog and schema | Where tables are created in Unity Catalog |
| Storage location | Where pipeline data and checkpoints are stored |
| Pipeline mode | Triggered (on-demand) or Continuous |
| Cluster mode | Fixed size or enhanced autoscaling |
| Photon | Enable for faster SQL execution |
| Channel | Current (stable) or Preview (latest features) |
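These UI settings correspond to fields in the pipeline's JSON specification. The sketch below is illustrative only: the pipeline name, catalog, schema, and notebook path are hypothetical, and the field names should be verified against the Pipelines API documentation for your workspace. It is shown as a Python dict for readability.
# Illustrative pipeline settings (hypothetical names and paths; verify field names in your workspace)
pipeline_settings = {
    "name": "orders_medallion_pipeline",
    "catalog": "main",                 # target Unity Catalog catalog
    "target": "sales",                 # target schema for pipeline tables
    "continuous": False,               # triggered (on-demand) mode
    "photon": True,
    "channel": "CURRENT",              # or "PREVIEW" for the latest features
    "libraries": [{"notebook": {"path": "/Pipelines/orders_multi_hop"}}],
    "clusters": [{
        "label": "default",
        "autoscale": {"min_workers": 1, "max_workers": 4, "mode": "ENHANCED"}
    }]
}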
On the Exam: Be able to trace data flow through a multi-hop pipeline. Know which transformations apply at each layer and why certain dataset types (streaming table vs. materialized view) are chosen for each layer.
In a multi-hop pipeline, which dataset type is most appropriate for the bronze layer that ingests raw files from cloud storage?
Why is a materialized view typically used for gold-layer aggregations instead of a streaming table?