3.5 Lakeflow Declarative Pipelines (Delta Live Tables)
Key Takeaways
- Lakeflow Declarative Pipelines (formerly Delta Live Tables/DLT) is a framework for building reliable ETL pipelines with automatic dependency management.
- You declare the desired end state of your data (WHAT), and the framework handles orchestration, error handling, and retries (HOW).
- Three dataset types: streaming tables (append-only, incremental), materialized views (full refresh, recomputed), and views (temporary, not persisted).
- Expectations define data quality constraints that can warn, drop invalid records, or fail the pipeline.
- Pipelines automatically resolve dependencies between datasets and determine the optimal execution order.
Lakeflow Declarative Pipelines (Delta Live Tables)
Quick Answer: Lakeflow Declarative Pipelines let you define the desired state of your data transformations (the WHAT), while the framework automatically handles execution order, dependency resolution, error handling, and retries (the HOW). Choose streaming tables for incremental processing, materialized views for complex transformations, and use expectations to enforce data quality.
Declarative vs. Imperative
| Approach | You Define | Framework Handles |
|---|---|---|
| Imperative (traditional) | Step-by-step execution logic | Nothing — you manage everything |
| Declarative (Lakeflow) | The desired end state of each dataset | Execution order, dependencies, retries, error handling |
Dataset Types
Streaming Tables
Best for incremental, append-only workloads:
```sql
-- SQL: Create a streaming table
CREATE OR REFRESH STREAMING TABLE raw_orders
AS SELECT * FROM STREAM read_files('/data/raw/orders/', format => 'json');

-- SQL: Streaming table from another streaming table
CREATE OR REFRESH STREAMING TABLE cleaned_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT
  order_id,
  customer_id,
  CAST(amount AS DECIMAL(10,2)) AS amount,
  CAST(order_date AS DATE) AS order_date
FROM STREAM(LIVE.raw_orders);
```

```python
# Python: Create a streaming table
import dlt

@dlt.table(comment="Raw orders from cloud storage")
def raw_orders():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/raw/orders/")
    )
```
Key properties of streaming tables:
- Each input row is processed exactly once
- New data is appended incrementally
- Ideal for data ingestion from cloud storage, Kafka, etc.
- Uses Structured Streaming under the hood
Materialized Views
Best for complex transformations and aggregations:
```sql
-- SQL: Create a materialized view
CREATE OR REFRESH MATERIALIZED VIEW daily_sales_summary
AS SELECT
  order_date,
  COUNT(DISTINCT order_id) AS total_orders,
  SUM(amount) AS total_revenue,
  AVG(amount) AS avg_order_value
FROM LIVE.cleaned_orders
GROUP BY order_date;
```

```python
# Python: Create a materialized view
from pyspark.sql.functions import avg, countDistinct, sum

@dlt.table(comment="Daily sales aggregation")
def daily_sales_summary():
    return (dlt.read("cleaned_orders")
        .groupBy("order_date")
        .agg(
            countDistinct("order_id").alias("total_orders"),  # matches COUNT(DISTINCT ...) above
            sum("amount").alias("total_revenue"),
            avg("amount").alias("avg_order_value")
        )
    )
```
Key properties of materialized views:
- Results are pre-computed and stored (not computed on each query)
- Refreshed using incremental refresh when possible
- Can use any SQL operation (JOINs, aggregations, window functions)
- Ideal for gold-layer analytics tables
Views (Temporary)
```sql
-- Temporary view: not persisted, used for intermediate transformations
CREATE TEMPORARY STREAMING LIVE VIEW enriched_orders
AS SELECT
  o.*,
  c.customer_name,
  c.segment
FROM STREAM(LIVE.cleaned_orders) o
LEFT JOIN LIVE.customers c ON o.customer_id = c.customer_id;
```
Key properties of views:
- Not persisted — computed on each pipeline run
- Used for intermediate transformations
- Reduce storage costs for transformations that don't need to be materialized
Choosing the Right Dataset Type
| Scenario | Dataset Type |
|---|---|
| Ingest raw files from cloud storage | Streaming Table |
| Append-only event processing | Streaming Table |
| Aggregated daily/weekly summaries | Materialized View |
| JOIN-heavy transformations | Materialized View |
| Intermediate cleanup step (not queried directly) | View |
Expectations (Data Quality)
Expectations define data quality constraints on pipeline datasets:
Expectation Actions
| Action | SQL Syntax | Behavior |
|---|---|---|
| Warn | EXPECT (condition) | Log violation metrics; keep all rows |
| Drop | EXPECT (condition) ON VIOLATION DROP ROW | Remove rows that violate |
| Fail | EXPECT (condition) ON VIOLATION FAIL UPDATE | Stop the pipeline on violation |
Examples
```sql
-- Warn: log violations but keep all rows
CREATE OR REFRESH STREAMING TABLE orders (
  CONSTRAINT valid_amount EXPECT (amount > 0),
  CONSTRAINT valid_date EXPECT (order_date IS NOT NULL)
)
AS SELECT * FROM STREAM(LIVE.raw_orders);

-- Drop: remove invalid rows
CREATE OR REFRESH STREAMING TABLE clean_orders (
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.raw_orders);

-- Fail: stop the pipeline if any violation occurs
CREATE OR REFRESH STREAMING TABLE critical_orders (
  CONSTRAINT required_fields EXPECT (order_id IS NOT NULL AND customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.raw_orders);
```
```python
# Python expectations: decorators stack on a single dataset
@dlt.table
@dlt.expect("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
@dlt.expect_or_fail("required_customer", "customer_id IS NOT NULL")
def cleaned_orders():
    return dlt.read_stream("raw_orders")
```
Expectation Metrics
The pipeline UI shows, per expectation:
- Number of records that passed each expectation
- Number of records that failed each expectation
- Pass rate percentage for each expectation
- Historical trends across pipeline runs
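The pass rate is a simple ratio of these counts. A quick illustration with made-up numbers (the expectation names echo this section's examples; none of this comes from a real pipeline):

```python
def pass_rate(passed, failed):
    """Pass rate percentage for one expectation; an empty batch counts as 100%."""
    total = passed + failed
    return 100.0 * passed / total if total else 100.0

# Hypothetical per-expectation counts, as the pipeline UI would report them
metrics = {
    "valid_order_id": {"passed": 980, "failed": 20},
    "positive_amount": {"passed": 950, "failed": 50},
}
rates = {name: pass_rate(m["passed"], m["failed"]) for name, m in metrics.items()}
# valid_order_id -> 98.0, positive_amount -> 95.0
```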
Automatic Dependency Resolution
Lakeflow automatically determines the execution order based on dataset dependencies:
```sql
-- Pipeline defines three datasets with implicit dependencies:
-- raw_orders → cleaned_orders → daily_summary
CREATE OR REFRESH STREAMING TABLE raw_orders
AS SELECT * FROM STREAM read_files('/data/raw/orders/');

CREATE OR REFRESH STREAMING TABLE cleaned_orders  -- Depends on raw_orders
AS SELECT * FROM STREAM(LIVE.raw_orders);

CREATE OR REFRESH MATERIALIZED VIEW daily_summary  -- Depends on cleaned_orders
AS SELECT order_date, SUM(amount) AS total_amount
FROM LIVE.cleaned_orders
GROUP BY order_date;
```
The framework:
- Analyzes the LIVE. references to build a dependency graph
- Executes datasets in the correct order (topological sort)
- Parallelizes independent branches
- Handles failures and retries automatically
On the Exam: Know the three dataset types (streaming table, materialized view, view), the three expectation actions (warn, drop, fail), and how the LIVE. keyword creates automatic dependencies.
Review Questions
- A data engineer wants to create a pipeline dataset that performs a GROUP BY aggregation over a silver table. The results should be pre-computed and stored. Which dataset type should they use?
- Which expectation action removes records that violate the data quality constraint without stopping the pipeline?
- How does Lakeflow Declarative Pipelines determine the execution order of datasets?
- What is the key difference between a streaming table and a materialized view in Lakeflow Declarative Pipelines?