4.4 Testing and Deployment Best Practices

Key Takeaways

  • Development, staging, and production environments should be isolated using separate catalogs or schemas in Unity Catalog.
  • Lakeflow Declarative Pipelines in development mode create tables prefixed with the developer name to avoid conflicts.
  • Integration tests validate end-to-end pipeline behavior using small sample datasets in a staging environment.
  • Promotion from dev to prod should follow a code review, CI/CD pipeline, and validation workflow.
  • Parameterize notebooks and jobs with environment-specific values (catalog names, storage paths) rather than hardcoding.
Last updated: March 2026

Quick Answer: Use separate Unity Catalog catalogs for dev/staging/prod environments. Parameterize code with environment variables. Deploy via Databricks Asset Bundles with CI/CD. Test with integration tests on sample data in staging before promoting to production.

Environment Isolation

Unity Catalog-Based Isolation

Production:   prod_catalog.schema.table
Staging:      staging_catalog.schema.table
Development:  dev_catalog.schema.table
| Environment | Catalog | Purpose | Access |
| --- | --- | --- | --- |
| Development | dev_catalog | Individual developer exploration and testing | Developers |
| Staging | staging_catalog | Integration testing with production-like data | CI/CD pipeline |
| Production | prod_catalog | Live data serving BI, ML, and applications | Production service principals |
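As a minimal sketch of this convention, fully qualified table names can be derived from a single environment setting instead of hardcoded (the `fq_table` helper and the `CATALOGS` mapping are illustrative, not a Databricks API):

```python
# Hypothetical helper: build a fully qualified Unity Catalog table name
# from an environment setting, so code never hardcodes a catalog.
CATALOGS = {
    "dev": "dev_catalog",
    "staging": "staging_catalog",
    "prod": "prod_catalog",
}

def fq_table(env: str, schema: str, table: str) -> str:
    """Return catalog.schema.table for the given environment."""
    return f"{CATALOGS[env]}.{schema}.{table}"
```

The same transformation code can then target any environment by changing one argument.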

Parameterization

# Use widgets for environment-specific configuration
dbutils.widgets.text("catalog", "dev_catalog", "Target Catalog")
catalog = dbutils.widgets.get("catalog")

# All table references use the parameterized catalog
spark.sql(f"USE CATALOG {catalog}")
orders_df = spark.sql(f"SELECT * FROM {catalog}.silver.orders")

-- SQL equivalent with a named parameter
USE CATALOG ${catalog};
SELECT * FROM silver.orders;
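Note that `dbutils.widgets` is only available on Databricks, so notebooks parameterized this way cannot run unmodified in local unit tests. One common workaround is to fall back to an environment variable when the widget API is absent (a sketch; the `TARGET_CATALOG` variable name is an assumption, not a Databricks convention):

```python
import os

def get_catalog(default: str = "dev_catalog") -> str:
    """Resolve the target catalog: prefer the Databricks widget when
    running on a cluster, otherwise fall back to an environment
    variable so the same code works in local tests."""
    try:
        return dbutils.widgets.get("catalog")  # dbutils only exists on Databricks
    except NameError:
        return os.environ.get("TARGET_CATALOG", default)
```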

Testing Strategies

Unit Tests

# Test individual transformation functions
def test_classify_amount():
    assert classify_amount(1500) == "high"
    assert classify_amount(500) == "medium"
    assert classify_amount(50) == "low"

# Run with pytest
# pytest tests/test_transforms.py
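The `classify_amount` function under test is not defined in this guide; a sketch consistent with the assertions above might be (thresholds are illustrative, chosen only to satisfy the tests):

```python
def classify_amount(amount: float) -> str:
    """Bucket an order amount into low/medium/high tiers.
    Thresholds are illustrative examples, not business rules."""
    if amount >= 1000:
        return "high"
    if amount >= 100:
        return "medium"
    return "low"
```

Keeping transformations in plain functions like this, rather than inline in notebooks, is what makes them unit-testable with pytest in the first place.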

Integration Tests

# Test end-to-end pipeline with sample data
def test_pipeline_end_to_end():
    # 1. Create sample input data
    sample_data = spark.createDataFrame([
        (1, 100, "2026-01-01"),
        (2, 200, "2026-01-02"),
    ], ["order_id", "amount", "order_date"])

    # 2. Write to staging bronze table
    sample_data.write.mode("overwrite").saveAsTable(
        "staging_catalog.bronze.test_orders"
    )

    # 3. Run the pipeline (or the transformation notebook)
    dbutils.notebook.run("./transform", timeout_seconds=300)

    # 4. Verify output
    result = spark.table("staging_catalog.silver.test_orders")
    assert result.count() == 2
    assert result.filter("amount <= 0").count() == 0

Data Quality Tests

-- Validate data quality after pipeline run
-- These can be SQL tasks in a Databricks job

-- Test: No null order IDs
SELECT COUNT(*) AS null_count
FROM prod_catalog.silver.orders
WHERE order_id IS NULL;
-- Expected: 0

-- Test: All amounts are positive
SELECT COUNT(*) AS negative_count
FROM prod_catalog.silver.orders
WHERE amount <= 0;
-- Expected: 0

-- Test: No duplicate orders
SELECT order_id, COUNT(*) AS cnt
FROM prod_catalog.silver.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Expected: 0 rows
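The same three checks can also be expressed as a plain-Python helper so they are reusable in unit tests before data ever reaches a table (a sketch; the function name and dict-per-row format are illustrative):

```python
def validate_orders(rows):
    """Apply the three data quality checks to a batch of order dicts.
    Returns a list of failure messages; an empty list means all pass."""
    failures = []
    if any(r["order_id"] is None for r in rows):
        failures.append("null order_id")
    if any(r["amount"] is not None and r["amount"] <= 0 for r in rows):
        failures.append("non-positive amount")
    ids = [r["order_id"] for r in rows if r["order_id"] is not None]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id")
    return failures
```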

Deployment Workflow

Recommended CI/CD Pipeline

1. Developer creates feature branch
2. Developer tests locally in dev environment
3. Pull request submitted for code review
4. CI pipeline runs:
   a. databricks bundle validate -t staging
   b. Deploy to staging: databricks bundle deploy -t staging
   c. Run integration tests: databricks bundle run test-job -t staging
5. Code review approved
6. Merge to main branch
7. CD pipeline runs:
   a. databricks bundle validate -t prod
   b. Deploy to production: databricks bundle deploy -t prod
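
The CI/CD steps above assume a bundle definition exists. A minimal `databricks.yml` supporting the staging and prod targets might look like the following sketch (the bundle name, workspace hosts, and variable names are placeholders, not values from this guide):

```yaml
# databricks.yml -- minimal Asset Bundle sketch; all values are illustrative
bundle:
  name: orders_pipeline

variables:
  catalog:
    description: Target Unity Catalog catalog
    default: dev_catalog

targets:
  staging:
    workspace:
      host: https://staging-workspace.cloud.databricks.com
    variables:
      catalog: staging_catalog

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
    variables:
      catalog: prod_catalog
```

Each `databricks bundle deploy -t <target>` invocation then resolves the catalog variable for that environment, so the job code itself stays environment-agnostic.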

Deployment Checklist

| Check | Verification |
| --- | --- |
| Bundle validates | databricks bundle validate succeeds |
| Integration tests pass | All staging tests complete successfully |
| Data quality validated | Expectations pass in staging pipeline run |
| Code reviewed | PR approved by at least one reviewer |
| Performance tested | Pipeline completes within SLA on staging data |
| Rollback plan | Previous version can be restored quickly |

Lakeflow Pipeline Modes

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Development | Tables prefixed with developer name; pipeline does not publish to target | Developer testing and iteration |
| Production | Tables published to target catalog/schema; full execution | Live production workloads |

On the Exam: Understand the concept of environment isolation using catalogs, parameterization for environment-agnostic code, and the deployment workflow from dev to staging to production using Asset Bundles and CI/CD.

Test Your Knowledge

How should a data engineer isolate development, staging, and production environments in a Databricks Lakehouse?

Test Your Knowledge

What behavior does "development mode" enable for Lakeflow Declarative Pipelines?
