4.4 Testing and Deployment Best Practices

Key Takeaways

  • Development, staging, and production environments should be isolated using separate catalogs or schemas in Unity Catalog.
  • Lakeflow Declarative Pipelines in development mode create tables prefixed with the developer name to avoid conflicts.
  • Integration tests validate end-to-end pipeline behavior using small sample datasets in a staging environment.
  • Promotion from dev to prod should follow a code review, CI/CD pipeline, and validation workflow.
  • Parameterize notebooks and jobs with environment-specific values (catalog names, storage paths) rather than hardcoding.
Last updated: March 2026

Quick Answer: Use separate Unity Catalog catalogs for dev/staging/prod environments. Parameterize code with environment variables. Deploy via Databricks Asset Bundles with CI/CD. Test with integration tests on sample data in staging before promoting to production.

Environment Isolation

Unity Catalog-Based Isolation

Production:   prod_catalog.schema.table
Staging:      staging_catalog.schema.table
Development:  dev_catalog.schema.table
| Environment | Catalog | Purpose | Access |
| --- | --- | --- | --- |
| Development | dev_catalog | Individual developer exploration and testing | Developers |
| Staging | staging_catalog | Integration testing with production-like data | CI/CD pipeline |
| Production | prod_catalog | Live data serving BI, ML, and applications | Production service principals |
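As a minimal sketch of this convention, fully qualified table names can be derived from a single environment setting instead of hardcoded (the `fq_table` helper and the `CATALOGS` mapping are illustrative, not a Databricks API):

```python
# Hypothetical helper: build a fully qualified Unity Catalog table name
# from an environment setting, so code never hardcodes a catalog.
CATALOGS = {
    "dev": "dev_catalog",
    "staging": "staging_catalog",
    "prod": "prod_catalog",
}

def fq_table(env: str, schema: str, table: str) -> str:
    """Return catalog.schema.table for the given environment."""
    return f"{CATALOGS[env]}.{schema}.{table}"
```

The same transformation code can then target any environment by changing one argument.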

Parameterization

# Use widgets for environment-specific configuration
dbutils.widgets.text("catalog", "dev_catalog", "Target Catalog")
catalog = dbutils.widgets.get("catalog")

# All table references use the parameterized catalog
spark.sql(f"USE CATALOG {catalog}")
orders_df = spark.sql(f"SELECT * FROM {catalog}.silver.orders")

-- SQL equivalent with a named parameter
USE CATALOG ${catalog};
SELECT * FROM silver.orders;
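Note that `dbutils.widgets` is only available on Databricks, so notebooks parameterized this way cannot run unmodified in local unit tests. One common workaround is to fall back to an environment variable when the widget API is absent (a sketch; the `TARGET_CATALOG` variable name is an assumption, not a Databricks convention):

```python
import os

def get_catalog(default: str = "dev_catalog") -> str:
    """Resolve the target catalog: prefer the Databricks widget when
    running on a cluster, otherwise fall back to an environment
    variable so the same code works in local tests."""
    try:
        return dbutils.widgets.get("catalog")  # dbutils only exists on Databricks
    except NameError:
        return os.environ.get("TARGET_CATALOG", default)
```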

Testing Strategies

Unit Tests

# Test individual transformation functions
def test_classify_amount():
    assert classify_amount(1500) == "high"
    assert classify_amount(500) == "medium"
    assert classify_amount(50) == "low"

# Run with pytest
# pytest tests/test_transforms.py
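The `classify_amount` function under test is not defined in this guide; a sketch consistent with the assertions above might be (thresholds are illustrative, chosen only to satisfy the tests):

```python
def classify_amount(amount: float) -> str:
    """Bucket an order amount into low/medium/high tiers.
    Thresholds are illustrative examples, not business rules."""
    if amount >= 1000:
        return "high"
    if amount >= 100:
        return "medium"
    return "low"
```

Keeping transformations in plain functions like this, rather than inline in notebooks, is what makes them unit-testable with pytest in the first place.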

Integration Tests

# Test end-to-end pipeline with sample data
def test_pipeline_end_to_end():
    # 1. Create sample input data
    sample_data = spark.createDataFrame([
        (1, 100, "2026-01-01"),
        (2, 200, "2026-01-02"),
    ], ["order_id", "amount", "order_date"])

    # 2. Write to staging bronze table
    sample_data.write.mode("overwrite").saveAsTable(
        "staging_catalog.bronze.test_orders"
    )

    # 3. Run the pipeline (or the transformation notebook)
    dbutils.notebook.run("./transform", timeout_seconds=300)

    # 4. Verify output
    result = spark.table("staging_catalog.silver.test_orders")
    assert result.count() == 2
    assert result.filter("amount <= 0").count() == 0

Data Quality Tests

-- Validate data quality after pipeline run
-- These can be SQL tasks in a Databricks job

-- Test: No null order IDs
SELECT COUNT(*) AS null_count
FROM prod_catalog.silver.orders
WHERE order_id IS NULL;
-- Expected: 0

-- Test: All amounts are positive
SELECT COUNT(*) AS negative_count
FROM prod_catalog.silver.orders
WHERE amount <= 0;
-- Expected: 0

-- Test: No duplicate orders
SELECT order_id, COUNT(*) AS cnt
FROM prod_catalog.silver.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
-- Expected: 0 rows
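The same three checks can also be expressed as a plain-Python helper so they are reusable in unit tests before data ever reaches a table (a sketch; the function name and dict-per-row format are illustrative):

```python
def validate_orders(rows):
    """Apply the three data quality checks to a batch of order dicts.
    Returns a list of failure messages; an empty list means all pass."""
    failures = []
    if any(r["order_id"] is None for r in rows):
        failures.append("null order_id")
    if any(r["amount"] is not None and r["amount"] <= 0 for r in rows):
        failures.append("non-positive amount")
    ids = [r["order_id"] for r in rows if r["order_id"] is not None]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id")
    return failures
```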

Deployment Workflow

Recommended CI/CD Pipeline

1. Developer creates feature branch
2. Developer tests locally in dev environment
3. Pull request submitted for code review
4. CI pipeline runs:
   a. databricks bundle validate -t staging
   b. Deploy to staging: databricks bundle deploy -t staging
   c. Run integration tests: databricks bundle run test-job -t staging
5. Code review approved
6. Merge to main branch
7. CD pipeline runs:
   a. databricks bundle validate -t prod
   b. Deploy to production: databricks bundle deploy -t prod
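
The CI/CD steps above assume a bundle definition exists. A minimal `databricks.yml` supporting the staging and prod targets might look like the following sketch (the bundle name, workspace hosts, and variable names are placeholders, not values from this guide):

```yaml
# databricks.yml -- minimal Asset Bundle sketch; all values are illustrative
bundle:
  name: orders_pipeline

variables:
  catalog:
    description: Target Unity Catalog catalog
    default: dev_catalog

targets:
  staging:
    workspace:
      host: https://staging-workspace.cloud.databricks.com
    variables:
      catalog: staging_catalog

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
    variables:
      catalog: prod_catalog
```

Each `databricks bundle deploy -t <target>` invocation then resolves the catalog variable for that environment, so the job code itself stays environment-agnostic.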

Deployment Checklist

| Check | Verification |
| --- | --- |
| Bundle validates | databricks bundle validate succeeds |
| Integration tests pass | All staging tests complete successfully |
| Data quality validated | Expectations pass in staging pipeline run |
| Code reviewed | PR approved by at least one reviewer |
| Performance tested | Pipeline completes within SLA on staging data |
| Rollback plan | Previous version can be restored quickly |

Lakeflow Pipeline Modes

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Development | Tables prefixed with developer name; pipeline does not publish to target | Developer testing and iteration |
| Production | Tables published to target catalog/schema; full execution | Live production workloads |

On the Exam: Understand the concept of environment isolation using catalogs, parameterization for environment-agnostic code, and the deployment workflow from dev to staging to production using Asset Bundles and CI/CD.

Test Your Knowledge

How should a data engineer isolate development, staging, and production environments in a Databricks Lakehouse?

Test Your Knowledge

What behavior does "development mode" enable for Lakeflow Declarative Pipelines?
