1.1 The Lakehouse Architecture

Key Takeaways

  • The Lakehouse combines the low-cost storage of data lakes with the data management and ACID transaction capabilities of data warehouses.
  • Databricks implements the Lakehouse using Delta Lake as the storage format, providing reliability, performance, and governance on top of cloud object storage.
  • The Lakehouse eliminates the need for separate data lake and data warehouse systems, reducing data duplication and ETL complexity.
  • Key Lakehouse benefits include ACID transactions, schema enforcement, time travel, unified batch and streaming, and direct BI access to data lake storage.
  • The Data Intelligence Platform is Databricks' implementation of the Lakehouse, adding AI/ML capabilities, governance with Unity Catalog, and serverless compute.
Last updated: March 2026


Quick Answer: The Lakehouse architecture combines the best features of data lakes (low-cost scalable storage, support for all data types) with data warehouses (ACID transactions, schema enforcement, governance). Databricks implements this through Delta Lake on cloud object storage, eliminating the need for separate lake and warehouse systems.

The Problem: Traditional Two-Tier Architecture

Before the Lakehouse, organizations typically maintained two separate systems:

Data Lake

  • Stores raw data in open formats (Parquet, CSV, JSON) on cheap cloud storage (S3, ADLS, GCS)
  • Supports all data types: structured, semi-structured, unstructured
  • Great for data science and ML workloads
  • Problems: No ACID transactions, no schema enforcement, "data swamp" risk, poor BI performance

Data Warehouse

  • Stores curated, structured data in proprietary formats
  • ACID transactions, schema enforcement, fast SQL queries
  • Great for BI and reporting
  • Problems: Expensive storage, limited to structured data, vendor lock-in, data duplication from lake

The Two-Tier Pain Points

| Issue | Impact |
| --- | --- |
| Data duplication | Same data stored in both lake and warehouse |
| ETL complexity | Complex pipelines to move data between systems |
| Data staleness | Warehouse data lags behind the lake |
| Cost | Paying for two separate systems and storage |
| Governance gaps | Different security models across systems |
| Limited ML support | Warehouses not designed for ML workloads |

The Solution: Lakehouse Architecture

The Lakehouse architecture adds a metadata and management layer (Delta Lake) directly on top of data lake storage, so one system serves both BI and ML workloads.
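Concretely, a Delta table is just a directory of Parquet data files plus a `_delta_log` subdirectory of ordered JSON commit files. The sketch below is plain Python with no Spark; the file contents are simplified stand-ins for illustration, not the real Delta protocol:

```python
import json
import tempfile
from pathlib import Path

# Build a directory that mimics the on-disk shape of a Delta table:
# Parquet data files at the top level, JSON commits under _delta_log/.
table = Path(tempfile.mkdtemp()) / "events"
log = table / "_delta_log"
log.mkdir(parents=True)

# Stand-in data file (a real table would hold actual Parquet bytes).
(table / "part-00000.snappy.parquet").write_bytes(b"...")

# Commit 0: a simplified "add file" action recorded in the log.
# Delta names commits with zero-padded 20-digit version numbers.
commit = {"add": {"path": "part-00000.snappy.parquet", "size": 3}}
(log / f"{0:020d}.json").write_text(json.dumps(commit))

print(sorted(p.name for p in table.iterdir()))
# The table's current state is whatever the ordered log files say it is.
```

Because the metadata layer lives next to the data in open formats, any engine that can read the log and the Parquet files can read the table.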

Key Lakehouse Properties

  1. ACID Transactions: Every write operation is atomic — it either fully succeeds or fully rolls back. No partial writes or corrupted data.

  2. Schema Enforcement and Evolution: Tables have defined schemas. Writes that don't match the schema are rejected. Schemas can evolve safely with mergeSchema.

  3. Time Travel: Every change to a Delta table is versioned. You can query any historical version of the data.

  4. Unified Batch and Streaming: The same Delta table can be written to by batch jobs and read by streaming queries (and vice versa).

  5. Direct BI Access: BI tools connect directly to Lakehouse data — no need to copy data into a separate warehouse.

  6. Open Format: Delta Lake uses Parquet as the underlying file format. Data is not locked into a proprietary system.

  7. Scalable Metadata: The Delta transaction log handles tables with billions of files efficiently, avoiding the metadata bottlenecks of listing raw cloud storage.
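Schema enforcement and evolution (property 2) can be illustrated with a toy validator. This is pure Python, not the Delta implementation; `merge_schema` here is a simplified stand-in for Delta's `mergeSchema` write option:

```python
def validate(rows, schema, merge_schema=False):
    """Reject rows whose columns don't match the schema.
    With merge_schema=True, unknown columns extend the schema instead
    (a toy version of Delta's mergeSchema write option)."""
    for row in rows:
        extra = set(row) - set(schema)
        if extra and not merge_schema:
            raise ValueError(f"schema mismatch: unexpected columns {extra}")
        for col in extra:
            schema[col] = type(row[col]).__name__  # evolve the schema
    return schema

schema = {"id": "int", "name": "str"}

# A write that matches the schema succeeds.
validate([{"id": 1, "name": "a"}], schema)

# A write with an unknown column is rejected by default...
try:
    validate([{"id": 2, "name": "b", "email": "b@x"}], schema)
except ValueError as e:
    print(e)

# ...but evolves the schema when merging is explicitly allowed.
evolved = validate([{"id": 2, "name": "b", "email": "b@x"}],
                   schema, merge_schema=True)
print(evolved)
```

The design point is that evolution is opt-in: accidental column drift is an error, deliberate schema changes are a single explicit flag.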
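Properties 1 and 3 both fall out of the transaction log: a commit either appends one complete log entry or nothing, and every past entry stays readable. A toy sketch of that idea in plain Python (again, not the real Delta protocol):

```python
class ToyDeltaTable:
    """Versioned table whose state is rebuilt from an append-only commit log."""

    def __init__(self):
        self._log = []  # one entry per committed version

    def write(self, rows):
        # Validate everything *before* committing, so a bad batch
        # changes nothing (all-or-nothing, like an ACID write).
        if any("id" not in r for r in rows):
            raise ValueError("write rejected; table unchanged")
        self._log.append(list(rows))  # atomic: one append = one commit

    def read(self, version=None):
        # Time travel: replay the log up to the requested version.
        end = len(self._log) if version is None else version + 1
        return [row for commit in self._log[:end] for row in commit]

t = ToyDeltaTable()
t.write([{"id": 1}])                 # version 0
t.write([{"id": 2}, {"id": 3}])      # version 1

try:
    t.write([{"id": 4}, {"bad": True}])  # rejected: no partial write
except ValueError:
    pass

print(len(t.read()))            # latest version sees 3 rows
print(len(t.read(version=0)))   # version 0 still sees just 1 row
```

In real Delta tables the same pattern applies, with the log stored as files in `_delta_log` and queries like `SELECT * FROM events VERSION AS OF 0` replaying it.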

Databricks Data Intelligence Platform

Databricks builds on the Lakehouse with the Data Intelligence Platform, which adds:

| Component | Purpose |
| --- | --- |
| Delta Lake | Storage layer with ACID transactions and versioning |
| Unity Catalog | Centralized governance, access control, and data lineage |
| Photon Engine | High-performance C++ query engine for SQL workloads |
| Serverless Compute | On-demand compute without cluster management |
| Databricks SQL | SQL analytics and dashboarding on Lakehouse data |
| Lakeflow | Data pipeline orchestration (Declarative Pipelines + Jobs) |
| MLflow | ML experiment tracking, model registry, and deployment |
| Mosaic AI | LLM serving, model training, and AI application development |

On the Exam: Understand that the Lakehouse is not just "Delta Lake." It is the combination of open storage format + metadata layer + governance + compute engine that together provide warehouse-like reliability on lake storage.

Figure: Lakehouse Architecture Overview
Test Your Knowledge

  • What is the primary advantage of the Lakehouse architecture over a traditional two-tier (data lake + data warehouse) approach?
  • Which underlying file format does Delta Lake use for data storage?
  • Which Databricks component provides centralized data governance, access control, and data lineage across the Lakehouse?