1.1 The Lakehouse Architecture
Key Takeaways
- The Lakehouse combines the low-cost storage of data lakes with the data management and ACID transaction capabilities of data warehouses.
- Databricks implements the Lakehouse using Delta Lake as the storage format, providing reliability, performance, and governance on top of cloud object storage.
- The Lakehouse eliminates the need for separate data lake and data warehouse systems, reducing data duplication and ETL complexity.
- Key Lakehouse benefits include ACID transactions, schema enforcement, time travel, unified batch and streaming, and direct BI access to data lake storage.
- The Data Intelligence Platform is Databricks' implementation of the Lakehouse, adding AI/ML capabilities, governance with Unity Catalog, and serverless compute.
The Lakehouse Architecture
Quick Answer: The Lakehouse architecture combines the best features of data lakes (low-cost scalable storage, support for all data types) with data warehouses (ACID transactions, schema enforcement, governance). Databricks implements this through Delta Lake on cloud object storage, eliminating the need for separate lake and warehouse systems.
The Problem: Traditional Two-Tier Architecture
Before the Lakehouse, organizations typically maintained two separate systems:
Data Lake
- Stores raw data in open formats (Parquet, CSV, JSON) on cheap cloud storage (S3, ADLS, GCS)
- Supports all data types: structured, semi-structured, unstructured
- Great for data science and ML workloads
- Problems: No ACID transactions, no schema enforcement, "data swamp" risk, poor BI performance
Data Warehouse
- Stores curated, structured data in proprietary formats
- ACID transactions, schema enforcement, fast SQL queries
- Great for BI and reporting
- Problems: Expensive storage, limited to structured data, vendor lock-in, data duplication from lake
The Two-Tier Pain Points
| Issue | Impact |
|---|---|
| Data duplication | Same data stored in both lake and warehouse |
| ETL complexity | Complex pipelines to move data between systems |
| Data staleness | Warehouse data lags behind the lake |
| Cost | Paying for two separate systems and storage |
| Governance gaps | Different security models across systems |
| Limited ML support | Warehouses not designed for ML workloads |
The Solution: Lakehouse Architecture
The Lakehouse architecture adds a metadata and management layer (Delta Lake) directly on top of data lake storage.
Key Lakehouse Properties
- ACID Transactions: Every write operation is atomic — it either fully succeeds or fully rolls back. No partial writes or corrupted data.
- Schema Enforcement and Evolution: Tables have defined schemas. Writes that don't match the schema are rejected. Schemas can evolve safely with mergeSchema.
- Time Travel: Every change to a Delta table is versioned. You can query any historical version of the data.
- Unified Batch and Streaming: The same Delta table can be written to by batch jobs and read by streaming queries (and vice versa).
- Direct BI Access: BI tools connect directly to Lakehouse data — no need to copy data into a separate warehouse.
- Open Format: Delta Lake uses Parquet as the underlying file format. Data is not locked into a proprietary system.
- Scalable Metadata: The Delta transaction log scales to tables with billions of files efficiently.
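The transaction-log mechanics behind several of these properties can be sketched with a deliberately simplified toy model in plain Python. This is not the Delta Lake API — the class and method names here are invented for illustration (real Delta Lake stores the log as JSON files under `_delta_log/` alongside Parquet data files) — but it shows the core idea: each commit appends an immutable version, a rejected write leaves no trace (atomicity plus schema enforcement), and old versions remain queryable (time travel).

```python
# Toy model of a versioned transaction log, illustrating in miniature how a
# Delta-style table gets atomic commits, schema enforcement, and time travel.
# All names are invented for illustration; this is not the Delta Lake API.

class ToyDeltaTable:
    def __init__(self, schema):
        self.schema = set(schema)   # enforced column names
        self.versions = [[]]        # versions[v] = full table state at version v

    def write(self, rows):
        """Atomically append rows as a new version (all-or-nothing)."""
        for row in rows:            # schema enforcement: reject mismatched writes
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {set(row)} != {self.schema}")
        # Commit only after every row validates -> no partial writes survive.
        self.versions.append(self.versions[-1] + rows)

    def read(self, version_as_of=None):
        """Read the latest version, or time-travel to an older one."""
        v = len(self.versions) - 1 if version_as_of is None else version_as_of
        return self.versions[v]

table = ToyDeltaTable(schema=["id", "name"])
table.write([{"id": 1, "name": "bronze"}])
table.write([{"id": 2, "name": "silver"}])

try:  # this write violates the schema, so it must change nothing
    table.write([{"id": 3}])
except ValueError:
    pass

print(len(table.read()))                 # 2 rows at the latest version
print(len(table.read(version_as_of=1)))  # 1 row when time-traveling to v1
```

In real Delta Lake, the same pattern is exposed through options such as `versionAsOf` when reading a table, and schema evolution is opted into explicitly (for example with the `mergeSchema` write option) rather than happening silently.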
Databricks Data Intelligence Platform
Databricks builds on the Lakehouse with the Data Intelligence Platform, which adds:
| Component | Purpose |
|---|---|
| Delta Lake | Storage layer with ACID transactions and versioning |
| Unity Catalog | Centralized governance, access control, and data lineage |
| Photon Engine | High-performance C++ query engine for SQL workloads |
| Serverless Compute | On-demand compute without cluster management |
| Databricks SQL | SQL analytics and dashboarding on Lakehouse data |
| Lakeflow | Data pipeline orchestration (Declarative Pipelines + Jobs) |
| MLflow | ML experiment tracking, model registry, and deployment |
| Mosaic AI | LLM serving, model training, and AI application development |
On the Exam: Understand that the Lakehouse is not just "Delta Lake." It is the combination of open storage format + metadata layer + governance + compute engine that together provide warehouse-like reliability on lake storage.
What is the primary advantage of the Lakehouse architecture over a traditional two-tier (data lake + data warehouse) approach?
Which underlying file format does Delta Lake use for data storage?
Which Databricks component provides centralized data governance, access control, and data lineage across the Lakehouse?