1.1 The Lakehouse Architecture
Key Takeaways
- The Lakehouse combines the low-cost storage of data lakes with the data management and ACID transaction capabilities of data warehouses.
- Databricks implements the Lakehouse using Delta Lake as the storage format, providing reliability, performance, and governance on top of cloud object storage.
- The Lakehouse eliminates the need for separate data lake and data warehouse systems, reducing data duplication and ETL complexity.
- Key Lakehouse benefits include ACID transactions, schema enforcement, time travel, unified batch and streaming, and direct BI access to data lake storage.
- The Data Intelligence Platform is Databricks' implementation of the Lakehouse, adding AI/ML capabilities, governance with Unity Catalog, and serverless compute.
The Lakehouse Architecture
Quick Answer: The Lakehouse architecture combines the best features of data lakes (low-cost scalable storage, support for all data types) with data warehouses (ACID transactions, schema enforcement, governance). Databricks implements this through Delta Lake on cloud object storage, eliminating the need for separate lake and warehouse systems.
The Problem: Traditional Two-Tier Architecture
Before the Lakehouse, organizations typically maintained two separate systems:
Data Lake
- Stores raw data in open formats (Parquet, CSV, JSON) on cheap cloud storage (S3, ADLS, GCS)
- Supports all data types: structured, semi-structured, unstructured
- Great for data science and ML workloads
- Problems: No ACID transactions, no schema enforcement, "data swamp" risk, poor BI performance
Data Warehouse
- Stores curated, structured data in proprietary formats
- ACID transactions, schema enforcement, fast SQL queries
- Great for BI and reporting
- Problems: Expensive storage, limited to structured data, vendor lock-in, data duplication from lake
The Two-Tier Pain Points
| Issue | Impact |
|---|---|
| Data duplication | Same data stored in both lake and warehouse |
| ETL complexity | Complex pipelines to move data between systems |
| Data staleness | Warehouse data lags behind the lake |
| Cost | Paying for two separate systems and storage |
| Governance gaps | Different security models across systems |
| Limited ML support | Warehouses not designed for ML workloads |
The Solution: Lakehouse Architecture
The Lakehouse architecture adds a metadata and management layer (Delta Lake) directly on top of data lake storage.
Key Lakehouse Properties
- ACID Transactions: Every write operation is atomic — it either fully succeeds or fully rolls back. No partial writes or corrupted data.
- Schema Enforcement and Evolution: Tables have defined schemas. Writes that don't match the schema are rejected. Schemas can evolve safely with mergeSchema.
- Time Travel: Every change to a Delta table is versioned. You can query any historical version of the data.
- Unified Batch and Streaming: The same Delta table can be written to by batch jobs and read by streaming queries (and vice versa).
- Direct BI Access: BI tools connect directly to Lakehouse data — no need to copy data into a separate warehouse.
- Open Format: Delta Lake uses Parquet as the underlying file format. Data is not locked into a proprietary system.
- Scalable Metadata: The Delta transaction log scales to tables with billions of files efficiently.
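The transaction-log mechanics behind several of these properties can be sketched with a deliberately simplified toy model in plain Python. This is not the Delta Lake API — the class and method names here are invented for illustration (real Delta Lake stores the log as JSON files under `_delta_log/` alongside Parquet data files) — but it shows the core idea: each commit appends an immutable version, a rejected write leaves no trace (atomicity plus schema enforcement), and old versions remain queryable (time travel).

```python
# Toy model of a versioned transaction log, illustrating in miniature how a
# Delta-style table gets atomic commits, schema enforcement, and time travel.
# All names are invented for illustration; this is not the Delta Lake API.

class ToyDeltaTable:
    def __init__(self, schema):
        self.schema = set(schema)   # enforced column names
        self.versions = [[]]        # versions[v] = full table state at version v

    def write(self, rows):
        """Atomically append rows as a new version (all-or-nothing)."""
        for row in rows:            # schema enforcement: reject mismatched writes
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {set(row)} != {self.schema}")
        # Commit only after every row validates -> no partial writes survive.
        self.versions.append(self.versions[-1] + rows)

    def read(self, version_as_of=None):
        """Read the latest version, or time-travel to an older one."""
        v = len(self.versions) - 1 if version_as_of is None else version_as_of
        return self.versions[v]

table = ToyDeltaTable(schema=["id", "name"])
table.write([{"id": 1, "name": "bronze"}])
table.write([{"id": 2, "name": "silver"}])

try:  # this write violates the schema, so it must change nothing
    table.write([{"id": 3}])
except ValueError:
    pass

print(len(table.read()))                 # 2 rows at the latest version
print(len(table.read(version_as_of=1)))  # 1 row when time-traveling to v1
```

In real Delta Lake, the same pattern is exposed through options such as `versionAsOf` when reading a table, and schema evolution is opted into explicitly (for example with the `mergeSchema` write option) rather than happening silently.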
Databricks Data Intelligence Platform
Databricks builds on the Lakehouse with the Data Intelligence Platform, which adds:
| Component | Purpose |
|---|---|
| Delta Lake | Storage layer with ACID transactions and versioning |
| Unity Catalog | Centralized governance, access control, and data lineage |
| Photon Engine | High-performance C++ query engine for SQL workloads |
| Serverless Compute | On-demand compute without cluster management |
| Databricks SQL | SQL analytics and dashboarding on Lakehouse data |
| Lakeflow | Data pipeline orchestration (Declarative Pipelines + Jobs) |
| MLflow | ML experiment tracking, model registry, and deployment |
| Mosaic AI | LLM serving, model training, and AI application development |
On the Exam: Understand that the Lakehouse is not just "Delta Lake." It is the combination of open storage format + metadata layer + governance + compute engine that together provide warehouse-like reliability on lake storage.
What is the primary advantage of the Lakehouse architecture over a traditional two-tier (data lake + data warehouse) approach?
Which underlying file format does Delta Lake use for data storage?
Which Databricks component provides centralized data governance, access control, and data lineage across the Lakehouse?