1.1 The Lakehouse Architecture

Key Takeaways

  • The lakehouse unifies the low-cost, open storage of a data lake with the ACID transactions, governance, and performance of a data warehouse on one copy of data.
  • Delta Lake is the open storage layer that turns commodity object storage (S3, ADLS, GCS) into a transactional lakehouse by adding a JSON transaction log on top of Parquet.
  • The medallion architecture organizes data into Bronze (raw), Silver (cleansed/conformed), and Gold (business-level aggregate) layers for progressive refinement.
  • Separating compute from storage lets multiple engines (Spark, Photon, SQL warehouses) read the same governed tables without copying data.
  • Unity Catalog provides one governance layer across all workspaces, so tables, volumes, and models share a single permission and lineage model.
Last updated: June 2026

What Problem the Lakehouse Solves

Before the lakehouse, organizations ran two disconnected systems. A data lake stored raw files (JSON, CSV, Parquet) cheaply on object storage but had no transactions, no schema enforcement, and poor query performance — it easily degraded into a "data swamp." A separate data warehouse offered fast SQL, ACID guarantees, and governance, but required copying data into a proprietary format, was expensive, and could not handle unstructured data, machine learning, or streaming well.

The two-system pattern forced teams to maintain brittle ETL that copied data from the lake into the warehouse, creating stale duplicates, extra cost, and governance gaps. The lakehouse collapses these into one architecture: a single copy of data in open formats on cheap object storage, with a transactional metadata layer that delivers warehouse-grade reliability and speed directly on the lake.

The Databricks Data Intelligence Platform

Databricks brands its lakehouse as the Data Intelligence Platform. Its defining traits are:

Capability                      How the lakehouse delivers it
Open storage                    Data lives as Parquet files in your own cloud object store (S3, ADLS, GCS)
ACID transactions               The Delta Lake transaction log makes concurrent reads/writes safe
Schema enforcement & evolution  Bad-shape writes are rejected; intended changes are merged in
Decoupled compute               Many engines read the same tables; storage and compute scale independently
Unified governance              Unity Catalog secures tables, volumes, and models across workspaces
All workloads                   SQL analytics, ETL, streaming, BI, and ML run on the same data
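
The "schema enforcement & evolution" row can be made concrete with a small model. The sketch below is plain Python, not the Delta Lake API: the `write` function, its `merge_schema` flag, and the dictionary-based table are hypothetical names chosen for illustration. It assumes a write is rejected when it carries an unknown column or a wrong type, and that an explicit opt-in lets new columns be merged into the schema.

```python
from typing import Dict, List

# Hypothetical table schema: column name -> expected Python type.
Schema = Dict[str, type]


def enforce_schema(schema: Schema, rows: List[dict]) -> None:
    """Raise ValueError if any row has an unknown column or a wrong type."""
    for row in rows:
        extra = set(row) - set(schema)
        if extra:
            raise ValueError(f"unknown columns: {sorted(extra)}")
        for column, value in row.items():
            if value is not None and not isinstance(value, schema[column]):
                raise ValueError(f"{column!r} expects {schema[column].__name__}")


def evolve_schema(schema: Schema, rows: List[dict]) -> Schema:
    """Return a copy of the schema extended with columns first seen in rows."""
    merged = dict(schema)
    for row in rows:
        for column, value in row.items():
            merged.setdefault(column, type(value))
    return merged


def write(table: dict, rows: List[dict], merge_schema: bool = False) -> None:
    """Append rows to a toy table; nothing is written if validation fails."""
    schema = evolve_schema(table["schema"], rows) if merge_schema else table["schema"]
    enforce_schema(schema, rows)  # validate the whole batch before mutating
    table["schema"] = schema
    table["rows"].extend(rows)


table = {"schema": {"id": int, "name": str}, "rows": []}
write(table, [{"id": 1, "name": "Ada"}])

new_rows = [{"id": 2, "name": "Bo", "email": "bo@example.com"}]
try:
    write(table, new_rows)  # enforcement: the extra column is rejected
except ValueError as err:
    print("rejected:", err)

write(table, new_rows, merge_schema=True)  # evolution: the change is intended
print(sorted(table["schema"]))
```

Validating the whole batch before appending anything mirrors the all-or-nothing behavior the table attributes to ACID transactions: a rejected write leaves the table untouched.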

Because storage is decoupled from compute, you can spin up a SQL warehouse, an all-purpose cluster, and a jobs pipeline that all read the identical governed Delta tables — with no data movement and no copies to keep in sync.

The Medallion Architecture

Databricks recommends organizing lakehouse data into three quality tiers, the medallion (or multi-hop) architecture. Data flows forward, getting cleaner and more business-ready at each hop:

  • Bronze (raw): Ingested data landed as-is from source systems, with little or no transformation. Bronze preserves a full, replayable history of what arrived, including ingestion metadata (load time, source file). It is the system of record for reprocessing.
  • Silver (cleansed/conformed): Data is filtered, deduplicated, type-cast, and joined into a clean, queryable model. Silver enforces data-quality rules and conforms entities (one consistent customer, product, etc.) across sources.
  • Gold (curated): Business-level aggregates and project-specific tables optimized for analytics, dashboards, and reporting — for example, daily revenue by region or features for a model.

Because validation happens downstream of ingestion, a bad source file is contained in Bronze: the raw input is still stored exactly as it arrived, so you can fix the transformation logic and rebuild Silver and Gold without re-ingesting from the source. This progressive refinement is the backbone of reliable pipelines on the platform.
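
The three hops can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Databricks code: the order events, the file name `orders_batch.json`, and the functions `to_bronze`, `to_silver`, and `to_gold` are all hypothetical, and in-memory lists stand in for Delta tables.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw order events exactly as they arrived from a source system.
raw_events = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "1", "region": "EU", "amount": "10.50"},  # duplicate delivery
    {"order_id": "2", "region": "US", "amount": "n/a"},    # malformed amount
    {"order_id": "3", "region": "US", "amount": "4.00"},
]


def to_bronze(events, source_file):
    """Land events unchanged, adding only ingestion metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [dict(e, _source_file=source_file, _loaded_at=loaded_at) for e in events]


def to_silver(bronze):
    """Type-cast, drop rows that fail validation, and deduplicate by order_id."""
    seen, silver = set(), []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # data-quality rule: amount must be numeric
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return silver


def to_gold(silver):
    """Aggregate to a business-level table: revenue by region."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["region"]] += row["amount"]
    return dict(totals)


bronze = to_bronze(raw_events, "orders_batch.json")
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Because `bronze` keeps all four events, including the malformed one, a changed rule in `to_silver` can be re-run over the same input without going back to the source system.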

Delta Lake and Unity Catalog as Foundations

Two components make the lakehouse real on Databricks. Delta Lake is the open-source storage layer that adds a transaction log on top of Parquet, providing ACID guarantees, time travel, and schema management. Delta is the default table format on Databricks, so a managed table is a Delta table unless you choose otherwise. Unity Catalog is the unified governance layer. It uses a three-level namespace, catalog.schema.table, and centralizes access control, auditing, lineage, and discovery across every workspace in an account.
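
To see how a log on top of immutable data files yields both atomic commits and time travel, consider the following toy model. It is a simplified illustration, not the Delta Lake protocol: the `commit` and `snapshot` functions and the flat `{"add": path}` / `{"remove": path}` action shape are assumptions made for brevity, and the real log also records table metadata, file statistics, and checkpoints.

```python
import json
import os
import tempfile


def commit(log_dir: str, version: int, actions: list) -> None:
    """Write one commit file; fail if that version already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" creates the file exclusively, so a second writer racing for the
    # same version gets FileExistsError instead of overwriting the commit.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(log_dir: str, as_of=None) -> set:
    """Replay commits in version order and return the set of live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return live


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "part-0.parquet"}])
commit(log_dir, 1, [{"add": "part-1.parquet"}])
# A rewrite swaps one file for another in a single atomic commit.
commit(log_dir, 2, [{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}])

latest = snapshot(log_dir)              # current table state
version_1 = snapshot(log_dir, as_of=1)  # time travel to an earlier version
print(sorted(latest), sorted(version_1))
```

Readers never see a half-finished write in this model, because a data file only becomes part of the table once the commit that adds it exists in the log.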

For the exam, remember the core value proposition: one copy of open-format data, made reliable by Delta Lake, governed by Unity Catalog, and served to every workload — without the cost and drift of a separate warehouse.
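
The three-level namespace from this section can also be sketched in plain Python. This is an illustrative model, not the Unity Catalog API: the names `main`, `sales`, `orders`, and `analysts`, the `grants` dictionary, and the `can_select` helper are hypothetical. It assumes that reading a table requires USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on the table, and it ignores privilege inheritance and backtick-quoted identifiers.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(name: str) -> TableName:
    """Split 'catalog.schema.table' into its three levels."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return TableName(*parts)


# Hypothetical grants: (principal, securable) -> set of privileges.
grants = {
    ("analysts", "main"): {"USE CATALOG"},
    ("analysts", "main.sales"): {"USE SCHEMA"},
    ("analysts", "main.sales.orders"): {"SELECT"},
}


def can_select(principal: str, name: str) -> bool:
    """Check each level of the namespace on the way down to the table."""
    t = parse_table_name(name)
    return (
        "USE CATALOG" in grants.get((principal, t.catalog), set())
        and "USE SCHEMA" in grants.get((principal, f"{t.catalog}.{t.schema}"), set())
        and "SELECT" in grants.get((principal, name), set())
    )


print(parse_table_name("main.sales.orders"))
print(can_select("analysts", "main.sales.orders"))
print(can_select("analysts", "main.sales.customers"))
```

Keeping the grants in one place, keyed by fully qualified name, is the point of the model: every workspace that resolves `main.sales.orders` consults the same permissions.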

Decoupled Compute and Storage

A defining property of the lakehouse is the separation of compute from storage. In a classic warehouse the two are bound together, so to query more data you must scale the whole appliance, and idle compute still costs money. On the lakehouse, data sits in your cloud object store as a durable, independent layer, and compute is provisioned separately and on demand. The practical consequences matter for everyday engineering:

  • Independent scaling: you grow storage simply by writing more files; you grow compute by adding clusters or larger warehouses, and the two never have to move together.
  • Many engines, one copy: an interactive notebook cluster, a scheduled jobs pipeline, a serverless SQL warehouse, and an external BI tool can all read the same Delta tables concurrently. No engine owns the data, and there is no extract step to keep current.
  • Elastic cost: compute can autoscale and auto-terminate when idle, so you pay only while actually processing — something a coupled warehouse cannot do.
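
The elastic-cost point is simple arithmetic. The figures below are made up for illustration (a rate of 4.0 cost units per compute-hour and 3 busy hours per day are assumptions, not Databricks pricing); the sketch only shows how billing for busy hours compares with an always-on system.

```python
# Hypothetical figures for illustration only, not real pricing.
HOURLY_RATE = 4.0        # cost units per compute-hour
BUSY_HOURS_PER_DAY = 3   # hours per day in which queries or jobs actually run

# Coupled system: compute stays on around the clock.
always_on_cost = 24 * HOURLY_RATE

# Decoupled system with auto-termination: billed only while processing.
on_demand_cost = BUSY_HOURS_PER_DAY * HOURLY_RATE

savings = 1 - on_demand_cost / always_on_cost
print(always_on_cost, on_demand_cost, f"{savings:.1%}")
```

The storage bill is the same in both cases, which is why the comparison only involves compute hours.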

This design is what lets a single governed table serve SQL analytics, machine learning, and streaming at once. It also explains why the platform talks about workloads rather than databases: the table is the durable asset, and any engine attaches to it.

How the Pieces Fit Together

Putting the layers in order clarifies the whole architecture. Cloud object storage holds the bytes cheaply and durably. Delta Lake wraps those Parquet files with a transaction log to add ACID reliability, time travel, and schema management. Unity Catalog governs the resulting tables, volumes, and models with one permission and lineage model across every workspace. The medallion architecture organizes the data into Bronze, Silver, and Gold quality tiers as it is refined. Finally, compute — clusters, jobs, and SQL warehouses, accelerated by Photon — runs the actual work against that governed data.

Internalizing this stack, from raw object storage up through governed, query-ready Gold tables, is the mental model the rest of the exam builds on, and it is the reason the lakehouse can replace a separate lake-plus-warehouse pipeline with a single, reliable system.

Test Your Knowledge

What is the central architectural advantage of the lakehouse over the traditional two-tier (data lake + separate data warehouse) approach?


In the medallion architecture, which layer holds raw, unprocessed data ingested as-is from source systems?


Which two technologies are the foundational layers that make the lakehouse transactional and governed on Databricks?
