5.1 Modern Data Warehouse and Lakehouse

Key Takeaways

  • Every Azure analytics workload follows the same five stages: ingest, store, process, serve, and visualize.
  • Data warehouses use schema-on-write for fast BI; data lakes use schema-on-read for cheap raw storage; lakehouses combine both using Delta Lake on Parquet.
  • OneLake is the single, tenant-wide data lake automatically provisioned with every Microsoft Fabric tenant, storing all items as Delta-Parquet.
  • OneLake Shortcuts let Fabric query data in ADLS Gen2, Amazon S3, or Google Cloud Storage in place without copying it.
  • The medallion architecture organizes lakehouse data into bronze (raw), silver (cleansed), and gold (curated business-ready) zones.
Last updated: June 2026

Modern Data Warehouse and Lakehouse

Quick Answer: A modern data warehouse on Azure combines a data lake (cheap, schema-on-read storage for raw and refined files) with a SQL engine (structured, governed serving layer). The lakehouse pattern collapses both layers into one open storage tier — typically Delta Lake on Azure Data Lake Storage Gen2 or OneLake in Microsoft Fabric — so warehouse-style SQL, Spark notebooks, and BI tools all query the same files.

For the DP-900 exam, you need to recognize the five logical stages of any analytics platform, how the three common storage patterns differ, and where Microsoft's current services fit each stage.

The Five-Stage Reference Architecture

Microsoft documents every analytics workload as a pipeline of five logical stages. Knowing which Azure service plays which role is the single most useful mental model for the exam.

StagePurposeCommon Azure services (2026)
1. IngestPull data from operational systems and streamsAzure Data Factory, Synapse Pipelines, Fabric Data Factory, Event Hubs, IoT Hub
2. StoreLand raw and curated data cheaply and durablyAzure Data Lake Storage Gen2, OneLake, Blob Storage
3. Process / TransformClean, join, aggregate, and model the dataSynapse Spark pools, Azure Databricks, Fabric Data Engineering, dataflows
4. Serve / ModelExpose governed, query-ready datasetsSynapse dedicated SQL pool, Fabric Warehouse, Fabric Lakehouse SQL endpoint
5. Visualize / ConsumeDeliver insight to humans and appsPower BI, Excel, custom apps via REST/ODBC

This same flow appears in Microsoft's reference architectures for the modern data warehouse, advanced analytics, and real-time analytics scenarios — only the service mix changes.

Three Storage Patterns You Must Distinguish

Data warehouse — A relational store optimized for analytical queries. Data is schema-on-write: it is cleaned, modeled into star or snowflake schemas, and loaded before it can be queried. Strong governance and BLAZING query speed on structured data, but loading is expensive and unstructured data (images, audio, logs) does not fit naturally.

Data lake — A flat folder hierarchy that stores files in any format (CSV, JSON, Parquet, images). It is schema-on-read: structure is applied only when something queries the file. Cheap, infinitely scalable, and friendly to data scientists, but without governance a lake degrades into a data swamp with duplicate, untrusted files.

Lakehouse — A single tier that gives you lake economics and warehouse semantics. It is built on an open table format — Delta Lake (the default in Microsoft Fabric and Databricks), Apache Iceberg, or Apache Hudi — that adds ACID transactions, time travel, schema enforcement, and indexing on top of Parquet files in the lake. SQL engines and Spark engines query the same files.

CapabilityWarehouseLakeLakehouse
Stores unstructured dataNoYesYes
ACID transactionsYesNoYes
Schema enforcementYesNoOptional
BI tool friendlyExcellentPoorGood
Data science / ML friendlyLimitedExcellentExcellent
Storage cost per TBHighLowLow
Open format (no vendor lock-in)NoYesYes

OneLake: One Lake for the Whole Tenant

OneLake is Microsoft Fabric's built-in, tenant-wide data lake. It is automatically provisioned with every Fabric tenant — there is exactly one OneLake per tenant, much like there is one OneDrive per user. Inside OneLake, data is organized into workspaces and then into items (lakehouses, warehouses, KQL databases). All items store data as Delta-Parquet files under the hood, so any Fabric engine — Spark, T-SQL, KQL, Power BI Direct Lake — can read the same physical data without copies.

Shortcuts let one OneLake folder point at data that already lives in Azure Data Lake Storage Gen2, Amazon S3, or Google Cloud Storage. The bytes are not moved or duplicated; Fabric queries them in place. This is what Microsoft means by the slogan "one copy of data for all your analytics."

The Medallion Architecture

Both lakehouses and warehouses commonly use a medallion layout to organize data quality zones:

  • Bronze (raw) — landed exactly as it arrived from the source. Append-only.
  • Silver (cleansed) — deduplicated, validated, type-corrected, conformed to common keys.
  • Gold (curated) — business-level aggregates, star schemas, ready for Power BI and reports.

You move data from bronze to silver to gold using Spark notebooks, dataflows, or stored procedures. Each layer is queryable, which makes lineage and debugging far easier than a monolithic ETL job.

When Each Pattern Wins

  • Choose a pure warehouse (Fabric Warehouse, Synapse dedicated SQL pool) when the workload is structured, the users are BI analysts, and predictable performance is more important than open formats.
  • Choose a pure data lake (ADLS Gen2 + Synapse serverless) when you have huge volumes of raw and semi-structured data, ad-hoc exploration dominates, and cost matters more than millisecond latency.
  • Choose a lakehouse (Fabric Lakehouse, Databricks) when you want one storage tier to serve both BI and data science teams, with ACID guarantees on Parquet.
Test Your Knowledge

An organization wants one storage tier that supports both Power BI reporting and data science notebooks, with ACID transactions on Parquet files and no vendor lock-in. Which pattern best fits?

A
B
C
D
Test Your Knowledge

In a Microsoft Fabric medallion architecture, which layer is appropriate for a Power BI semantic model that drives an executive dashboard?

A
B
C
D