5.1 Modern Data Warehouse and Lakehouse
Key Takeaways
- Every Azure analytics workload follows the same five stages: ingest, store, process, serve, and visualize.
- Data warehouses use schema-on-write for fast BI; data lakes use schema-on-read for cheap raw storage; lakehouses combine both using Delta Lake on Parquet.
- OneLake is the single, tenant-wide data lake automatically provisioned with every Microsoft Fabric tenant, storing all items as Delta-Parquet.
- OneLake Shortcuts let Fabric query data in ADLS Gen2, Amazon S3, or Google Cloud Storage in place without copying it.
- The medallion architecture organizes lakehouse data into bronze (raw), silver (cleansed), and gold (curated business-ready) zones.
Modern Data Warehouse and Lakehouse
Quick Answer: A modern data warehouse on Azure combines a data lake (cheap, schema-on-read storage for raw and refined files) with a SQL engine (structured, governed serving layer). The lakehouse pattern collapses both layers into one open storage tier — typically Delta Lake on Azure Data Lake Storage Gen2 or OneLake in Microsoft Fabric — so warehouse-style SQL, Spark notebooks, and BI tools all query the same files.
For the DP-900 exam, you need to recognize the five logical stages of any analytics platform, how the three common storage patterns differ, and where Microsoft's current services fit each stage.
The Five-Stage Reference Architecture
Microsoft documents every analytics workload as a pipeline of five logical stages. Knowing which Azure service plays which role is the single most useful mental model for the exam.
| Stage | Purpose | Common Azure services (2026) |
|---|---|---|
| 1. Ingest | Pull data from operational systems and streams | Azure Data Factory, Synapse Pipelines, Fabric Data Factory, Event Hubs, IoT Hub |
| 2. Store | Land raw and curated data cheaply and durably | Azure Data Lake Storage Gen2, OneLake, Blob Storage |
| 3. Process / Transform | Clean, join, aggregate, and model the data | Synapse Spark pools, Azure Databricks, Fabric Data Engineering, dataflows |
| 4. Serve / Model | Expose governed, query-ready datasets | Synapse dedicated SQL pool, Fabric Warehouse, Fabric Lakehouse SQL endpoint |
| 5. Visualize / Consume | Deliver insight to humans and apps | Power BI, Excel, custom apps via REST/ODBC |
This same flow appears in Microsoft's reference architectures for the modern data warehouse, advanced analytics, and real-time analytics scenarios — only the service mix changes.
Three Storage Patterns You Must Distinguish
Data warehouse — A relational store optimized for analytical queries. Data is schema-on-write: it is cleaned, modeled into star or snowflake schemas, and loaded before it can be queried. Strong governance and BLAZING query speed on structured data, but loading is expensive and unstructured data (images, audio, logs) does not fit naturally.
Data lake — A flat folder hierarchy that stores files in any format (CSV, JSON, Parquet, images). It is schema-on-read: structure is applied only when something queries the file. Cheap, infinitely scalable, and friendly to data scientists, but without governance a lake degrades into a data swamp with duplicate, untrusted files.
Lakehouse — A single tier that gives you lake economics and warehouse semantics. It is built on an open table format — Delta Lake (the default in Microsoft Fabric and Databricks), Apache Iceberg, or Apache Hudi — that adds ACID transactions, time travel, schema enforcement, and indexing on top of Parquet files in the lake. SQL engines and Spark engines query the same files.
| Capability | Warehouse | Lake | Lakehouse |
|---|---|---|---|
| Stores unstructured data | No | Yes | Yes |
| ACID transactions | Yes | No | Yes |
| Schema enforcement | Yes | No | Optional |
| BI tool friendly | Excellent | Poor | Good |
| Data science / ML friendly | Limited | Excellent | Excellent |
| Storage cost per TB | High | Low | Low |
| Open format (no vendor lock-in) | No | Yes | Yes |
OneLake: One Lake for the Whole Tenant
OneLake is Microsoft Fabric's built-in, tenant-wide data lake. It is automatically provisioned with every Fabric tenant — there is exactly one OneLake per tenant, much like there is one OneDrive per user. Inside OneLake, data is organized into workspaces and then into items (lakehouses, warehouses, KQL databases). All items store data as Delta-Parquet files under the hood, so any Fabric engine — Spark, T-SQL, KQL, Power BI Direct Lake — can read the same physical data without copies.
Shortcuts let one OneLake folder point at data that already lives in Azure Data Lake Storage Gen2, Amazon S3, or Google Cloud Storage. The bytes are not moved or duplicated; Fabric queries them in place. This is what Microsoft means by the slogan "one copy of data for all your analytics."
The Medallion Architecture
Both lakehouses and warehouses commonly use a medallion layout to organize data quality zones:
- Bronze (raw) — landed exactly as it arrived from the source. Append-only.
- Silver (cleansed) — deduplicated, validated, type-corrected, conformed to common keys.
- Gold (curated) — business-level aggregates, star schemas, ready for Power BI and reports.
You move data from bronze to silver to gold using Spark notebooks, dataflows, or stored procedures. Each layer is queryable, which makes lineage and debugging far easier than a monolithic ETL job.
When Each Pattern Wins
- Choose a pure warehouse (Fabric Warehouse, Synapse dedicated SQL pool) when the workload is structured, the users are BI analysts, and predictable performance is more important than open formats.
- Choose a pure data lake (ADLS Gen2 + Synapse serverless) when you have huge volumes of raw and semi-structured data, ad-hoc exploration dominates, and cost matters more than millisecond latency.
- Choose a lakehouse (Fabric Lakehouse, Databricks) when you want one storage tier to serve both BI and data science teams, with ACID guarantees on Parquet.
An organization wants one storage tier that supports both Power BI reporting and data science notebooks, with ACID transactions on Parquet files and no vendor lock-in. Which pattern best fits?
In a Microsoft Fabric medallion architecture, which layer is appropriate for a Power BI semantic model that drives an executive dashboard?