3.1 Ingest and transform data Overview

Key Takeaways

  • Ingest and transform data is worth 30-35% of the DP-700 exam and splits into loading patterns, batch ingestion/transformation, and streaming ingestion/transformation.
  • Fabric's four core ingestion tools are pipelines (Copy activity), Copy job, Dataflows Gen2, and Spark notebooks; the exam constantly asks you to pick the right one.
  • The medallion architecture stores raw data in Bronze, cleaned/conformed data in Silver, and business-ready aggregates in Gold, usually as Delta tables in OneLake.
  • Shortcuts and mirroring bring external data into OneLake without copying it, while Copy job and pipelines physically move data into lakehouses or warehouses.
  • Transformation languages are tool-bound: PySpark/Spark SQL in notebooks, M in Dataflows Gen2, T-SQL in the warehouse, and KQL in the eventhouse.
Last updated: June 2026

What This Domain Tests

Ingest and transform data is one of three equally weighted functional groups on Exam DP-700, each worth 30-35% of your score. The official skills outline (as of April 20, 2026) breaks this domain into three subsections: design and implement loading patterns, ingest and transform batch data, and ingest and transform streaming data. Together they account for the largest single block of implementation questions you will face, and they assume hands-on fluency with Microsoft Fabric rather than theoretical knowledge.

The core skill being measured is judgment: given a scenario with a data source, a volume, a latency requirement, and a destination, you must select the correct Fabric item and the correct transformation language. Microsoft phrases many questions as "You need to... What should you use?" The exam rewards engineers who know the boundaries of each tool, not just that the tools exist.

The audience profile Microsoft publishes for DP-700 expects subject-matter expertise in data loading patterns, data architectures, and orchestration processes, and explicit skill in manipulating data with SQL, PySpark, and KQL. That three-language requirement is not incidental — it is the spine of this domain. Almost every transformation question hinges on whether the data is in a lakehouse (Spark), a warehouse (T-SQL), or an eventhouse (KQL), and the correct language follows from that store.

Expect case-study and standalone questions that hand you a half-built solution and ask for the next step, so you must reason about end-to-end flows, not isolated features.

The Fabric Ingestion Toolbox

Fabric gives you four primary ways to ingest and transform batch data, plus a real-time stack for streaming. The table below maps each tool to the job it does best and the transformation language it uses.

ToolBest forTransformation languageCode level
Data pipeline (Copy activity)Fast EL movement of large volumes; orchestrationLimited (orchestration only)Low/no-code
Copy jobSimplified incremental/CDC copy with built-in state trackingNone (movement only)No-code
Dataflow Gen2Power Query transformations, citizen-developer ETLM (Power Query)Low-code
Spark notebookComplex/large-scale transforms, ML, custom logicPySpark, Scala, Spark SQLPro-code
EventstreamCapturing and routing streaming eventsEvent processor (no-code) / SQLNo-code
Eventhouse / KQL DBReal-time analytics on high-volume telemetryKQLPro-code

A recurring exam theme is that transformation language is bound to the tool. You cannot write T-SQL in a Dataflow, and you cannot run Power Query M in a warehouse. Choosing the destination often dictates which language you can use downstream.

OneLake, Delta, and the Medallion Pattern

All Fabric data lands in OneLake, a single logical data lake for the tenant. Lakehouse tables are stored in Delta Lake (Delta-Parquet) format, which adds ACID transactions, time travel, and schema enforcement on top of Parquet files. The warehouse also persists its tables as Delta in OneLake, which is why a warehouse table is automatically queryable from a lakehouse SQL analytics endpoint and vice versa.

Microsoft's recommended design is the medallion architecture, which organizes data into three quality tiers:

  • Bronze (raw): data ingested exactly as it arrives, with no transformation. Bronze is your replayable source of truth.
  • Silver (enriched): cleaned, deduplicated, type-cast, and conformed data, often joined across sources.
  • Gold (curated): business-ready, aggregated, dimensional-model tables that semantic models and reports consume.

A common physical pattern uses lakehouses for Bronze and Silver (Spark-friendly, schema-on-read) and a warehouse for Gold (T-SQL, multi-table transactions, fine-grained security) so business users hit a familiar SQL endpoint. You move data between layers with notebooks, Dataflows, or pipelines depending on transformation complexity.

Bringing Data In Without Copying It

Not every ingestion physically moves bytes. Two virtualization features show up repeatedly on the exam:

  • Shortcuts create a pointer in OneLake to data that lives elsewhere — another lakehouse, Azure Data Lake Storage Gen2, Amazon S3, Google Cloud Storage, or Dataverse. The data is not copied; it is read in place, which avoids duplication and keeps a single copy of truth. Shortcuts are ideal for landing external storage into a Bronze layer.
  • Mirroring continuously replicates an operational database (Azure SQL Database, Azure Cosmos DB, Snowflake, and others) into OneLake as Delta tables in near real time, at no extra storage cost for the replica.

These contrast with Copy job and Copy activity, which physically land data. Knowing the difference — virtualize in place versus physically copy — is essential for the "least-effort / lowest-cost" style questions that dominate this domain.

How the three subsections build on each other

The domain's structure is deliberately sequential. Loading patterns comes first because every later choice depends on whether you need a full load, an incremental load, or a streaming load, and whether the target is a dimensional model. Batch ingestion and transformation then tests the tool matrix — pick a data store, choose between Dataflows Gen2, notebooks, KQL, and T-SQL, manage shortcuts and mirroring, and apply transforms such as denormalizing, grouping, aggregating, and handling duplicate, missing, and late-arriving rows.

Streaming ingestion and transformation closes the domain with Eventstreams, Spark structured streaming, KQL, native-table-versus-shortcut decisions in Real-Time Intelligence, and windowing functions. Treat the medallion architecture as the connective tissue: nearly every scenario can be located somewhere on the Bronze-Silver-Gold path, and identifying which layer the question is about usually reveals the intended tool and language.

Test Your Knowledge

A team needs to make data in an existing Amazon S3 bucket queryable from a Fabric lakehouse with the least effort and without duplicating storage. What should they use?

A
B
C
D
Test Your Knowledge

In a standard Fabric medallion architecture, which layer holds cleaned, deduplicated, and conformed data that is not yet aggregated for reporting?

A
B
C
D
Test Your Knowledge

Which storage format do Fabric lakehouse tables use, giving them ACID transactions and time travel on top of columnar files?

A
B
C
D