3.2 Core Workflows and Decision Points

Key Takeaways

  • Use Copy activity for fast bulk EL movement, Copy job for simplified incremental/CDC copy, Dataflow Gen2 for Power Query transforms, and notebooks for complex or large-scale Spark logic.
  • Copy job tracks last-run state automatically and supports watermark-based and CDC-based incremental copy, removing the need to build watermark logic by hand.
  • Incremental loading needs a reliable, monotonically increasing watermark column such as ModifiedDate or RowVersion; back-datable columns silently miss rows.
  • Dataflow Gen2 destinations support Append (add new rows) and Replace (truncate-and-reload); Append suits incremental loads and Replace suits full refresh.
  • Pipelines orchestrate with control-flow activities (ForEach, Lookup, If Condition, Until) and dynamic expressions, and can run a Dataflow or notebook as an activity.
Last updated: June 2026

Choosing the Right Ingestion Tool

The single most tested decision in this domain is which tool to use. Microsoft publishes a decision guide, and the exam expects you to reproduce its logic. Apply this hierarchy:

  • Copy activity (in a pipeline) — the best low/no-code choice to move large volumes (up to petabytes) from many sources into a lakehouse or warehouse, ad hoc or scheduled. Use it for fast extract-and-load (EL) jobs that need no joins or business logic, just movement and basic mapping.
  • Copy job — a simplified, wizard-driven copy experience that natively supports bulk copy, incremental copy, and CDC replication. It automatically tracks the state of the last successful run, so you do not build watermark plumbing yourself. Choose it when you need recurring incremental loads with minimal effort.
  • Dataflow Gen2 — choose it when the transformation can be expressed in Power Query (M): cleaning, reshaping, merges, and business-logic transforms before Silver, especially for analysts and citizen developers.
  • Notebook (Spark) — choose it for complex transformations, large datasets, ML, or custom incremental logic (slowly changing dimensions, schema inference, window functions over hundreds of millions of rows). The Mashup engine behind Dataflows runs single-node and hits memory limits where Spark distributes across nodes.

Think of these as a ladder of increasing code and power. Copy activity and Copy job sit at the no-code movement end; Dataflow Gen2 adds low-code Power Query transforms; the notebook is the pro-code escape hatch for anything the others cannot express. When two tools could both work, the exam almost always wants the simplest tool that meets the requirement — choosing a heavier tool than necessary is the wrong answer, and so is choosing a lighter tool that cannot scale to the stated volume.
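The hierarchy above can be sketched as a small decision function. This is only a study aid, not Microsoft's decision guide: the Requirement fields and the choose_ingestion_tool name are hypothetical, and real scenarios carry more nuance than four booleans.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical summary of an exam scenario (all fields are illustrative)."""
    needs_transform: bool = False          # joins, cleaning, or business logic?
    power_query_expressible: bool = False  # can it be written in Power Query (M)?
    large_or_complex: bool = False         # e.g. 200M rows, ML, window functions
    incremental_or_cdc: bool = False       # recurring incremental load or CDC?


def choose_ingestion_tool(req: Requirement) -> str:
    """Return the simplest tool that still meets the requirement."""
    if not req.needs_transform:
        # Pure movement: Copy job when last-run state must be tracked,
        # otherwise a plain Copy activity.
        return "Copy job" if req.incremental_or_cdc else "Copy activity"
    if req.power_query_expressible and not req.large_or_complex:
        return "Dataflow Gen2"
    # Anything the lighter tools cannot express or scale to.
    return "Notebook (Spark)"


print(choose_ingestion_tool(Requirement(incremental_or_cdc=True)))
print(choose_ingestion_tool(Requirement(needs_transform=True,
                                        power_query_expressible=True)))
```

The order of the checks mirrors the ladder: the function only reaches the notebook branch after every lighter option has been ruled out.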

Cost and Scale Signals

Exam scenarios bury hints that point to one tool. Watch for these signals:

Signal in the scenario | Points to
"No transformation, just move it fast" | Copy activity
"Recurring incremental load, least effort" | Copy job
"Analysts / citizen developers / Power Query" | Dataflow Gen2
"200M-row fact table, complex window functions" | Notebook (Spark)
"Change data capture from a SQL source" | Copy job (CDC) or mirroring
"Machine learning / data science workflow" | Notebook (Spark)

A Spark session reserves compute (roughly 4-8 capacity units per second) for its whole duration, so spinning up a notebook for a tiny incremental load is wasteful — that is a classic distractor. Conversely, forcing a 200-million-row transform through a single-node Dataflow is the wrong answer because it will run out of memory or run for hours.

Full vs Incremental Loads

A full load truncates the destination and reloads every row each run — simple, idempotent, but expensive at scale. An incremental load moves only rows that are new or changed since the last run, which is essential for large or frequently updated sources.

Incremental loading needs a reliable change-tracking mechanism:

  1. Watermark column — a monotonically increasing column such as ModifiedDate, LastUpdated, or RowVersion. Each run reads rows whose value is greater than the stored high-water mark, then advances the stored mark to the highest value it read. The column must never be back-dated, and every insert or update must set it; otherwise rows are silently skipped. Prefer system-maintained timestamps over user-editable ones.
  2. Change Data Capture (CDC) — when the source database emits change events, Copy job and mirroring can replicate inserts, updates, and deletes directly.
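To make the CDC idea concrete, here is a minimal sketch of applying an ordered stream of change events to a keyed target. The event shape (op, key, row) and the apply_changes helper are hypothetical and are not the format Copy job or mirroring uses internally. It illustrates why CDC can carry deletes, which a watermark filter on the source table cannot see once the row is gone.

```python
def apply_changes(target: dict, changes: list) -> dict:
    """Apply ordered change events to a table modeled as {key: row}."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            target[key] = change["row"]   # upsert the latest row image
        elif op == "delete":
            target.pop(key, None)         # remove the row if it exists
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


customers = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
events = [
    {"op": "insert", "key": 3, "row": {"name": "Cy"}},
    {"op": "update", "key": 1, "row": {"name": "Anna"}},
    {"op": "delete", "key": 2},
]
result = apply_changes(customers, events)
print(sorted(result))
```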

In a pipeline you implement a watermark pattern in three steps: a Lookup activity reads the stored watermark, a Copy activity moves only the rows above it, and a final step (for example a Script or Stored procedure activity) writes the new high-water mark back. Copy job does all of this automatically by tracking last-run state, which is why it is the "least effort" answer. Copy job's watermark mode accepts ROWVERSION, datetime, date, integer, and string-interpreted-as-datetime columns, and its CDC mode replicates inserts, updates, and deletes when CDC is enabled on the source.
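The hand-built watermark pattern can be sketched outside Fabric with Python's built-in sqlite3 module. This is an illustration under assumptions, not pipeline code: the table names (src_orders, dst_orders, watermark) and the incremental_copy function are hypothetical, and timestamps are stored as ISO 8601 strings so that string comparison matches time order.

```python
import sqlite3


def incremental_copy(conn: sqlite3.Connection) -> int:
    """One incremental run: lookup, filtered copy, advance the watermark."""
    cur = conn.cursor()
    # 1. Lookup: read the stored high-water mark.
    (last_mark,) = cur.execute(
        "SELECT value FROM watermark WHERE table_name = 'orders'"
    ).fetchone()
    # 2. Copy: move only rows modified after the mark.
    rows = cur.execute(
        "SELECT id, amount, modified FROM src_orders "
        "WHERE modified > ? ORDER BY modified",
        (last_mark,),
    ).fetchall()
    cur.executemany("INSERT OR REPLACE INTO dst_orders VALUES (?, ?, ?)", rows)
    # 3. Advance: store the highest value actually copied.
    if rows:
        cur.execute(
            "UPDATE watermark SET value = ? WHERE table_name = 'orders'",
            (rows[-1][2],),
        )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER PRIMARY KEY, amount REAL, modified TEXT);
    CREATE TABLE dst_orders (id INTEGER PRIMARY KEY, amount REAL, modified TEXT);
    CREATE TABLE watermark (table_name TEXT PRIMARY KEY, value TEXT);
    INSERT INTO watermark VALUES ('orders', '1900-01-01T00:00:00');
    INSERT INTO src_orders VALUES (1, 10.0, '2024-01-01T09:00:00'),
                                  (2, 20.0, '2024-01-02T09:00:00');
""")
first_run = incremental_copy(conn)   # both existing rows are above the mark
conn.execute("INSERT INTO src_orders VALUES (3, 30.0, '2024-01-03T09:00:00')")
# Back-dated row: its timestamp is below the stored mark, so it is never copied.
conn.execute("INSERT INTO src_orders VALUES (4, 40.0, '2024-01-01T12:00:00')")
second_run = incremental_copy(conn)  # picks up row 3 only
copied_ids = [r[0] for r in conn.execute("SELECT id FROM dst_orders ORDER BY id")]
print(first_run, second_run, copied_ids)
```

Row 4 never reaches the destination and no error is raised, which is the silent-skip failure that a back-datable watermark column causes.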

For streaming loads, the loading pattern is different again: data arrives continuously, so you design for append-only writes and windowed aggregation rather than periodic full or incremental batches. The exam's loading-pattern subsection explicitly lists "design and implement a loading pattern for streaming data" alongside full and incremental, so be ready to recognize when a requirement (sub-second latency, IoT telemetry, event-driven) rules out batch entirely and demands Eventstream or structured streaming.
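Windowed aggregation is easier to recognize once you have seen the bucketing arithmetic. The sketch below groups events into fixed (tumbling) windows in plain Python. It is a batch simulation for illustration only, not Eventstream or Spark structured streaming code, and the tumbling_window_avg name and the (timestamp, device, value) event shape are assumptions.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds=60):
    """Average value per device per fixed window.

    events: iterable of (epoch_seconds, device_id, value) tuples.
    Returns {(window_start, device_id): average}.
    """
    sums = defaultdict(lambda: [0.0, 0])   # (window, device) -> [total, count]
    for ts, device, value in events:
        window_start = ts - (ts % window_seconds)  # floor to the window boundary
        bucket = sums[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


readings = [
    (0, "a", 10.0),
    (30, "a", 20.0),
    (59, "b", 5.0),
    (61, "a", 30.0),   # falls into the next 60-second window
]
averages = tumbling_window_avg(readings)
print(averages)
```

Each event is only ever added to a bucket, never used to rewrite earlier output, which matches the append-only design the streaming pattern calls for.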

Dataflow Destinations and Pipeline Control Flow

Dataflow Gen2 writes to a configurable data destination — Lakehouse table, Warehouse table, Azure SQL Database, or KQL Database — with two update methods:

  • Replace: every refresh removes existing destination data and writes the full output. Use it for full refresh.
  • Append: every refresh adds the output rows to the existing table. Use it for daily incremental loads.
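The difference between the two update methods can be modeled with a list standing in for the destination table. This is a conceptual sketch, not Dataflow Gen2 code; the refresh function and the sample rows are hypothetical.

```python
def refresh(destination: list, output: list, method: str) -> list:
    """Return the destination table contents after one refresh."""
    if method == "Replace":
        return list(output)                 # existing rows are discarded
    if method == "Append":
        return destination + list(output)   # existing rows are kept
    raise ValueError(f"unknown update method: {method}")


table = [("2024-01-01", 100)]
day_two = [("2024-01-02", 120)]

appended = refresh(table, day_two, "Append")    # prior day survives
replaced = refresh(table, day_two, "Replace")   # only the new output remains
rerun = refresh(appended, day_two, "Append")    # same output appended twice
print(len(appended), len(replaced), len(rerun))
```

The rerun line shows the trade-off: Append keeps history but is not idempotent, so refreshing the same batch twice duplicates its rows, while Replace is safe to rerun but rewrites everything.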

Dataflows can also stage query output to the linked lakehouse to enable the enhanced compute engine for fold-friendly joins and large transforms.

Pipelines orchestrate everything. Beyond Copy and Dataflow activities, control-flow activities give you logic:

  • ForEach — iterate over a collection (for example, copy a list of tables).
  • Lookup / Get Metadata — read a value or list files to drive the run.
  • If Condition and Switch — branch on a value.
  • Until — loop until a condition is met.

Dynamic expressions (expressions that begin with @, plus @{...} for string interpolation, combined with pipeline parameters and variables) make pipelines reusable. For example, @item().tableName inside a ForEach lets one parameterized Copy activity copy many tables.
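The ForEach pattern maps closely onto an ordinary loop. The Python analogy below is not pipeline JSON: run_foreach and copy_table are hypothetical stand-ins for the ForEach activity and the parameterized Copy activity, and the schema names and "Tables/" sink path are made up for illustration. Inside copy_table, item["tableName"] plays the role of @item().tableName.

```python
def copy_table(item: dict) -> dict:
    """Stand-in for one parameterized Copy activity run."""
    return {
        "source": f"{item['schema']}.{item['tableName']}",
        "sink": f"Tables/{item['tableName']}",   # hypothetical destination path
    }


def run_foreach(items: list, activity) -> list:
    """Stand-in for ForEach: run the inner activity once per item."""
    return [activity(item) for item in items]


# In a pipeline, this list would typically come from a Lookup activity or a parameter.
tables = [
    {"schema": "sales", "tableName": "Orders"},
    {"schema": "sales", "tableName": "Customers"},
    {"schema": "hr", "tableName": "Employees"},
]
runs = run_foreach(tables, copy_table)
for run in runs:
    print(run["source"], "->", run["sink"])
```

Adding another table means adding one entry to the list, not another Copy activity, which is the reuse that dynamic expressions provide.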

Test Your Knowledge

You must set up a recurring incremental load from an on-premises SQL Server using change data capture, with the least development effort and no hand-built watermark logic. Which Fabric capability fits best?

Test Your Knowledge

A Dataflow Gen2 loads a daily batch of new sales rows and must add them to the existing destination table without removing prior days. Which destination update method should be configured?

Test Your Knowledge

In a pipeline that must copy 40 source tables using one parameterized Copy activity, which control-flow activity drives the iteration?
