5.3 Azure Data Factory and Pipelines

Key Takeaways

  • Azure Data Factory is built from linked services, datasets, activities, pipelines, and triggers, executed by integration runtimes.
  • Use Azure IR for cloud-to-cloud work, Self-hosted IR to reach on-prem or firewalled sources, and Azure-SSIS IR to lift and shift existing SSIS packages.
  • Mapping data flows give code-free, visual transformations that ADF compiles into Spark jobs under the hood.
  • Tumbling window triggers support backfill and run-to-run dependency, making them the natural choice for incremental loads.
  • Synapse pipelines and Microsoft Fabric Data Factory reuse the ADF model: Synapse pipelines share most ADF activities and triggers but omit some features, and Fabric Data Factory repackages the same concepts as a SaaS experience.

Last updated: June 2026

Azure Data Factory (ADF) is Microsoft's cloud-native data integration service for extract-transform-load (ETL) and extract-load-transform (ELT) at scale. It moves and reshapes data across 100+ connectors (on-premises systems, SaaS apps, Azure services, AWS, GCP) without requiring custom code for each source.

Quick Answer: ADF orchestrates data movement using five core concepts: linked services (connection strings), datasets (named pointers to data), activities (units of work), pipelines (DAGs of activities), and triggers (when to run). It executes those activities on integration runtimes — Azure IR for cloud, Self-hosted IR for on-prem, Azure-SSIS IR for legacy SSIS packages.

The Five Core Concepts

| Concept        | What it is                                       | Typical example                                        |
|----------------|--------------------------------------------------|--------------------------------------------------------|
| Linked service | A connection string to a data source or compute  | "AzureSqlDb-Sales" pointing at a production database   |
| Dataset        | A named view of data inside a linked service     | The dbo.Orders table inside that database              |
| Activity       | One unit of work                                 | Copy, Lookup, ForEach, Stored Procedure, Notebook      |
| Pipeline       | An ordered or parallel group of activities       | "Daily Sales Refresh"                                  |
| Trigger        | A rule that starts a pipeline                    | Schedule, tumbling window, storage event, manual       |

Once you understand this hierarchy, most ADF exam questions come down to identifying which concept the scenario describes.
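
To make the hierarchy concrete, here is a minimal sketch in Python that models the five concepts as plain dictionaries and walks the reference chain from a trigger down to the linked services it ultimately touches. This is a simplified study model, not ADF's actual JSON schema; the resource names (`Daily2am`, `SalesOrders`, `LakeOrders`, `DataLake`) and the helper `linked_services_for` are hypothetical.

```python
# Simplified model of the five concepts. All names are illustrative.
linked_services = {
    "AzureSqlDb-Sales": {"type": "AzureSqlDatabase"},
    "DataLake": {"type": "AzureBlobFS"},
}

# Each dataset points at exactly one linked service.
datasets = {
    "SalesOrders": {"linkedService": "AzureSqlDb-Sales", "table": "dbo.Orders"},
    "LakeOrders": {"linkedService": "DataLake", "folder": "bronze/orders"},
}

# A pipeline groups activities; a Copy activity reads one dataset and writes another.
pipelines = {
    "Daily Sales Refresh": {
        "activities": [
            {"name": "CopyOrders", "type": "Copy",
             "input": "SalesOrders", "output": "LakeOrders"},
        ]
    }
}

# A trigger only knows which pipeline to start.
triggers = {
    "Daily2am": {"type": "ScheduleTrigger", "pipeline": "Daily Sales Refresh"},
}


def linked_services_for(trigger_name):
    """Walk trigger -> pipeline -> activities -> datasets -> linked services."""
    pipeline = pipelines[triggers[trigger_name]["pipeline"]]
    found = []
    for activity in pipeline["activities"]:
        for role in ("input", "output"):
            service = datasets[activity[role]]["linkedService"]
            if service not in found:
                found.append(service)
    return found
```

Calling `linked_services_for("Daily2am")` follows the same chain you would trace when reading an exam scenario: the trigger names a pipeline, the pipeline's activities name datasets, and each dataset names its linked service.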

Integration Runtimes (IR)

The integration runtime is the compute that actually runs activities. ADF gives you three flavors, and choosing the right one is a frequent exam topic.

  • Azure IR — Fully managed Microsoft-hosted compute that runs in Azure regions. Use it for cloud-to-cloud data movement and for running mapping data flows. No infrastructure to manage.
  • Self-hosted IR — A small agent you install on a Windows VM (on-premises or in another cloud). Use it whenever ADF needs to reach a data source behind a firewall, such as an on-prem SQL Server, file share, or SAP system. You can scale out by installing the agent on multiple nodes.
  • Azure-SSIS IR — A managed cluster of Azure VMs that runs SQL Server Integration Services (SSIS) packages. Use it to lift-and-shift existing SSIS workloads to Azure without rewriting them.
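
The selection rule in the list above can be written as a small decision function. This is a study aid rather than an ADF API: the function name and its two boolean parameters are assumptions made for illustration, and real scenarios may involve more factors (for example, managed virtual networks).

```python
def choose_integration_runtime(runs_ssis_packages: bool,
                               source_behind_firewall: bool) -> str:
    """Pick the integration runtime type for a scenario, per the rules above."""
    if runs_ssis_packages:
        # Existing SSIS packages are lifted and shifted onto the SSIS runtime.
        return "Azure-SSIS IR"
    if source_behind_firewall:
        # On-prem or otherwise private sources need an agent inside the network.
        return "Self-hosted IR"
    # Everything else: cloud-to-cloud movement and mapping data flows.
    return "Azure IR"
```

For example, copying from an on-prem SQL Server to ADLS Gen2 gives `choose_integration_runtime(False, True)`, which returns `"Self-hosted IR"`.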

Common Activity Types

  • Copy activity — The workhorse: reads from a source dataset, writes to a sink dataset, optionally applying simple schema mapping and compression. Used for ingestion into the lake or warehouse.
  • Lookup activity — Returns a value or rowset from a source so later activities can use it (for example, the maximum modified date for an incremental load).
  • ForEach activity — Iterates over a list and runs inner activities in parallel or sequentially. Perfect for "copy 50 tables" or "process every folder."
  • If Condition / Switch — Branch the pipeline based on a value or expression.
  • Until — Loop until a condition becomes true.
  • Get Metadata — Inspect file or folder attributes (size, last modified, child items).
  • Web / Webhook — Call REST endpoints.
  • Notebook / Stored Procedure / Spark — Invoke external compute to do the actual transformation.
  • Execute Pipeline — Call another pipeline as a child.

The control-flow activities in this list (ForEach, If Condition, Switch, Until, Execute Pipeline) make ADF a real orchestration engine, not just a copy tool.
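
As a rough illustration of how these activities combine, the sketch below builds a pipeline definition in which a Lookup feeds a parallel ForEach that wraps a Copy activity. The property names follow the general shape of ADF pipeline JSON as commonly documented, but treat them as something to verify against the official reference. The dataset names (`ControlTable`, `SourceTable`, `LakeFolder`), the control query, and the function name are hypothetical.

```python
import json


def build_foreach_copy_pipeline(batch_count=20):
    """Return a pipeline definition: look up a table list, then copy each table."""
    lookup = {
        "name": "GetTableList",
        "type": "Lookup",
        "typeProperties": {
            # Hypothetical control table listing the tables to copy.
            "source": {"type": "AzureSqlSource",
                       "sqlReaderQuery": "SELECT name FROM etl.TablesToCopy"},
            "dataset": {"referenceName": "ControlTable", "type": "DatasetReference"},
            "firstRowOnly": False,  # return the whole rowset, not a single row
        },
    }
    # The current loop item supplies the table name to both datasets.
    table_param = {"tableName": {"value": "@item().name", "type": "Expression"}}
    copy = {
        "name": "CopyOneTable",
        "type": "Copy",
        "inputs": [{"referenceName": "SourceTable", "type": "DatasetReference",
                    "parameters": table_param}],
        "outputs": [{"referenceName": "LakeFolder", "type": "DatasetReference",
                     "parameters": table_param}],
        "typeProperties": {"source": {"type": "SqlServerSource"},
                           "sink": {"type": "ParquetSink"}},
    }
    foreach = {
        "name": "ForEachTable",
        "type": "ForEach",
        # Run only after the Lookup has succeeded.
        "dependsOn": [{"activity": "GetTableList",
                       "dependencyConditions": ["Succeeded"]}],
        "typeProperties": {
            "isSequential": False,      # iterate in parallel
            "batchCount": batch_count,  # upper bound on concurrent iterations
            "items": {"value": "@activity('GetTableList').output.value",
                      "type": "Expression"},
            "activities": [copy],
        },
    }
    return {"name": "CopyAllTables",
            "properties": {"activities": [lookup, foreach]}}


# Serialize to JSON text for inspection.
pipeline_json = json.dumps(build_foreach_copy_pipeline(), indent=2)
```

The design point to notice is that the ForEach does no data movement itself: it only supplies `@item()` to the inner Copy activity, which is why "ForEach plus Copy" is the usual answer for "copy N tables in parallel."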

Mapping Data Flows (Code-Free Transformation)

A mapping data flow is ADF's visual, drag-and-drop transformation designer. You drop sources, transformations (join, derive, aggregate, pivot, window, surrogate key), and sinks on a canvas, and ADF compiles the result into a Spark job that runs on an Azure IR-managed Spark cluster behind the scenes. You get the power of Spark without writing PySpark.

Wrangling data flows (surfaced in current ADF as the Power Query activity) are a separate, less common option that gives Power Query users a familiar M-language experience for self-service data preparation. Mapping data flows are the more frequent exam answer for code-free ETL.
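
To picture what a mapping data flow computes, here is a conceptual stand-in in plain Python for a three-step flow: join, derived column, aggregate. It is not how ADF executes the flow (that happens on Spark) and uses no ADF API; the sample rows, column names, and the function `run_flow` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical source data: an orders stream and a customer-to-region lookup.
orders = [
    {"order_id": 1, "customer_id": "C1", "qty": 2, "unit_price": 10.0},
    {"order_id": 2, "customer_id": "C2", "qty": 1, "unit_price": 25.0},
    {"order_id": 3, "customer_id": "C1", "qty": 3, "unit_price": 5.0},
]
customers = {"C1": "West", "C2": "East"}


def run_flow(order_rows, customer_regions):
    """Join orders to customers, derive a line total, aggregate by region."""
    totals = defaultdict(float)
    for row in order_rows:
        region = customer_regions[row["customer_id"]]   # join
        line_total = row["qty"] * row["unit_price"]     # derived column
        totals[region] += line_total                    # aggregate
    return dict(totals)
```

On the designer canvas each commented step would be its own transformation box; the point of the sketch is only that the flow is a chain of row-level and group-level operations between a source and a sink.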

Triggers

Pipelines do not run on their own. You attach a trigger:

  • Schedule trigger — Run at a fixed cron-style time (every day at 2 AM, every Monday).
  • Tumbling window trigger — Run in non-overlapping, time-boxed windows; supports backfill and dependency on previous runs. Ideal for incremental loads.
  • Storage event trigger — Fire when a blob is created or deleted in ADLS Gen2 or Blob Storage. Perfect for arrival-driven pipelines.
  • Custom event trigger — Fire from an Azure Event Grid topic.
  • Manual — Trigger on demand from the portal, REST API, PowerShell, or Azure CLI.
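
The backfill behavior of a tumbling window trigger is easiest to see by enumerating the windows. The sketch below is a simplified model, assuming fixed-size windows and ignoring options such as delay, retry, and concurrency limits; the function name is hypothetical. In ADF itself, a pipeline typically reads the window bounds from the trigger's `windowStartTime` and `windowEndTime` outputs.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, now, interval):
    """Return every complete, non-overlapping [start, end) window up to `now`."""
    windows = []
    window_start = start
    while window_start + interval <= now:
        windows.append((window_start, window_start + interval))
        window_start += interval
    return windows


# A start time in the past produces one run per elapsed window (backfill).
backfill = tumbling_windows(
    start=datetime(2024, 1, 1, 0, 0),
    now=datetime(2024, 1, 1, 3, 30),
    interval=timedelta(hours=1),
)
```

Because each window's end is the next window's start, an incremental load that filters on the window bounds reads every row exactly once, which is why this trigger type suits incremental loads better than a plain schedule.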

Synapse Pipelines Are an ADF Subset

The pipeline engine inside Azure Synapse Analytics uses the same concepts and most of the same activities as ADF. The differences are mostly packaging: Synapse pipelines run inside a Synapse workspace, do not support all ADF features (such as SSIS IR), but add tight integration with Synapse SQL pools and Spark pools. Microsoft Fabric Data Factory uses the same model again, repackaged as a SaaS experience inside Fabric.

A Typical ADF Pattern

  1. A schedule trigger kicks off the pipeline at 2 AM.
  2. A Lookup reads the maximum OrderId already in the warehouse.
  3. A ForEach iterates over source tables. Inside the loop a Copy activity uses a Self-hosted IR to pull new rows from on-prem SQL Server into ADLS Gen2.
  4. A Notebook activity calls Databricks to transform the bronze files into silver Delta tables.
  5. A Stored Procedure activity refreshes the gold dimensional model in the dedicated SQL pool.
  6. A Web activity calls Power BI's REST API to refresh the dataset.
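
Steps 2 and 3 of this pattern amount to a high-watermark incremental load. The sketch below mimics that logic on in-memory rows, assuming `OrderId` only ever increases; the function name and sample data are hypothetical, and in ADF the filter would normally be pushed into the Copy activity's source query rather than applied in memory.

```python
def incremental_copy(source_rows, warehouse_rows):
    """Return (watermark, new_rows) for a high-watermark incremental load."""
    # Step 2 (Lookup): highest OrderId already loaded, or 0 if the target is empty.
    watermark = max((row["OrderId"] for row in warehouse_rows), default=0)
    # Step 3 (Copy): pull only rows beyond the watermark.
    new_rows = [row for row in source_rows if row["OrderId"] > watermark]
    return watermark, new_rows


source = [{"OrderId": i} for i in (1, 2, 3, 4)]
warehouse = [{"OrderId": 1}, {"OrderId": 2}]
watermark, to_load = incremental_copy(source, warehouse)
```

Here the Lookup result is 2, so only orders 3 and 4 are copied. Running the same logic again after they land would copy nothing, which is the property that makes the pattern safe to rerun.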

Test Your Knowledge

An Azure Data Factory pipeline needs to copy data from an on-premises SQL Server database behind a corporate firewall into Azure Data Lake Storage Gen2. Which integration runtime should you use?

Test Your Knowledge

An ADF pipeline must iterate over 30 source tables and copy each one into ADLS Gen2 in parallel. Which combination of components is most appropriate?
