3.1 Data Connections & Ingestion

Key Takeaways

Shortcuts virtualize external data into OneLake without copying it; mirroring keeps a near-real-time managed replica of an operational database; Copy/pipeline activities physically move bytes on a schedule.
Dataflow Gen2 is low-code Power Query shaping with built-in staging; data pipelines orchestrate activities and control flow; notebooks give full Spark/code control for heavy or custom transformations.
An on-premises data gateway (standard mode) is required to reach data sources behind a corporate firewall; cloud-to-cloud sources usually do not need a gateway.
A shortcut does not duplicate storage or require a refresh, so it is usually the cheapest way to expose ADLS Gen2, S3, GCS, or another Fabric lakehouse table in your workspace.
Mirroring is generally available for Azure SQL Database, SQL Managed Instance, SQL Server, Azure Cosmos DB, Azure Database for PostgreSQL, Snowflake, Oracle, Google BigQuery, and SAP.

Last updated: June 2026

Why This Section Carries the Exam

The Prepare data domain is the single largest scored area on DP-600: Implementing Analytics Solutions Using Microsoft Fabric, weighted at 45-50% in the current 2026 skills outline (effective April 20, 2026), and connectivity is its foundation. The exam itself runs about 40-60 questions in 100 minutes, with a passing score of 700 on a 0-1000 scale, and a heavy diet of case studies, drag-and-drop, and best-answer scenarios. Many items describe a source, a freshness requirement, and a cost or skill constraint, then ask you to pick the cheapest correct ingestion path. Knowing what each option physically does (copy vs. virtualize vs. replicate) is what separates a guess from a defensible answer.

The Three Ways Data Reaches OneLake

Every Fabric ingestion question reduces to one of three patterns:

Shortcut — a metadata pointer. The data stays in its original location (ADLS Gen2, Amazon S3, S3-compatible storage, Google Cloud Storage, Dataverse, or another Fabric lakehouse/warehouse). Fabric reads it in place. No storage duplication, no refresh, no ETL.
Mirroring — a managed, near-real-time replica of an operational database. As of 2026, mirroring is generally available for Azure SQL Database, Azure SQL Managed Instance, on-premises/VM SQL Server, Azure Cosmos DB, Azure Database for PostgreSQL, Snowflake, Oracle, Google BigQuery, and SAP, plus open-mirroring partners. Fabric lands the replica as Delta tables in OneLake and keeps it continuously synced. You write no pipeline.
Copy / physical ingestion — a Copy activity in a pipeline, Dataflow Gen2, or a notebook actually moves and lands the bytes. You own the schedule, the transformation, and the storage cost.

When to Use Each

Pattern	Copies data?	Freshness	Author effort	Best when
Shortcut	No	Live (source of truth)	Minimal	Data already in ADLS/S3/GCS/another lakehouse; avoid duplication
Mirroring	Yes (managed)	Near real-time	None (point-and-click)	Need analytical copy of a supported operational DB without ETL
Copy activity	Yes	Scheduled / triggered	Low-medium	Move data on a schedule; full or incremental load
Dataflow Gen2	Yes (staged)	Scheduled	Low-code	Citizen shaping with Power Query, reusable transforms
Notebook	Yes	Scheduled / triggered	Code (Spark)	Heavy volume, complex logic, ML feature prep
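The table above can be read as a small decision procedure. The sketch below is one hedged way to encode it in Python. The Scenario class, its field names, the function name, and the order of the checks are all illustrative assumptions; none of them is a Fabric API or an official exam rule.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical summary of an exam scenario (field names are assumptions)."""
    already_in_supported_storage: bool  # data already sits in ADLS/S3/GCS/another lakehouse
    supported_operational_db: bool      # source is a database that mirroring supports
    needs_near_real_time: bool          # freshness requirement from the question
    low_code_preferred: bool            # authors want Power Query rather than code
    heavy_or_complex: bool              # heavy volume, complex logic, ML feature prep


def choose_ingestion_pattern(s: Scenario) -> str:
    """Return the row of the table that best matches the scenario."""
    if s.already_in_supported_storage:
        return "Shortcut"        # no copy, avoid duplication
    if s.supported_operational_db and s.needs_near_real_time:
        return "Mirroring"       # managed replica, no ETL to author
    if s.heavy_or_complex:
        return "Notebook"        # Spark code for scale or complexity
    if s.low_code_preferred:
        return "Dataflow Gen2"   # Power Query shaping
    return "Copy activity"       # scheduled full or incremental move


# Example: sales files that already live in ADLS Gen2.
print(choose_ingestion_pattern(Scenario(True, False, False, False, False)))
```

The checks run from cheapest to most effortful, which matches the way the scenarios are usually framed: rule out the no-copy option first, then the no-authoring option, then pick an engine.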

Choosing the Transformation Engine

Once data lands, DP-600 still tests how you reshape it. The decision usually comes down to skill set and scale:

Dataflow Gen2 — Power Query (M) in a visual editor. Built-in staging, hundreds of connectors, and an output destination to lakehouse, warehouse, or KQL database. Best for analysts and reusable, low-code shaping.
Data pipeline — orchestration: Copy activity, control flow (ForEach, If, lookups), parameters, scheduling, and invoking dataflows or notebooks. It moves and coordinates, but is not itself a rich row-level transformer.
Notebook — Spark (PySpark, Spark SQL, Scala, R). Maximum control for large-scale, complex, or programmatic transformations and writing Delta tables to a lakehouse.

A frequent exam trap: a pipeline alone does not do complex column-level transformation — it orchestrates a Dataflow Gen2 or notebook that does. Read the verb in the question; "schedule and coordinate" means pipeline, "reshape and clean" means Dataflow Gen2 or notebook.

[Diagram: Ingestion decision flow]

Gateways: The On-Premises Gate

Fabric is a cloud service, so it cannot reach a data source sitting behind a corporate firewall on its own. The on-premises data gateway is the secure bridge.

Standard (enterprise) gateway — multi-user, shared across the tenant; used by Dataflow Gen2, semantic models, and pipelines to reach on-premises SQL Server, file shares, SAP, and other firewalled sources.
Personal mode gateway — single user, Power BI Desktop scenarios; not for shared enterprise pipelines.
Virtual network (VNet) data gateway — a managed gateway for securely reaching sources inside an Azure virtual network without managing gateway VMs yourself.

The exam-relevant rule: if the source is on-premises or network-isolated, a gateway is mandatory; pure cloud-to-cloud connections (e.g., Azure SQL Database to Fabric) generally do not need one. A reusable connection object stores the credentials and binding so multiple items can share authenticated access without re-entering secrets.

Gateway Selection at a Glance

Source location	Gateway needed?	Which gateway
On-prem SQL Server / file share	Yes	Standard (enterprise)
Source inside an Azure VNet	Yes	VNet data gateway
Power BI Desktop personal refresh	Maybe	Personal mode
Azure SQL DB / Cosmos DB (public cloud)	No	None
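For quick recall, the same table can be written as a lookup. This is only a memorization aid under stated assumptions: the dictionary keys and the gateway_for helper are hypothetical labels, not Fabric identifiers or settings.

```python
from typing import Tuple

# Mirrors the table above: source location -> (gateway needed?, which gateway).
# The keys are hypothetical labels chosen for this sketch.
GATEWAY_BY_SOURCE = {
    "on_prem": ("Yes", "Standard (enterprise)"),
    "azure_vnet": ("Yes", "VNet data gateway"),
    "desktop_personal_refresh": ("Maybe", "Personal mode"),
    "public_cloud": ("No", "None"),
}


def gateway_for(source_location: str) -> Tuple[str, str]:
    """Look up the gateway guidance for a source location label."""
    try:
        return GATEWAY_BY_SOURCE[source_location]
    except KeyError:
        raise ValueError(f"unknown source location: {source_location!r}") from None


print(gateway_for("on_prem"))
```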

Full Load vs. Incremental Load

Ingestion questions often add a volume or cost twist:

Full load — reload every row each run. Simple, but expensive and slow at scale.
Incremental load — load only new or changed rows, typically using a watermark column (e.g., LastModified) or change data capture (CDC). Preferred for large, frequently updated tables to reduce capacity consumption and runtime.

Worked example: A 4-billion-row orders table refreshes daily, but only about 2 million rows change. A full Copy each night re-lands all 4 billion rows, burning capacity units and hours. Switching to an incremental Copy that filters WHERE LastModified > @lastWatermark moves roughly 2 million rows and finishes in minutes. When a scenario stresses large tables, daily refresh, and minimizing capacity consumption, the answer is almost always incremental, not full reload — and if you simply need the analytical copy always current with zero authoring, the answer shifts again to mirroring.
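The watermark pattern in the worked example can be sketched with Python's built-in sqlite3 module standing in for both source and destination. The orders schema, the column names, and the load_incremental helper are hypothetical; in Fabric this logic would normally live in a Copy activity source query or a notebook, not in SQLite.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for a source or destination table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
    )
    return conn


def load_incremental(src, dst, last_watermark):
    """Copy only rows changed after last_watermark; return (rows_moved, new_watermark)."""
    rows = src.execute(
        "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
        (last_watermark,),
    ).fetchall()
    # INSERT OR REPLACE acts as an upsert keyed on the primary key.
    dst.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    dst.commit()
    # ISO-8601 timestamps sort correctly as plain strings.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return len(rows), new_watermark


src, dst = make_db(), make_db()
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2026-01-01T00:00:00"),
        (2, 20.0, "2026-01-01T00:00:00"),
        (3, 30.0, "2026-01-02T08:30:00"),
    ],
)
src.commit()

# First run: every row is newer than the initial watermark, so all rows move.
moved_first, watermark = load_incremental(src, dst, "1900-01-01T00:00:00")

# One row changes at the source; the next run moves only that row.
src.execute(
    "UPDATE orders SET amount = 25.0, last_modified = '2026-01-03T09:00:00' WHERE id = 2"
)
src.commit()
moved_second, watermark = load_incremental(src, dst, watermark)

# Share of the table touched in the worked example: 2 million of 4 billion rows.
changed_fraction = 2_000_000 / 4_000_000_000
print(moved_first, moved_second, f"{changed_fraction:.2%}")
```

In the worked example the incremental run touches about 0.05% of the table, which is where the capacity saving comes from. A plain watermark filter does not see rows deleted at the source, which is one reason CDC is listed as the alternative.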

Test Your Knowledge

A retail analytics team already keeps several years of sales Parquet files in an existing Azure Data Lake Storage Gen2 account. They want the data available in a Fabric lakehouse for Spark and SQL analysis with the lowest possible storage cost and no duplicate copies. What should they implement?

A OneLake shortcut from the lakehouse to the ADLS Gen2 location

A Copy activity in a data pipeline that lands the files in the lakehouse on a daily schedule

Database mirroring of the ADLS Gen2 account into OneLake

A Dataflow Gen2 that imports all files into a new lakehouse table

Test Your Knowledge

An operations team needs near-real-time analytical access to an Azure SQL Database that powers a transactional app. They want the analytical copy in OneLake to stay continuously synchronized but do not want to build or maintain any ETL pipelines. Which approach best fits?

Schedule a Dataflow Gen2 every 15 minutes to pull changed rows

Create an on-premises data gateway and a nightly Copy activity

Configure Fabric database mirroring for the Azure SQL Database

Build a Spark notebook that reads the source with JDBC on a trigger
