3.1 Data Connections & Ingestion

Key Takeaways

  • Shortcuts virtualize external data into OneLake without copying it; mirroring keeps a near-real-time managed replica of an operational database; Copy/pipeline activities physically move bytes on a schedule.
  • Dataflow Gen2 is low-code Power Query shaping with built-in staging; data pipelines orchestrate activities and control flow; notebooks give full Spark/code control for heavy or custom transformations.
  • An on-premises data gateway (standard mode) is required to reach data sources behind a corporate firewall; cloud-to-cloud sources usually do not need a gateway.
  • A shortcut does not duplicate storage or require a scheduled refresh, so it is usually the cheapest way to expose ADLS Gen2, S3, or another Fabric lakehouse table in your workspace.
  • Choose mirroring when you need queryable analytical copies of Azure SQL, Cosmos DB, or Snowflake that stay continuously synced without authoring ETL.
Last updated: May 2026

Why This Section Carries the Exam

The Prepare data domain is the single largest scored area on DP-600 at 45-50%, and connectivity is its foundation. Many case-study and drag-and-drop items describe a source, a freshness requirement, and a cost or skill constraint, then ask you to pick the cheapest correct ingestion path. Knowing what each option physically does — copy vs. virtualize vs. replicate — is what separates a guess from a defensible answer.

The Three Ways Data Reaches OneLake

Every Fabric ingestion question reduces to one of three patterns:

  • Shortcut — a metadata pointer. The data stays in its original location (ADLS Gen2, Amazon S3, Google Cloud Storage, Dataverse, or another Fabric lakehouse/warehouse). Fabric reads it in place. No storage duplication, no refresh, no ETL.
  • Mirroring — a managed, near-real-time replica of an operational database (Azure SQL Database, Azure SQL Managed Instance, Azure Cosmos DB, Snowflake, and an expanding source list). Fabric keeps the analytical copy in OneLake continuously synced. You write no pipeline.
  • Copy / physical ingestion — Copy activity in a pipeline, Dataflow Gen2, or a notebook actually moves and lands the bytes. You own the schedule, the transformation, and the storage cost.

When to Use Each

Pattern       | Copies data?  | Freshness              | Author effort          | Best when
Shortcut      | No            | Live (source of truth) | Minimal                | Data already in ADLS/S3/another lakehouse; avoid duplication
Mirroring     | Yes (managed) | Near real-time         | None (point-and-click) | Need analytical copy of an operational DB without ETL
Copy activity | Yes           | Scheduled / triggered  | Low-medium             | Move data on a schedule; full or incremental load
Dataflow Gen2 | Yes (staged)  | Scheduled              | Low-code               | Citizen shaping with Power Query, reusable transforms
Notebook      | Yes           | Scheduled / triggered  | Code (Spark)           | Heavy volume, complex logic, ML feature prep
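
One way to internalize the table is to read it as a decision procedure. The sketch below is a study aid only, not a Fabric API: the `Scenario` fields and the `choose_ingestion` function are hypothetical names invented for illustration, and the rule order is one reasonable reading of the "Best when" column.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    # All field names are hypothetical, chosen for this sketch only.
    source_kind: str                    # "file_store", "operational_db", or "other"
    avoid_duplication: bool = False     # data must not be copied
    needs_near_real_time: bool = False  # analytical copy must stay continuously synced
    low_code_team: bool = False         # analysts who prefer Power Query
    heavy_or_complex: bool = False      # large volume or complex logic


def choose_ingestion(s: Scenario) -> str:
    """Return the pattern from the table that best matches the scenario."""
    # Data already in ADLS/S3/another lakehouse: point at it instead of copying.
    if s.source_kind == "file_store" and s.avoid_duplication:
        return "Shortcut"
    # Operational database that must stay synced without authored ETL.
    if s.source_kind == "operational_db" and s.needs_near_real_time:
        return "Mirroring"
    # Everything else is a physical copy; pick the engine by scale and skill set.
    if s.heavy_or_complex:
        return "Notebook"
    if s.low_code_team:
        return "Dataflow Gen2"
    return "Copy activity"


print(choose_ingestion(Scenario("file_store", avoid_duplication=True)))
print(choose_ingestion(Scenario("operational_db", needs_near_real_time=True)))
```

Real exam scenarios mix several constraints, so treat the ordering here as a first pass: check for "no duplication" and "no ETL" cues before weighing the transformation engine.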

Choosing the Transformation Engine

Once data lands, DP-600 still tests how you reshape it. The decision is usually skill-set and scale:

  • Dataflow Gen2 — Power Query (M) in a visual editor. Built-in staging, hundreds of connectors, output destination to lakehouse/warehouse/KQL. Best for analysts and reusable, low-code shaping.
  • Data pipeline — orchestration: Copy activity, control flow (ForEach, If, lookups), parameters, scheduling, and invoking dataflows or notebooks. It moves and coordinates, but is not itself a rich row-level transformer.
  • Notebook — Spark (PySpark, Spark SQL, Scala, R). Maximum control for large-scale, complex, or programmatic transformations and writing Delta tables to a lakehouse.

A frequent exam trap: a pipeline alone does not do complex column-level transformation — it orchestrates a Dataflow Gen2 or notebook that does.

[Figure: Ingestion decision flow]

Gateways: The On-Premises Gate

Fabric is a cloud service, so it cannot reach a data source sitting behind a corporate firewall on its own. The on-premises data gateway is the secure bridge.

  • Standard (enterprise) gateway — multi-user, shared across the tenant; used by Dataflow Gen2, datasets/semantic models, and pipelines to reach on-premises SQL Server, file shares, SAP, and other firewalled sources.
  • Personal mode gateway — single user, Power BI Desktop scenarios; not for shared enterprise pipelines.
  • Virtual network (VNet) data gateway — a managed gateway for securely reaching sources inside an Azure virtual network without managing gateway VMs.

The exam-relevant rule: if the source is on-premises or network-isolated, a gateway is mandatory; pure cloud-to-cloud connections (e.g., Azure SQL Database to Fabric) generally do not need one. A reusable connection object stores the credentials and binding so multiple items can share authenticated access without re-entering secrets.
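
The gateway rule can be written down as a small lookup. This is a memory aid under stated assumptions, not a Fabric API: `required_gateway` and its `location` labels ("on_premises", "azure_vnet", "public_cloud") are hypothetical names made up for this sketch.

```python
from typing import Optional


def required_gateway(location: str) -> Optional[str]:
    """Map where a source lives to the gateway type it needs.

    `location` uses hypothetical labels for this sketch:
    "on_premises", "azure_vnet", or "public_cloud".
    """
    if location == "on_premises":
        # Firewalled corporate source, e.g. on-premises SQL Server or a file share.
        return "On-premises data gateway (standard mode)"
    if location == "azure_vnet":
        # Source isolated inside an Azure virtual network.
        return "Virtual network (VNet) data gateway"
    if location == "public_cloud":
        # Cloud-to-cloud, e.g. Azure SQL Database to Fabric: generally no gateway.
        return None
    raise ValueError(f"unknown location: {location!r}")


print(required_gateway("on_premises"))
print(required_gateway("public_cloud"))
```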

Full Load vs. Incremental Load

Ingestion questions often add a volume or cost twist:

  • Full load — reload every row each run. Simple, but expensive and slow at scale.
  • Incremental load — load only new or changed rows, typically using a watermark column (e.g., LastModified) or change data capture (CDC). Preferred for large, frequently updated tables to reduce cost and runtime.

When a scenario stresses large tables, daily refresh, and minimizing capacity consumption, the answer is almost always incremental, not full reload.
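
The watermark idea can be illustrated in plain Python, with no Fabric or Spark dependency. This is a minimal sketch under assumptions: the row schema (`id`, `amount`, `LastModified`), the in-memory `target` dict standing in for a destination table, and the `incremental_load` function are all hypothetical and exist only to show the mechanism.

```python
from datetime import datetime


def incremental_load(source_rows, target, watermark):
    """Upsert rows changed after `watermark` into `target`; return the new watermark.

    source_rows: iterable of dicts with "id" and "LastModified" keys (hypothetical schema)
    target:      dict keyed by id, standing in for the destination table
    watermark:   LastModified value of the newest row loaded by the previous run
    """
    new_watermark = watermark
    for row in source_rows:
        if row["LastModified"] > watermark:  # only new or changed rows
            target[row["id"]] = row          # insert, or overwrite by key
            new_watermark = max(new_watermark, row["LastModified"])
    return new_watermark


source = [
    {"id": 1, "amount": 10, "LastModified": datetime(2026, 1, 1)},
    {"id": 2, "amount": 20, "LastModified": datetime(2026, 1, 2)},
]
target = {}
wm = incremental_load(source, target, datetime.min)  # first run loads both rows

# Between runs: row 1 is updated and row 3 is added; row 2 is unchanged.
source[0] = {"id": 1, "amount": 15, "LastModified": datetime(2026, 1, 3)}
source.append({"id": 3, "amount": 30, "LastModified": datetime(2026, 1, 3)})
wm = incremental_load(source, target, wm)            # second run touches ids 1 and 3 only
```

The strict `>` comparison means a row that arrives later with a timestamp equal to the stored watermark would be skipped; production designs often handle this with an overlap window or CDC instead.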

Test Your Knowledge

A retail analytics team already keeps several years of sales Parquet files in an existing Azure Data Lake Storage Gen2 account. They want the data available in a Fabric lakehouse for Spark and SQL analysis with the lowest possible storage cost and no duplicate copies. What should they implement?

Test Your Knowledge

An operations team needs near-real-time analytical access to an Azure SQL Database that powers a transactional app. They want the analytical copy in OneLake to stay continuously synchronized but do not want to build or maintain any ETL pipelines. Which approach best fits?
