3.5 Practice Drills and Readiness Markers
Key Takeaways
- You are ready when you can instantly map a scenario to Copy activity, Copy job, Dataflow Gen2, notebook, Eventstream, or Eventhouse based on volume, latency, and effort cues.
- Master the medallion flow end to end: shortcut/mirror or Copy into Bronze, Spark/Dataflow into Silver, denormalized dimensional model into Gold.
- Be able to build a watermark incremental pipeline by hand (Lookup + Copy + watermark write) and explain when Copy job replaces all of it.
- Know the streaming stack cold: Eventstream sources/operations/destinations, derived streams, Eventhouse ingestion modes, KQL update policies, and the four window types.
- Practice optimization basics: OPTIMIZE compaction, V-Order, and partitioning for lakehouse tables that load slowly or query slowly.
Readiness Self-Check
Before exam day, you should be able to answer each of these in seconds:
- Given a source, volume, latency, and "least effort" constraint, name the correct ingestion tool and the transformation language for the destination.
- Sketch a medallion flow: how does raw data reach Bronze (shortcut, mirror, Copy, or Eventstream), Silver (notebook or Dataflow), and Gold (dimensional model in a warehouse)?
- Build a watermark-based incremental pipeline by hand, and explain why Copy job removes the need for it.
- Pick the right window function (tumbling, hopping, sliding, session) for a stated streaming aggregation.
- Decide between a native KQL table and a OneLake shortcut for Real-Time Intelligence, and when Query acceleration for a shortcut is worth it.
If any of these is slow, return to the relevant section. The exam gives you many short scenarios, so speed and confident elimination of distractors matter as much as raw knowledge.
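The watermark pattern named in the self-check (Lookup + Copy + watermark write) can be rehearsed outside Fabric with a few lines of Python. This is a minimal sketch that simulates the three steps with in-memory lists; the function name, the `modified_at` column, and the `orders` watermark key are hypothetical, and a real pipeline would use pipeline activities against a control table rather than a dictionary.

```python
from datetime import datetime

def incremental_load(source_rows, destination, watermark_store, key="orders"):
    """Simulate Lookup + Copy + watermark write with in-memory structures."""
    # 1. Lookup: read the last successfully loaded high-water mark.
    old_mark = watermark_store.get(key, datetime.min)
    # 2. Copy: move only the rows modified after the old watermark.
    changed = [r for r in source_rows if r["modified_at"] > old_mark]
    destination.extend(changed)
    # 3. Watermark write: advance the mark only after the copy succeeded.
    if changed:
        watermark_store[key] = max(r["modified_at"] for r in changed)
    return len(changed)

# Hypothetical sample data, for illustration only.
source = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 1, 2)},
]
dest, marks = [], {}
first = incremental_load(source, dest, marks)   # initial run copies everything
source.append({"id": 3, "modified_at": datetime(2024, 1, 3)})
second = incremental_load(source, dest, marks)  # next run copies only the new row
print(first, second, marks["orders"])
```

Note that the filter uses a last-modified timestamp. Filtering on a creation timestamp instead would still pick up inserts but silently miss updates to existing rows, which is the kind of bad-watermark symptom worth recognizing on sight.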
A useful mental checklist for any ingestion question is Source → Volume → Latency → Transform → Destination. Identify each in the prompt: where the data comes from, how much there is, how fresh it must be, what reshaping it needs, and where it must land. Those five answers almost always collapse to a single correct tool and language.
For example:
- Operational SQL source + millions of changed rows + near real time + no reshaping + OneLake resolves to mirroring.
- Flat files + terabytes + nightly + no reshaping + lakehouse resolves to Copy activity.
- Kafka + continuous + sub-second + filter + eventhouse resolves to Eventstream into a KQL database.
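The five-cue checklist can also be written down as a tiny rule set, which makes the "collapse to one tool" idea concrete. The sketch below is a toy, assuming made-up cue labels such as `operational-sql` and `near-real-time`; it covers only the three worked examples and is not a complete decision procedure for the exam.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    source: str       # where the data comes from
    volume: str       # how much there is
    latency: str      # how fresh it must be
    transform: str    # what reshaping it needs
    destination: str  # where it must land

def pick_tool(s: Scenario) -> str:
    """Toy rules covering only the three worked examples (not exhaustive)."""
    if s.latency == "sub-second" and s.destination == "eventhouse":
        return "Eventstream into a KQL database"
    if s.source == "operational-sql" and s.latency == "near-real-time" and s.transform == "none":
        return "Mirroring"
    if s.source == "flat-files" and s.latency == "nightly" and s.transform == "none":
        return "Copy activity"
    return "Undecided: re-read the scenario cues"

# Cue labels are hypothetical; they mirror the three examples in the text.
examples = [
    Scenario("operational-sql", "millions of changed rows", "near-real-time", "none", "onelake"),
    Scenario("flat-files", "terabytes", "nightly", "none", "lakehouse"),
    Scenario("kafka", "continuous", "sub-second", "filter", "eventhouse"),
]
for s in examples:
    print(s.source, "->", pick_tool(s))
```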
Drill: Tool-Selection Reps
Run these reps and justify each pick out loud:
| Scenario | Correct choice |
|---|---|
| Move 5 TB from blob storage to a lakehouse, no transforms, scheduled nightly | Copy activity in a pipeline |
| Recurring CDC replication from Azure SQL DB, least effort | Copy job (or mirroring) |
| Analysts clean and merge CSVs with Power Query before Silver | Dataflow Gen2 |
| SCD Type 2 over a 300M-row dimension | Spark notebook |
| Capture IoT telemetry and route to KQL + lakehouse | Eventstream |
| Interactive ad-hoc analytics on billions of log rows | Eventhouse / KQL |
| Make an existing ADLS Gen2 folder queryable without copying | OneLake shortcut |
When you can explain why each alternative is wrong, not just why your pick is right, you are at exam standard. Most DP-700 questions are won by eliminating two plausible-but-wrong tools.
Extend the drill to transformation-language reps. For each store, state the language and the one thing that store does best:
- Lakehouse → PySpark/Spark SQL, best for large distributed transforms and ML.
- Warehouse → T-SQL, best for multi-table transactions and stored-procedure logic.
- Dataflow Gen2 → M, best for analyst-friendly Power Query cleaning.
- Eventhouse → KQL, best for interactive analytics on telemetry.
Then practice the streaming reps: name the four window types and one requirement each fits — tumbling for non-overlapping period totals, hopping for moving averages, sliding for event-driven emission, session for grouping bursts by an inactivity gap. Being able to recite these without hesitation removes a whole class of avoidable mistakes.
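To make the window types less abstract, the sketch below assigns events to tumbling, hopping, and session windows in plain Python. It assumes integer timestamps in seconds and windows aligned to time zero, and it is only an illustration of the semantics, not how Eventstream or KQL implement them. Sliding windows are left out because their output is driven by events entering and leaving the window rather than by a fixed grid.

```python
def tumbling(ts, size):
    """Return the single non-overlapping window [start, end) containing ts."""
    start = (ts // size) * size
    return (start, start + size)

def hopping(ts, size, hop):
    """Return every overlapping window [start, end) containing ts.

    Windows start at multiples of `hop` (from time zero) and last `size`.
    """
    first = ((ts - size) // hop + 1) * hop
    return [(s, s + size) for s in range(max(first, 0), ts + 1, hop)]

def sessions(timestamps, gap):
    """Group events into sessions separated by more than `gap` of inactivity."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= gap:
            groups[-1].append(ts)  # still within the inactivity gap
        else:
            groups.append([ts])    # gap exceeded: start a new session
    return groups

print(tumbling(7, 10))
print(hopping(7, 10, 5))
print(sessions([1, 2, 3, 10, 11, 30], gap=5))
```

The contrast to notice is that a tumbling window places each event in exactly one window, a hopping window places it in several overlapping ones (which is why it suits moving averages), and a session window has no fixed length at all.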
Drill: Optimization and Reliability
The monitor-and-optimize domain overlaps here, so practice the table-tuning levers that keep ingestion healthy:
- OPTIMIZE / table compaction: merges many small Parquet files into fewer large ones; small-file proliferation from frequent micro-batch loads is the most common cause of slow lakehouse reads.
- V-Order: Fabric's write-time optimization that improves read performance for Power BI and SQL engines on Delta tables.
- Partitioning: partition large fact tables on a low-to-moderate-cardinality column that queries commonly filter on (often a date) so partition pruning lets queries scan less data; avoid over-partitioning (for example, on a high-cardinality key), which creates tiny files.
- VACUUM: removes data files that the table no longer references once they are older than the retention period, reclaiming storage; respect the retention window so you do not break time-travel queries that still need older versions.
- Pipeline retry and monitoring: configure activity retry counts and intervals, set timeouts, and use the Monitor hub and run history to find failed activities, inspect input/output, and re-run from the point of failure rather than restarting the whole pipeline.
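The small-file problem behind the compaction lever is easy to see with a little arithmetic. The sketch below greedily groups small files into bins of roughly a target size, which is the general idea of compaction; it is not how OPTIMIZE is implemented, and the 128 MB target and 4 MB file size are illustrative assumptions rather than Fabric defaults.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into bins of at most roughly target_mb."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current bin when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Hypothetical table: 100 micro-batch files of 4 MB each.
small_files = [4] * 100
plan = plan_compaction(small_files)
print(len(small_files), "files before,", len(plan), "files after")
```

The same arithmetic explains over-partitioning: dividing a fixed amount of data across more partitions shrinks the average file, so a reader has to open many more files to scan the same rows.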
Finally, rehearse the failure-mode reasoning: a slow load usually means small files (compact them), an incremental load that skips rows usually means a bad watermark, and a streaming gap usually means an under-provisioned eventhouse or the wrong ingestion mode. Pattern-matching symptoms to root causes is exactly how the optimization questions are framed.
Close your preparation by linking ingestion to the dimensional model it feeds. Gold tables should be denormalized, carry surrogate keys, and implement the correct SCD type, because the downstream semantic model and Power BI reports assume a clean star schema. If a report shows duplicated dimension members, the root cause is usually a missing dedup or a broken surrogate-key assignment upstream in Silver — not the report itself. Tracing a reporting symptom back through Gold, Silver, and Bronze to the offending ingestion step is the highest-order skill this domain tests, and it is what separates a passing score from a guess.
When you can do that reliably across batch and streaming paths, you have met the SIE/EA depth standard for this chapter and are ready for the Ingest-and-transform questions on DP-700.
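The duplicated-dimension-member symptom described above usually traces back to two upstream steps: deduplicating on the business key and assigning surrogate keys. A minimal sketch of both steps follows, assuming hypothetical column names (`customer_id`, `updated_at`, `customer_sk`) and a keep-latest rule, which corresponds to overwrite-style (Type 1) behavior rather than SCD Type 2 history tracking.

```python
def build_dimension(silver_rows, business_key="customer_id"):
    """Deduplicate on the business key, then assign sequential surrogate keys."""
    latest = {}
    for row in silver_rows:
        k = row[business_key]
        # Keep only the most recently updated row per business key.
        if k not in latest or row["updated_at"] > latest[k]["updated_at"]:
            latest[k] = row
    ordered = sorted(latest.values(), key=lambda r: r[business_key])
    # Surrogate keys are plain integers, independent of the source system key.
    return [{"customer_sk": sk, **row} for sk, row in enumerate(ordered, start=1)]

# Hypothetical Silver rows: customer C1 appears twice.
rows = [
    {"customer_id": "C2", "name": "Bo", "updated_at": "2024-01-01"},
    {"customer_id": "C1", "name": "Ann", "updated_at": "2024-01-01"},
    {"customer_id": "C1", "name": "Ann Lee", "updated_at": "2024-02-01"},
]
dim = build_dimension(rows)
for member in dim:
    print(member)
```

If the dedup step were skipped, C1 would receive two surrogate keys and appear twice in any report that groups by the dimension, even though the report itself is correct.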
- A lakehouse table loaded by frequent micro-batches has become slow to query. Which maintenance operation most directly addresses the small-file problem?
- When eliminating distractors on a DP-700 tool-selection question, which scenario cue most strongly points away from a Spark notebook?
- An incremental load reports missing updated rows even though new inserts appear correctly. What is the most likely root cause to investigate first?
- Which combination correctly pairs each medallion layer with a typical Fabric implementation in a lakehouse-plus-warehouse design?