3.3 Scenario Practice for Ingest and transform data
Key Takeaways
- Streaming ingestion starts with an Eventstream sourced from Event Hubs, Kafka, IoT, or CDC, then routes to a lakehouse, eventhouse, or derived stream.
- Eventhouse (KQL DB) supports Direct ingestion or Event-processing-before-ingestion; KQL update policies transform rows on insert from Bronze to Silver.
- Windowing functions — tumbling, hopping, sliding, and session — aggregate streaming data over time, and the exam expects you to match the window type to the requirement.
- Spark structured streaming is the pro-code alternative to Eventstream for streaming transforms that need custom logic over micro-batches.
- Loading a dimensional model means denormalizing, generating surrogate keys, and handling slowly changing dimensions before writing Gold tables.
Scenario: Build a Real-Time Streaming Path
A retailer wants live order telemetry from Azure Event Hubs captured, lightly filtered, and landed for both interactive KQL analytics and a Delta lakehouse. The Fabric pattern is:
- Create an Eventstream and add Event Hubs as a source. Eventstream is Fabric's no-code service for capturing streaming events from Event Hubs, Kafka, Azure IoT, CDC sources, and sample data.
- Add stream operations — Filter, Manage fields, Aggregate, Join — directly on the eventstream canvas to shape events in flight.
- Route the result to destinations. To feed interactive analytics, add an Eventhouse (KQL Database) destination; to feed Spark and reporting, add a Lakehouse destination, which lands events as Delta rows.
When you add transformations and then need to reuse the shaped output, you create a derived stream — a destination representing the transformed default stream that other destinations or Activator can consume. This is the streaming equivalent of a Silver layer.
Eventstream supports many sources (Azure Event Hubs, Apache Kafka, Azure IoT Hub, CDC feeds from databases, Azure Service Bus, and sample data) and many destinations (Eventhouse, Lakehouse, derived stream, custom endpoint, and Fabric Activator for alerting). Routing one eventstream to multiple destinations is common: the same order feed can land raw in a lakehouse for batch reporting and flow to an eventhouse for live dashboards. In Real-Time Intelligence you also decide between a native Eventhouse table and a OneLake shortcut: native tables give the best KQL query performance, while shortcuts avoid copying data.
When a shortcut's query speed matters, Query acceleration for OneLake shortcuts caches and indexes the shortcut data to approach native-table performance at extra cost.
Eventhouse Ingestion Modes and KQL
When routing to an Eventhouse, you choose one of two ingestion modes:
| Mode | Behavior | Use when |
|---|---|---|
| Direct ingestion | Events land in the KQL table unprocessed | You want raw Bronze telemetry and will transform later in KQL |
| Event processing before ingestion | Eventstream shapes events first, then ingests | You want fields filtered/renamed before they hit the table |
Inside the eventhouse, KQL update policies push a Bronze→Silver transform: when rows arrive in a source table, the update policy runs a KQL query and writes the result into a target table automatically. This is the real-time analog of a notebook moving data between medallion layers.
KQL is also where you query high-volume, log-style, semi-structured telemetry interactively. On the exam, KQL is the correct transformation language only when data lives in an eventhouse — not in a lakehouse or warehouse.
Choosing a Window Function
Streaming aggregations need a time window. The exam tests matching the window type to the requirement:
- Tumbling — fixed-size, non-overlapping buckets (e.g., total orders per 1-minute bucket). Each event belongs to exactly one window.
- Hopping — fixed-size windows that overlap by a hop interval (e.g., a 5-minute window advancing every 1 minute) — used for moving averages.
- Sliding — emits a window whenever an event enters or leaves, so windows are event-driven rather than clock-driven.
- Session — groups bursts of activity separated by a gap of inactivity (e.g., user events with a 30-second idle gap end the session).
For non-overlapping periodic totals, choose tumbling. For smoothed moving metrics, choose hopping. For grouping bursts of user activity, choose session. You implement windowing either in the Eventstream SQL operator or in Spark structured streaming, which processes micro-batches with the same window semantics when you need custom pro-code logic.
Scenario: Prepare Data for a Dimensional Model
Gold-layer tables usually feed a star schema semantic model, so Silver→Gold work involves shaping facts and dimensions:
- Denormalize — flatten normalized source tables into wide dimension tables (e.g., combine
Product,Category,Subcategoryinto oneDimProduct). - Generate surrogate keys — assign stable integer keys to dimension rows so the model is independent of source business keys.
- Handle slowly changing dimensions (SCD) — decide Type 1 (overwrite) versus Type 2 (add a new versioned row with effective dates) for changing attributes.
- Group and aggregate — pre-compute fact aggregates that reports consume.
- Handle duplicate, missing, and late-arriving data — deduplicate on business keys, impute or flag missing values, and reconcile late events against the right window.
For complex SCD Type 2 logic over large fact tables, a Spark notebook is the right tool; a Dataflow handles lighter reshaping for analysts.
Worked example: dedup and late data in PySpark
Suppose Bronze holds order events that can arrive twice (at-least-once delivery) and occasionally late. A typical Silver notebook in PySpark would: read the Bronze Delta table, dropDuplicates(["OrderId", "EventTime"]) to remove exact repeats, keep the latest version per business key with a window ranked by EventTime, cast types, and MERGE into the Silver Delta table so re-runs are idempotent. The same logic in Spark SQL uses ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY EventTime DESC) and a MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED.
The exam may show you a fragment and ask which technique deduplicates or upserts — recognize that MERGE (not plain INSERT) is what makes a load idempotent and handles late corrections to existing keys. For purely additive, no-logic movement you would not write any of this; you would use Copy. The presence of dedup, surrogate keys, SCD, or upsert logic in the requirement is the signal that you have crossed from ingestion into transformation and need a notebook or Dataflow.
A dashboard needs the count of transactions for each non-overlapping one-minute period from a streaming source. Which windowing function should the streaming query use?
Events are routed to an Eventhouse and must be transformed automatically from a raw source table into a cleaned target table as they arrive. Which feature accomplishes this inside the eventhouse?
Which Eventstream feature represents a transformed version of the default stream that other destinations or Activator can subscribe to after filter/aggregate operations are applied?