4.3 Scenario Practice for Monitor and optimize an analytics solution
Key Takeaways
- OPTIMIZE bin-compacts small Parquet files into target-sized files; Fabric applies bin-compaction, then Z-Order, then V-Order.
- V-Order is write-time sorting/encoding that speeds read-heavy queries; Z-Order co-locates values to improve data skipping.
- VACUUM removes unreferenced files older than the retention threshold; the default and minimum safe threshold is 7 days.
- Fabric Warehouse auto-creates and auto-refreshes histogram, average-column-length, and cardinality statistics for the optimizer.
- Query insights (queryinsights schema) retains 30 days of historical query data; DMVs give live session/request status.
Delta table maintenance scenarios
Most lakehouse performance questions reduce to Delta table maintenance. Frequent small writes (streaming, incremental loads, many small files) produce the small-file problem, which slows reads. The fix is OPTIMIZE, which groups small Parquet files into bins targeting an ideal file size and rewrites them. Fabric applies operations in a fixed order: bin-compaction first, then Z-Order, then V-Order.
- V-Order is a write-time optimization (sorting, encoding, compression) of the Parquet layout that accelerates read-heavy queries, especially Power BI Direct Lake. It is applied during OPTIMIZE or on write.
- Z-Order co-locates related column values in the same files so queries that filter on those columns skip more data. Combine Z-Order with V-Order in one command for filtered analytical workloads.
- VACUUM deletes files no longer referenced by the Delta log that are older than the retention threshold. The default is seven days, and Fabric rejects retention intervals under seven days by default to protect time-travel and concurrent readers.
You can run these from a Spark SQL notebook, a Spark job definition, or the Lakehouse table maintenance UI (which exposes OPTIMIZE with optional V-Order and VACUUM together).
Warehouse and query performance scenarios
For the Fabric Warehouse and the SQL analytics endpoint, the engine automatically creates statistics whenever the optimizer needs them and auto-refreshes stale ones. Supported automatic statistics are histogram, average column length, and table cardinality. You can still issue CREATE STATISTICS / UPDATE STATISTICS for single-column histograms when you want manual control — useful when a load reshapes data and you want fresh statistics before a critical query.
To diagnose slow queries, use query insights: autogenerated views under the queryinsights schema retain 30 days of historical query data and surface long-running and frequently-run queries. For live activity — current connections, sessions, and requests — query the dynamic management views (DMVs) such as sys.dm_exec_requests. Warehouse-level tuning also favors avoiding many small inserts (batch loads), choosing appropriate data types, and letting the distributed engine parallelize.
| Symptom | Likely remedy |
|---|---|
| Many tiny Parquet files, slow scans | OPTIMIZE (bin-compaction) |
| Filtered queries read too much data | Z-Order on the filter columns |
| Slow Direct Lake / read-heavy reports | V-Order |
| Storage bloat from old versions | VACUUM (>= 7-day retention) |
| Stale plan estimates in Warehouse | UPDATE STATISTICS / rely on auto-stats |
| Find the worst historical query | queryinsights views (30 days) |
Spark configuration scenarios
Spark performance questions hinge on compute configuration. Starter pools give 5-10 second session startup with always-on medium nodes. Custom pools let you set node size and autoscale min/max nodes, while dynamic executor allocation lets a running application request more executors within bounds when tasks exceed current capacity. For many notebooks sharing the same lakehouse, environment, and Spark config, enable High Concurrency mode so they share one session and avoid repeated startup overhead.
The Native Execution Engine (NEE) boosts performance by running operations in a vectorized C++ engine; it does not support UDFs and automatically falls back to the JVM for them. Read the stem: 'cold-start latency' points to starter/high-concurrency, 'variable load' points to autoscale plus dynamic allocation, and 'raw throughput on standard operations' points to the Native Execution Engine.
Partitioning and OneLake optimization scenarios
Partitioning is a double-edged tool. Partitioning a Delta table by a low-cardinality column that queries filter on (for example year or region) prunes whole folders and speeds reads. But over-partitioning by a high-cardinality column re-creates the small-file problem — thousands of tiny partitions, each with tiny files. The exam tests this judgment: partition on columns that are frequently filtered and low-cardinality, and otherwise rely on Z-Order for data skipping within larger files.
For OneLake and the lakehouse broadly, the optimization levers are: keep files at a healthy size (run OPTIMIZE after sizable writes), apply V-Order for read-heavy Direct Lake consumption, VACUUM on a sensible schedule to control storage and version sprawl, and use shortcuts to avoid copying data into OneLake unnecessarily (reducing duplication and ingestion CU). A Direct Lake semantic model reads Parquet directly from OneLake, so V-Ordered, well-compacted tables translate straight into faster reports.
Scenario reading method
Approach every performance scenario the same way: identify the layer (lakehouse Delta, Warehouse SQL, or Spark compute), find the bottleneck cue in the stem (small files, missing skipping, stale stats, cold start, executor pressure), and map it to the single matching lever. When two answers look plausible, the better one targets the exact bottleneck the stem describes rather than a generic 'make it bigger' fix.
Scheduling maintenance sensibly
Maintenance itself consumes CU, so timing matters. Run OPTIMIZE after meaningful write volume (for example after a daily batch load or once a streaming table accumulates many small files), not on every micro-batch. Schedule VACUUM periodically with at least the 7-day default retention so time travel and concurrent readers stay safe. For tables consumed by Direct Lake, prioritize V-Order so the most-read tables render reports fastest.
The exam may ask which maintenance to run and when — the best answers compact and V-Order read-heavy tables on a cadence proportional to write activity, while reserving VACUUM for controlled storage cleanup rather than constant execution.
A streaming load has produced thousands of tiny Parquet files in a Delta table, and analytical scans have become slow. Which maintenance command directly addresses the small-file problem?
Direct Lake reports on a large Delta table are read-heavy and you want the Parquet layout optimized so reads are as fast as possible. Which optimization is designed for this read-heavy scenario?
You must identify the slowest and most frequently executed queries that ran against a Fabric Warehouse over the past few weeks. Which feature should you query?