2.11 Azure Data Services and Analytics
Key Takeaways
- Azure Synapse Analytics unifies data warehousing (MPP SQL pools) and big data (Spark pools) with built-in pipelines and Power BI integration.
- Azure Data Factory is the cloud ETL/ELT pipeline orchestrator with 90+ connectors and a visual designer.
- Azure Databricks is an Apache Spark platform for collaborative data engineering and machine learning, with Delta Lake and MLflow.
- Azure HDInsight is managed open-source big data (Hadoop, Spark, Hive, Kafka); Stream Analytics handles real-time/IoT streams.
- Data Lake Storage Gen2 adds a hierarchical namespace and POSIX ACLs on top of Blob Storage as the foundation for analytics.
Quick Answer: Synapse Analytics = unified data warehouse + big data. Data Factory = ETL/ELT pipelines. Databricks = Apache Spark for engineering and machine learning. HDInsight = managed open-source big data (Hadoop/Spark/Kafka/Hive). Stream Analytics = real-time/IoT streams. Power BI = dashboards. Data Lake Storage Gen2 = analytics-optimized storage. AZ-900 only tests the one-line purpose of each — match the keyword, do not memorize internals.
How the Pieces Fit Together
A typical analytics pipeline reads like a sentence: raw data lands in Data Lake Storage Gen2, Data Factory moves and transforms it, Synapse or Databricks processes and models it, and Power BI visualizes the result. Knowing that flow makes the "which service" questions easy.
The AZ-900 objective treats these as descriptive knowledge — you are asked to recognize what each tool is for, not how to configure clusters, write Spark code, or tune SQL pools. Expect questions phrased as a one-sentence scenario whose keyword ("pipeline," "warehouse," "real-time," "notebook," "dashboard") maps to exactly one service. The most common trap is the Synapse-vs-Databricks overlap, covered below.
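The keyword-matching habit described above can be sketched as a small lookup table. This is only a study aid, not an Azure API: the `KEYWORD_TO_SERVICE` dictionary and the `match_service` helper are hypothetical names, and the keywords are the five cues quoted in the paragraph above.

```python
# Hypothetical study aid: maps the exam keywords from this section to a service.
KEYWORD_TO_SERVICE = {
    "pipeline": "Azure Data Factory",
    "warehouse": "Azure Synapse Analytics",
    "real-time": "Azure Stream Analytics",
    "notebook": "Azure Databricks",
    "dashboard": "Power BI",
}


def match_service(scenario):
    """Return every service whose keyword appears in the scenario text."""
    text = scenario.lower()
    return [svc for kw, svc in KEYWORD_TO_SERVICE.items() if kw in text]


print(match_service("Build a nightly pipeline that copies sales data"))
# prints ['Azure Data Factory']
```

A scenario that returns more than one service (for example, one mentioning both "warehouse" and "notebook") is a sign that you are looking at the Synapse-vs-Databricks overlap and need to read the rest of the sentence.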
Azure Synapse Analytics
Azure Synapse Analytics (the successor to Azure SQL Data Warehouse) is an integrated analytics service that brings warehousing and big data into one workspace:
| Component | Purpose |
|---|---|
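| Dedicated and serverless SQL pools | SQL queries over structured data: dedicated pools use massively parallel processing (MPP) for warehousing; serverless pools query files in the data lake on demand |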
| Apache Spark pools | Big-data processing and machine learning |
| Pipelines | Built-in Data Factory integration for data movement |
| Synapse Studio | One UI for SQL, Spark, and pipeline development |
| Power BI link | Direct visualization of warehouse data |
Reach for Synapse when a question says "enterprise data warehouse," "unified analytics," or "combine warehousing and big data in one place."
Azure Data Factory
Azure Data Factory (ADF) is the cloud ETL/ELT (Extract-Transform-Load / Extract-Load-Transform) orchestrator. It does not store data itself; it moves and transforms data on a schedule, on demand, or in response to an event.
- 90+ connectors to cloud and on-premises sources
- A visual, drag-and-drop pipeline designer (no code required for many flows)
- Built-in scheduling, triggers, monitoring, and alerting
Whenever the keyword is "pipeline," "ETL," "orchestrate data movement," or "copy data from many sources," the answer is Data Factory.
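The difference between ETL and ELT is only the order of the last two steps. The sketch below illustrates that ordering with plain Python lists and dictionaries standing in for data stores; it does not use Data Factory itself, and every name in it (`extract`, `transform`, `etl`, `elt`, the sample rows) is an assumption made for illustration.

```python
def extract(source):
    """Read raw rows from a source (here just an in-memory list)."""
    return list(source)


def transform(rows):
    """Clean rows: drop records with no amount and upper-case the region."""
    return [
        {"region": r["region"].upper(), "amount": r["amount"]}
        for r in rows
        if r["amount"] is not None
    ]


def etl(source, warehouse):
    """ETL: transform in flight, so only cleaned rows reach the destination."""
    warehouse.extend(transform(extract(source)))


def elt(source, lake):
    """ELT: land the raw rows first, then transform inside the destination."""
    lake["raw"] = extract(source)
    lake["curated"] = transform(lake["raw"])


source = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": None},
]
warehouse = []
lake = {}
etl(source, warehouse)
elt(source, lake)

print(warehouse)                               # [{'region': 'EAST', 'amount': 10}]
print(len(lake["raw"]), len(lake["curated"]))  # 2 1
```

In the ELT case the raw copy is kept alongside the curated one, which is the usual reason to prefer it when the destination is a data lake.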
Azure Databricks
Azure Databricks is a collaborative, Apache Spark-based analytics platform that Microsoft offers in partnership with Databricks:
- Collaborative notebooks shared by data engineers and data scientists
- Auto-scaling Spark clusters that grow and shrink with the job
- Delta Lake for ACID transactions on a data lake
- MLflow for tracking experiments and managing machine-learning models
The distinguishing keywords are "Spark," "collaborative notebooks," and "machine learning." Note the overlap trap: both Synapse and Databricks can run Spark, but "collaborative data science / ML platform" leans Databricks, while "single workspace that also has an MPP SQL data warehouse" leans Synapse.
HDInsight, Stream Analytics, and Power BI
- Azure HDInsight is a fully managed service for open-source big-data frameworks — Hadoop, Spark, Hive, Kafka, HBase. Pick it when a question names a specific open-source project to run as-is.
- Azure Stream Analytics processes real-time data streams (often from IoT devices or Event Hubs) with SQL-like queries. Keyword: "real-time" or "streaming."
- Power BI is the business-intelligence visualization layer — interactive reports and dashboards. Keyword: "dashboard," "report," "visualize."
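To make the Stream Analytics bullet more concrete, here is a rough Python sketch of the kind of question a streaming query typically answers: an average per device over fixed, non-overlapping time windows. It is not Stream Analytics query syntax and it runs over a finished list rather than a live stream; the `tumbling_average` function, the event fields, and the 10-second window are all assumptions chosen for illustration.

```python
from collections import defaultdict


def tumbling_average(events, window_seconds):
    """Average 'value' per device over fixed, non-overlapping time windows."""
    sums = defaultdict(lambda: [0.0, 0])  # (device, window_start) -> [total, count]
    for e in events:
        # Integer division snaps each timestamp to the start of its window.
        window_start = (e["ts"] // window_seconds) * window_seconds
        key = (e["device"], window_start)
        sums[key][0] += e["value"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


events = [
    {"device": "sensor-1", "ts": 0, "value": 20.0},
    {"device": "sensor-1", "ts": 4, "value": 22.0},
    {"device": "sensor-1", "ts": 11, "value": 30.0},
]
print(tumbling_average(events, 10))
# prints {('sensor-1', 0): 21.0, ('sensor-1', 10): 30.0}
```

For AZ-900 you only need the idea: results are produced continuously as events arrive, rather than after a nightly batch load.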
Azure Data Lake Storage Gen2
Data Lake Storage Gen2 is not a separate product but a set of capabilities layered onto Blob Storage:
- Hierarchical namespace — real directories and subdirectories with fast atomic folder operations (renames, deletes), instead of a flat blob list.
- POSIX-compatible ACLs — fine-grained access control on individual files and folders.
- Inherits Blob features: access tiers, lifecycle management, and the LRS-through-GZRS redundancy options.
- Optimized to feed analytics engines such as Synapse, Databricks, and HDInsight.
Why the hierarchical namespace matters: on a flat blob store, a "folder" is only a shared name prefix, so "renaming a folder" actually means copying every object underneath it to a new name and deleting the original, which is slow and expensive at petabyte scale. Gen2's true directory structure makes that a single fast metadata operation, which is why analytics engines that constantly read and reorganize partitioned datasets run more efficiently against it. The exam phrasing to watch for is "big data analytics storage with a hierarchical namespace"; that points squarely at Data Lake Storage Gen2 rather than plain Blob Storage.
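The cost difference can be seen in a toy model. The sketch below does not call any Azure SDK; it uses an ordinary dictionary as a stand-in for a flat blob container and a nested dictionary as a stand-in for a hierarchical namespace. The function names, paths, and the count of 1,000 files are hypothetical.

```python
def rename_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: a 'folder' is a name prefix, so every blob is rewritten."""
    operations = 0
    for name in list(blobs):  # snapshot the keys because the dict is mutated
        if name.startswith(old_prefix):
            blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
            operations += 1  # stands in for one copy plus one delete
    return operations


def rename_hierarchical(tree, old_name, new_name):
    """Hierarchical namespace: re-point one directory entry; children untouched."""
    tree[new_name] = tree.pop(old_name)
    return 1


flat = {f"raw/2024/part-{i}.parquet": b"" for i in range(1000)}
tree = {"raw": {"2024": {f"part-{i}.parquet": b"" for i in range(1000)}}}

print(rename_flat(flat, "raw/", "staged/"))         # 1000
print(rename_hierarchical(tree, "raw", "staged"))   # 1
```

The flat rename grows with the number of objects under the prefix, while the hierarchical rename stays at one operation regardless of how much data sits below the directory.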
Quick Reference
| Service | Category | One-line purpose |
|---|---|---|
| Synapse Analytics | Unified analytics | Warehouse + big data in one workspace |
| Data Factory | ETL/ELT | Pipeline orchestration and data movement |
| Databricks | Spark analytics | Collaborative data science and ML |
| HDInsight | Managed open-source | Hadoop, Spark, Kafka, Hive workloads |
| Stream Analytics | Real-time | IoT and streaming data processing |
| Power BI | Visualization | Dashboards and BI reports |
| Data Lake Storage Gen2 | Storage | Analytics-optimized big-data storage |
On the Exam: You are not asked to configure these services — only to recognize their purpose. If you can finish the sentence "___ is used for ___" for each row above, you are ready for this objective.
Which service provides a single workspace that combines an MPP SQL data warehouse with Apache Spark big-data processing?
A team needs to copy and transform data from 12 different on-premises and cloud sources on a nightly schedule, using a visual designer rather than code. Which service fits?
A factory streams sensor readings continuously and needs them analyzed in real time as they arrive. Which Azure service is designed for this?
What capability does Data Lake Storage Gen2 add on top of standard Blob Storage to make it better for big-data analytics?