5.2 Azure Synapse Analytics
Key Takeaways
- A Synapse workspace bundles dedicated SQL pool, serverless SQL pool, Apache Spark pool, and pipelines, all managed in Synapse Studio.
- A dedicated SQL pool (formerly Azure SQL Data Warehouse) is an MPP warehouse billed in DWUs and can be paused to save compute cost.
- A serverless SQL pool charges per terabyte of data processed and queries files in ADLS Gen2 directly without loading them.
- Spark pools auto-scale and auto-pause, and are the right choice for Python, Scala, or .NET notebooks on big data.
- Synapse pipelines are an embedded subset of Azure Data Factory, sharing the same activities, triggers, and integration runtimes.
Azure Synapse Analytics
Azure Synapse Analytics is Microsoft's unified analytics service that brings data warehousing, big-data processing, data integration, and serverless SQL into a single workspace. On the DP-900, expect questions that ask you to pick the right compute option for a scenario, so focus on what each pool actually does and how it bills.
Quick Answer: A Synapse workspace contains four kinds of compute — dedicated SQL pool (provisioned MPP warehouse), serverless SQL pool (pay-per-query T-SQL over files), Apache Spark pool (notebooks and big-data jobs), and pipelines (the Data Factory ETL engine embedded inside Synapse) — all managed from Synapse Studio.
The Synapse Workspace
A Synapse workspace is the top-level resource. Inside it you get:
- A default Azure Data Lake Storage Gen2 account that acts as the workspace's primary storage.
- One or more compute pools (you add them as needed).
- Synapse Studio, a browser-based IDE for writing T-SQL, authoring Spark notebooks, building pipelines, and managing security — all without leaving the portal.
- Integrated Azure Active Directory (Microsoft Entra ID) authentication and Azure RBAC.
Dedicated SQL Pool (formerly SQL Data Warehouse)
The dedicated SQL pool is the modern name for what used to be sold as Azure SQL Data Warehouse. It is a massively parallel processing (MPP) engine: a control node breaks each T-SQL query into smaller pieces and runs them across 60 distributions on multiple compute nodes.
Key facts:
- Billed by Data Warehouse Units (DWUs) — a blended measure of CPU, memory, and I/O. You provision a level such as DW100c, DW1000c, DW3000c.
- You can pause the pool to stop compute charges (storage continues to bill).
- Supports table distribution strategies: hash-distributed for large fact tables, round-robin for staging tables, and replicated for small dimension tables.
- Uses columnstore indexes by default for compression and analytic speed.
- Best for predictable, high-volume BI workloads up to petabytes.
Serverless SQL Pool
Every workspace ships with a serverless SQL pool ("Built-in") that is always on but charges only when you run a query. It is on-demand: you pay per TB of data processed, with no provisioning to manage.
Use serverless SQL when you want to:
- Run ad-hoc T-SQL queries directly over Parquet, CSV, JSON, or Delta files in ADLS Gen2 without copying them into a database.
- Create external tables and views so BI tools can query lake files using familiar T-SQL.
- Build a logical data warehouse that joins files across multiple storage accounts.
You cannot use serverless SQL pool to ingest or persist data into managed tables — it is read-mostly query compute.
Apache Spark Pool
A Spark pool is a managed Apache Spark cluster you can spin up inside the workspace for Python (PySpark), Scala, .NET, and SQL workloads. Spark pools are the right choice when:
- You need to process unstructured or semi-structured data at scale.
- Data scientists want notebooks to combine Spark SQL, ML libraries, and visualizations.
- You are building data engineering pipelines that run heavy joins, machine learning, or graph processing.
Spark pools auto-pause after a configurable idle period and auto-scale executors within a defined min/max range, so you pay only for what runs.
Synapse Pipelines
Synapse pipelines are the data integration engine inside the workspace. They use the same activities, datasets, triggers, and integration runtimes as Azure Data Factory — in fact, Synapse pipelines are a subset of ADF re-hosted inside Synapse Studio. You can copy data, orchestrate Spark notebooks, run stored procedures on the dedicated SQL pool, and call REST endpoints from a single pipeline.
Synapse Studio
Synapse Studio is the web UI where developers, analysts, and engineers collaborate. Tabs include:
- Data — browse linked storage, lake files, and SQL/Spark databases.
- Develop — write T-SQL scripts, Spark notebooks, KQL scripts, and Power BI reports.
- Integrate — build pipelines and data flows.
- Monitor — track pipeline runs, SQL requests, Spark applications.
- Manage — configure pools, linked services, security, and Git integration.
Choosing the Right Compute Option
| Scenario | Best Synapse compute |
|---|---|
| 24/7 multi-TB enterprise data warehouse with predictable cost | Dedicated SQL pool |
| One-off T-SQL exploration of CSV/Parquet in the lake | Serverless SQL pool |
| Python/Scala ML notebooks on huge datasets | Apache Spark pool |
| Orchestrate copy and transformation jobs across systems | Synapse pipelines |
| Federated SQL over Cosmos DB analytical store | Serverless SQL pool with Synapse Link |
Synapse Link
Azure Synapse Link creates a near-real-time replica of operational data (Cosmos DB, Dataverse, SQL Server 2022) into the Synapse workspace, so you can run analytics on transactional data without ETL and without impacting the source database. The classic exam scenario is "run analytics on Cosmos DB without affecting transactional performance" — the answer is Synapse Link.
Synapse vs Microsoft Fabric
Synapse is still fully supported, but Microsoft positions Microsoft Fabric as the strategic successor that unifies Synapse, Data Factory, and Power BI under one SaaS platform with OneLake at its core. For DP-900 you should know both: Synapse for current Azure-resource customers, Fabric for the unified next-generation platform.
A data engineer needs to run T-SQL queries on Parquet files in Azure Data Lake Storage Gen2. The queries are infrequent and the team does not want to provision or pay for idle compute. Which Synapse component fits best?
Which statement about a dedicated SQL pool in Azure Synapse Analytics is TRUE?