2.11 Azure Data Services and Analytics
Key Takeaways
- Azure Synapse Analytics is an integrated analytics service combining big data and data warehousing for end-to-end analytics.
- Azure Data Factory is a cloud-based ETL (Extract, Transform, Load) service for data integration and pipeline orchestration.
- Azure Databricks provides an Apache Spark-based analytics platform for big data processing and machine learning.
- Azure HDInsight is a managed open-source analytics service supporting Hadoop, Spark, Hive, Kafka, and more.
- Azure Data Lake Storage Gen2 combines Blob Storage scalability with a hierarchical file system for big data analytics.
Quick Answer: Synapse Analytics = unified analytics (data warehouse + big data). Data Factory = ETL pipelines. Databricks = Spark-based analytics. HDInsight = managed Hadoop/Spark. Data Lake = big data storage with hierarchical namespace.
Azure Synapse Analytics
Azure Synapse Analytics (formerly Azure SQL Data Warehouse) is an integrated analytics service that brings big data processing and data warehousing together in a single workspace:
| Capability | Description |
|---|---|
| SQL pools | Massively parallel processing (MPP) SQL queries on structured data |
| Spark pools | Apache Spark for big data processing and machine learning |
| Pipelines | Data integration (built-in Data Factory capabilities) |
| Studio | Unified workspace for SQL, Spark, and pipeline development |
| Power BI integration | Direct integration for visualization |
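To make the "massively parallel processing" idea behind SQL pools more concrete, here is a minimal sketch in plain Python. It is a toy model, not how Synapse is implemented: rows are hash-distributed across a few simulated nodes, each node computes a partial aggregate, and the partial results are merged. The function names (`distribute`, `partial_sum`, `mpp_sum`) and the sample data are hypothetical.

```python
from collections import defaultdict
from zlib import crc32


def distribute(rows, key, n_nodes):
    """Assign each row to a simulated node by hashing its distribution key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
        node_id = crc32(str(row[key]).encode()) % n_nodes
        nodes[node_id].append(row)
    return nodes


def partial_sum(node_rows, group_key, value_key):
    """Aggregate only the rows that live on one node."""
    totals = defaultdict(int)
    for row in node_rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def mpp_sum(rows, group_key, value_key, n_nodes=4):
    """Distribute rows, aggregate per node, then merge the partial results."""
    nodes = distribute(rows, group_key, n_nodes)
    partials = [partial_sum(n, group_key, value_key) for n in nodes]
    merged = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            merged[group] += value
    return dict(merged)


# Hypothetical sample data, for illustration only.
sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
    {"region": "north", "amount": 75},
]
totals = mpp_sum(sales, "region", "amount")
```

In a real MPP engine the per-node step runs in parallel on separate compute nodes; the sketch runs it sequentially to keep the example self-contained.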
Azure Data Factory
Azure Data Factory is a cloud-based data integration service for creating ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines.
Key features:
- 90+ connectors — Connect to cloud and on-premises data sources
- Visual pipeline designer — Drag-and-drop interface for building data flows
- Scheduling — Trigger pipelines on a schedule, on-demand, or based on events
- Monitoring — Built-in pipeline monitoring and alerting
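The difference between ETL and ELT is only the order of the steps, which a short sketch can show. The code below is plain Python, not the Data Factory SDK, and every name in it (`extract`, `transform`, `run_etl`, `run_elt`, the sample rows) is a hypothetical stand-in: a list plays the role of the destination store.

```python
def extract():
    """Return raw rows; a real pipeline would read these through a connector."""
    return [
        {"name": " Ada ", "score": "91"},
        {"name": "Linus", "score": "78"},
    ]


def transform(rows):
    """Clean the raw rows: trim whitespace and convert scores to integers."""
    return [{"name": r["name"].strip(), "score": int(r["score"])} for r in rows]


def run_etl(sink):
    """ETL: transform in the pipeline, then load the cleaned rows."""
    sink.extend(transform(extract()))


def run_elt(sink):
    """ELT: load the raw rows first, then transform them inside the destination."""
    sink.extend(extract())
    sink[:] = transform(sink)


etl_sink, elt_sink = [], []
run_etl(etl_sink)
run_elt(elt_sink)
```

Both paths end with the same cleaned data. What changes is where the transformation work happens: in the pipeline before loading (ETL) or in the destination after loading (ELT).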
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform for collaborative data engineering, data science, and machine learning. Key features:
- Collaborative notebooks — Data scientists and engineers work together
- Auto-scaling clusters — Scale Spark clusters automatically
- MLflow integration — Track experiments and manage ML models
- Delta Lake — ACID transactions on data lakes
Azure Data Lake Storage Gen2
Data Lake Storage Gen2 builds on Azure Blob Storage and provides:
- Hierarchical namespace — File system semantics with directories and subdirectories
- POSIX-compatible ACLs — Fine-grained access control on directories and files
- Blob Storage features — Tiering, lifecycle management, redundancy options
- Optimized for analytics — Works with Synapse, Databricks, HDInsight, and other analytics engines
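As a rough illustration of the hierarchical namespace, the sketch below builds the `abfss://` URI that analytics engines commonly use to address a path in a Data Lake Storage Gen2 account, and lists the real directories that sit above a file. The account name `contosolake`, the file system name `raw`, the sample path, and both helper functions are assumptions made for this example.

```python
from pathlib import PurePosixPath


def abfss_uri(account, filesystem, path):
    """Build an abfss:// URI for a path inside a Gen2 file system (container)."""
    clean = PurePosixPath("/") / path.lstrip("/")
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net{clean}"


def parent_directories(path):
    """List the directories above a file, nearest first, excluding the root."""
    parents = PurePosixPath("/" + path.lstrip("/")).parents
    return [str(p) for p in parents if str(p) != "/"]


# Hypothetical account, file system, and path.
uri = abfss_uri("contosolake", "raw", "sales/2024/01/orders.parquet")
dirs = parent_directories("sales/2024/01/orders.parquet")
```

With a hierarchical namespace, each entry in `dirs` is an actual directory object that can carry its own ACL, rather than just a shared prefix in blob names.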
Data Service Quick Reference
| Service | Type | Best For |
|---|---|---|
| Synapse Analytics | Unified analytics | Enterprise data warehousing + big data |
| Data Factory | ETL/ELT | Data pipeline orchestration |
| Databricks | Spark analytics | Collaborative data science and ML |
| HDInsight | Managed open-source | Hadoop, Spark, Kafka, Hive workloads |
| Data Lake Storage Gen2 | Storage | Big data analytics storage |
| Stream Analytics | Real-time analytics | IoT and streaming data processing |
| Power BI | Visualization | Business intelligence dashboards |
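For self-testing, the table above can be turned into a small lookup. This is only a study aid sketched in Python; the dictionary keys are the "Type" column in lowercase, and `pick_service` is a hypothetical helper name.

```python
# Mirrors the "Type" and "Service" columns of the quick reference table.
SERVICE_BY_TYPE = {
    "unified analytics": "Synapse Analytics",
    "etl/elt": "Data Factory",
    "spark analytics": "Databricks",
    "managed open-source": "HDInsight",
    "storage": "Data Lake Storage Gen2",
    "real-time analytics": "Stream Analytics",
    "visualization": "Power BI",
}


def pick_service(workload_type):
    """Return the service for a workload type, or 'unknown' if it is not in the table."""
    return SERVICE_BY_TYPE.get(workload_type.strip().lower(), "unknown")


answer = pick_service("ETL/ELT")
```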
On the Exam: You do not need deep knowledge of these analytics services. Know the HIGH-LEVEL purpose of each: Synapse = unified analytics, Data Factory = ETL pipelines, Databricks = Spark + ML, Data Lake = big data storage.
Which Azure service provides a unified analytics workspace combining data warehousing and big data?
Which Azure service is specifically designed for creating ETL/ELT data pipelines?