7.3 Analytics Services — Athena, OpenSearch, and QuickSight
Key Takeaways
- Amazon Athena runs standard SQL directly on S3 data with no infrastructure and bills $5 per TB of data scanned, so cost is controlled by scanning less data, not by sizing servers.
- Converting data to columnar Parquet or ORC, partitioning by a query predicate (such as date), and compressing files can cut Athena's scanned bytes by 30–90 percent and the bill with it.
- Amazon OpenSearch Service (successor to Elasticsearch) is the answer for full-text search, log analytics, and dashboards; ingest logs via Kinesis Data Firehose or CloudWatch Logs subscription filters.
- Amazon QuickSight is serverless business intelligence with the in-memory SPICE engine and per-session pricing, ideal for embedded dashboards and occasional viewers.
- Match the workload: Athena for ad-hoc SQL on a data lake, OpenSearch for search and log analytics, QuickSight for visual BI, Redshift for a structured data warehouse, EMR for big-data ETL.
Amazon Athena: Serverless SQL on S3
Amazon Athena is a serverless, interactive query service that runs standard SQL (Trino/Presto engine) directly against data in Amazon S3. There is nothing to provision; you define a table schema in the AWS Glue Data Catalog and query immediately. Athena is the default answer whenever a question says "run SQL on data already in S3" or "analyze CloudTrail / VPC Flow Logs / ALB access logs without a database."
Pricing is the load-bearing fact: $5 per terabyte of data scanned. There are no compute hours and no idle charge, so the only lever is reducing bytes read.
| Cost-reduction technique | Effect on bytes scanned |
|---|---|
| Columnar format (Parquet or ORC) | Reads only needed columns; 30–90% less |
| Partition by query predicate (e.g., date) | Skips irrelevant partitions entirely |
| Compress files (Snappy, gzip) | Fewer bytes per row read |
Avoid SELECT *; project specific columns | Reads fewer columns |
Worked example: A 10 TB raw CSV log queried with SELECT * scans the full 10 TB ($50). Converting to date-partitioned Parquet and selecting three columns for one day might scan 5 GB instead, dropping the same query to a few cents.
Amazon OpenSearch Service: Search and Log Analytics
Amazon OpenSearch Service (the successor to the former Amazon Elasticsearch Service) provides full-text search with relevance scoring, log and trace analytics, and OpenSearch Dashboards (the Kibana successor) for visualization. It is the exam's answer for "search application logs and build dashboards," "full-text search with relevance ranking," or "real-time observability."
There are two flavors: provisioned domains (you choose instance types and node counts, scaling to petabytes) and OpenSearch Serverless (auto-scaling collections, no node management). Choose serverless when the question stresses unpredictable load and no capacity planning.
Typical ingestion architectures the exam expects you to recognize:
CloudWatch Logs --(subscription filter)--> Lambda --> OpenSearch
VPC Flow Logs / App Logs --> Kinesis Data Firehose --> OpenSearch
| OpenSearch strength | When it is the answer |
|---|---|
| Full-text search + relevance | "search a product catalog", "fuzzy text search" |
| Log / trace analytics | "centralized log analysis with dashboards" |
| Near-real-time indexing | "observe events seconds after they occur" |
Trap: if the requirement is plain SQL on S3 with no search and no dashboards, that is Athena, not OpenSearch. OpenSearch shines specifically at text relevance and operational log exploration.
Amazon QuickSight and Choosing the Right Tool
Amazon QuickSight is AWS's serverless business intelligence (BI) service for interactive dashboards and visualizations. Its in-memory engine, SPICE (Super-fast, Parallel, In-memory Calculation Engine), pre-loads data for fast charting without hammering the source. QuickSight connects to S3, Athena, Redshift, RDS, and on-premises databases, supports per-session pricing for occasional readers, and can embed dashboards in your own applications. ML Insights adds anomaly detection and forecasting.
QuickSight is the answer for "business users need interactive dashboards/reports" or "embed analytics in our app and pay only for active viewers." It is a presentation layer, not a query engine; it reads from sources like Athena or Redshift.
Pick-the-service matrix
| Need | Service |
|---|---|
| Ad-hoc SQL on S3, pay per scan | Athena |
| Full-text search + log dashboards | OpenSearch |
| Business dashboards / visual BI | QuickSight |
| Structured data warehouse, complex joins | Redshift |
| Spark / Hadoop big-data ETL | EMR |
| Streaming SQL / Flink in real time | Managed Service for Apache Flink |
Combined pattern: a classic data-lake answer is S3 (Parquet) -> Athena (query) -> QuickSight (dashboard). Athena does the serverless querying, QuickSight visualizes, and neither requires a running cluster. If the question instead emphasizes sub-second text search over millions of log lines with relevance ranking, switch the query layer to OpenSearch.
The Glue Data Catalog and the Wider Analytics Family
Athena, Redshift Spectrum, and EMR all read table definitions from the AWS Glue Data Catalog, a central metadata store. A Glue crawler scans S3 and infers schema and partitions automatically, so "discover the schema of files in S3 without writing DDL" points to a Glue crawler feeding the catalog, which Athena then queries. AWS Glue itself is the serverless extract-transform-load (ETL) service for converting raw CSV/JSON into the partitioned Parquet that makes Athena cheap.
Know where each neighboring service fits so you can eliminate distractors:
- Amazon Redshift is a petabyte-scale columnar data warehouse for repeated, complex joins and aggregations over structured data; pick it over Athena when the same heavy queries run constantly and predictable performance matters more than pay-per-scan. Redshift Spectrum lets a Redshift cluster query S3 directly, blending warehouse and lake.
- Amazon EMR runs managed Apache Spark, Hadoop, Hive, and Presto for large-scale custom ETL and machine learning; choose it when the workload needs Spark code or framework control that Athena's SQL-only model cannot express.
- Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) runs continuous SQL or Flink over streaming data for real-time aggregations.
- Amazon Kinesis Data Firehose is the near-real-time loader that buffers streaming records and delivers them to S3, Redshift, or OpenSearch, often the ingestion arrow drawn in analytics architecture diagrams.
| Question signal | Service |
|---|---|
| "infer schema from S3 automatically" | Glue crawler + Data Catalog |
| "serverless ETL to Parquet" | AWS Glue |
| "constant complex BI queries, predictable perf" | Redshift |
| "Spark / Hadoop custom processing" | EMR |
| "real-time streaming aggregation" | Managed Service for Apache Flink |
A data analyst must run ad-hoc SQL queries on JSON and CSV files already sitting in S3, with no infrastructure to manage and pay only for what is queried. Which service fits?
An Athena query over a 10 TB raw CSV dataset is both slow and expensive. Which change most reduces the cost per query?
An operations team needs to centralize application logs, run full-text searches with relevance ranking, and build interactive log dashboards. Which service is the best fit?
Business users need interactive dashboards built on data already queryable through Athena, with fast rendering and billing only for the people who actually view reports. Which service should be added?