7.3 Analytics Services — Athena, OpenSearch, and QuickSight

Key Takeaways

  • Amazon Athena runs standard SQL directly on S3 data with no infrastructure and bills $5 per TB of data scanned, so cost is controlled by scanning less data, not by sizing servers.
  • Converting data to columnar Parquet or ORC, partitioning by a query predicate (such as date), and compressing files can cut Athena's scanned bytes by 30–90 percent and the bill with it.
  • Amazon OpenSearch Service (successor to Elasticsearch) is the answer for full-text search, log analytics, and dashboards; ingest logs via Kinesis Data Firehose or CloudWatch Logs subscription filters.
  • Amazon QuickSight is serverless business intelligence with the in-memory SPICE engine and per-session pricing, ideal for embedded dashboards and occasional viewers.
  • Match the workload: Athena for ad-hoc SQL on a data lake, OpenSearch for search and log analytics, QuickSight for visual BI, Redshift for a structured data warehouse, EMR for big-data ETL.
Last updated: June 2026

Amazon Athena: Serverless SQL on S3

Amazon Athena is a serverless, interactive query service that runs standard SQL (Trino/Presto engine) directly against data in Amazon S3. There is nothing to provision; you define a table schema in the AWS Glue Data Catalog and query immediately. Athena is the default answer whenever a question says "run SQL on data already in S3" or "analyze CloudTrail / VPC Flow Logs / ALB access logs without a database."

Pricing is the load-bearing fact: $5 per terabyte of data scanned. There are no compute hours and no idle charge, so the only lever is reducing bytes read.

Cost-reduction techniqueEffect on bytes scanned
Columnar format (Parquet or ORC)Reads only needed columns; 30–90% less
Partition by query predicate (e.g., date)Skips irrelevant partitions entirely
Compress files (Snappy, gzip)Fewer bytes per row read
Avoid SELECT *; project specific columnsReads fewer columns

Worked example: A 10 TB raw CSV log queried with SELECT * scans the full 10 TB ($50). Converting to date-partitioned Parquet and selecting three columns for one day might scan 5 GB instead, dropping the same query to a few cents.

Amazon OpenSearch Service: Search and Log Analytics

Amazon OpenSearch Service (the successor to the former Amazon Elasticsearch Service) provides full-text search with relevance scoring, log and trace analytics, and OpenSearch Dashboards (the Kibana successor) for visualization. It is the exam's answer for "search application logs and build dashboards," "full-text search with relevance ranking," or "real-time observability."

There are two flavors: provisioned domains (you choose instance types and node counts, scaling to petabytes) and OpenSearch Serverless (auto-scaling collections, no node management). Choose serverless when the question stresses unpredictable load and no capacity planning.

Typical ingestion architectures the exam expects you to recognize:

CloudWatch Logs --(subscription filter)--> Lambda --> OpenSearch
VPC Flow Logs / App Logs --> Kinesis Data Firehose --> OpenSearch
OpenSearch strengthWhen it is the answer
Full-text search + relevance"search a product catalog", "fuzzy text search"
Log / trace analytics"centralized log analysis with dashboards"
Near-real-time indexing"observe events seconds after they occur"

Trap: if the requirement is plain SQL on S3 with no search and no dashboards, that is Athena, not OpenSearch. OpenSearch shines specifically at text relevance and operational log exploration.

Amazon QuickSight and Choosing the Right Tool

Amazon QuickSight is AWS's serverless business intelligence (BI) service for interactive dashboards and visualizations. Its in-memory engine, SPICE (Super-fast, Parallel, In-memory Calculation Engine), pre-loads data for fast charting without hammering the source. QuickSight connects to S3, Athena, Redshift, RDS, and on-premises databases, supports per-session pricing for occasional readers, and can embed dashboards in your own applications. ML Insights adds anomaly detection and forecasting.

QuickSight is the answer for "business users need interactive dashboards/reports" or "embed analytics in our app and pay only for active viewers." It is a presentation layer, not a query engine; it reads from sources like Athena or Redshift.

Pick-the-service matrix

NeedService
Ad-hoc SQL on S3, pay per scanAthena
Full-text search + log dashboardsOpenSearch
Business dashboards / visual BIQuickSight
Structured data warehouse, complex joinsRedshift
Spark / Hadoop big-data ETLEMR
Streaming SQL / Flink in real timeManaged Service for Apache Flink

Combined pattern: a classic data-lake answer is S3 (Parquet) -> Athena (query) -> QuickSight (dashboard). Athena does the serverless querying, QuickSight visualizes, and neither requires a running cluster. If the question instead emphasizes sub-second text search over millions of log lines with relevance ranking, switch the query layer to OpenSearch.

The Glue Data Catalog and the Wider Analytics Family

Athena, Redshift Spectrum, and EMR all read table definitions from the AWS Glue Data Catalog, a central metadata store. A Glue crawler scans S3 and infers schema and partitions automatically, so "discover the schema of files in S3 without writing DDL" points to a Glue crawler feeding the catalog, which Athena then queries. AWS Glue itself is the serverless extract-transform-load (ETL) service for converting raw CSV/JSON into the partitioned Parquet that makes Athena cheap.

Know where each neighboring service fits so you can eliminate distractors:

  • Amazon Redshift is a petabyte-scale columnar data warehouse for repeated, complex joins and aggregations over structured data; pick it over Athena when the same heavy queries run constantly and predictable performance matters more than pay-per-scan. Redshift Spectrum lets a Redshift cluster query S3 directly, blending warehouse and lake.
  • Amazon EMR runs managed Apache Spark, Hadoop, Hive, and Presto for large-scale custom ETL and machine learning; choose it when the workload needs Spark code or framework control that Athena's SQL-only model cannot express.
  • Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) runs continuous SQL or Flink over streaming data for real-time aggregations.
  • Amazon Kinesis Data Firehose is the near-real-time loader that buffers streaming records and delivers them to S3, Redshift, or OpenSearch, often the ingestion arrow drawn in analytics architecture diagrams.
Question signalService
"infer schema from S3 automatically"Glue crawler + Data Catalog
"serverless ETL to Parquet"AWS Glue
"constant complex BI queries, predictable perf"Redshift
"Spark / Hadoop custom processing"EMR
"real-time streaming aggregation"Managed Service for Apache Flink
Test Your Knowledge

A data analyst must run ad-hoc SQL queries on JSON and CSV files already sitting in S3, with no infrastructure to manage and pay only for what is queried. Which service fits?

A
B
C
D
Test Your Knowledge

An Athena query over a 10 TB raw CSV dataset is both slow and expensive. Which change most reduces the cost per query?

A
B
C
D
Test Your Knowledge

An operations team needs to centralize application logs, run full-text searches with relevance ranking, and build interactive log dashboards. Which service is the best fit?

A
B
C
D
Test Your Knowledge

Business users need interactive dashboards built on data already queryable through Athena, with fast rendering and billing only for the people who actually view reports. Which service should be added?

A
B
C
D