7.6 Data Foundation Services: S3, Glue, OpenSearch, and QuickSight

Key Takeaways

  • AI service selection depends on data foundations: storage, cataloging, transformation, search, analytics, permissions, and monitoring.
  • Amazon S3 commonly serves as durable object storage for data lakes, documents, training data, model artifacts, and retrieval sources.
  • AWS Glue and Lake Formation help organize, catalog, transform, and govern data so AI and analytics teams can use it responsibly.
  • Amazon OpenSearch Service can support search and vector search patterns, while QuickSight supports business intelligence and analytics experiences.
  • Practitioners should evaluate data quality, lineage, sensitivity, ownership, freshness, and access before approving an AI service.

Last updated: May 2026

Data foundations before AI decisions

A model or managed AI service can only be as useful as the data and workflow around it. AWS AI Practitioner candidates should recognize the supporting data services that appear around AI solutions. Amazon S3, AWS Glue, AWS Lake Formation, Amazon OpenSearch Service, Amazon QuickSight, Amazon Redshift, and Amazon EMR are not all AI services, but they determine whether an AI solution can be built, governed, searched, and measured responsibly.

Amazon S3 is often the durable storage layer. Organizations use S3 for data lakes, raw files, curated datasets, document repositories, logs, images, transcripts, model artifacts, and retrieval sources. The practitioner should ask how buckets are organized, who owns the data, whether encryption is configured, what lifecycle rules apply, whether sensitive data is present, and how access is controlled with IAM, bucket policies, and related governance tools.
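
To make the access-control question concrete, here is a minimal sketch of a bucket policy built as a Python dictionary. The bucket name, account ID, role name, and prefix are hypothetical placeholders, and a real policy would be reviewed alongside IAM policies, encryption settings, and lifecycle rules. It grants one role read access to a single prefix and denies any request that does not use TLS.

```python
import json


def build_bucket_policy(bucket_name: str, reader_role_arn: str, prefix: str) -> dict:
    """Return a policy dict: read-only access to one prefix, TLS required."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Least privilege: one role, one action, one prefix.
                "Sid": "AllowReadCuratedPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/{prefix}*",
            },
            {
                # Reject any request that is not sent over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


# Hypothetical names used only for illustration.
policy = build_bucket_policy(
    "example-ai-data-lake",
    "arn:aws:iam::111122223333:role/ExampleRagReader",
    "curated/",
)
print(json.dumps(policy, indent=2))
```

The sketch only builds the document. Attaching it to a bucket, and deciding who owns that decision, is part of the governance review described above.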

AWS Glue helps with data integration, cataloging, and transformation. The Glue Data Catalog helps teams discover tables and metadata, crawlers can populate that catalog by scanning data stores, and ETL jobs support analytics and ML preparation workflows. A practitioner does not need to implement Glue jobs for this exam, but should know that messy data rarely becomes useful to AI without cataloging, cleaning, and transformation. Glue is part of the path from raw storage to usable data.
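
The idea of cataloging can be illustrated without any AWS calls. The sketch below is plain Python, not the Glue API: it scans a small, made-up CSV sample, guesses a type for each column, and counts missing values, which is roughly the kind of metadata a catalog entry makes discoverable. The table name, column names, and sample rows are assumptions for illustration.

```python
import csv
import io

# Made-up sample data; the second row is missing an amount.
RAW_CSV = """claim_id,amount,category
1001,250.00,auto
1002,,home
1003,90.5,auto
"""


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    for type_name, caster in (("int", int), ("float", float)):
        try:
            for value in non_empty:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def build_catalog_entry(table_name, csv_text):
    """Summarize a CSV sample as a simple catalog-style record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = {}
    for name in reader.fieldnames or []:
        values = [row[name] or "" for row in rows]
        columns[name] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if not v.strip()),
        }
    return {"table": table_name, "row_count": len(rows), "columns": columns}


entry = build_catalog_entry("claims_raw", RAW_CSV)
print(entry)
```

Even this toy entry surfaces a question a practitioner should ask before approving an AI use case: why is an amount missing, and who is responsible for fixing it?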

AWS Lake Formation is the service most often associated with data lake governance. It helps manage permissions and access to data in a lake-centered architecture. If an AI application retrieves from a data lake, the access design matters: users should not receive answers based on data they are not authorized to view. Data lake permissions, identity, tags, and auditability should be settled before AI systems are connected to broad sources.
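
The principle that answers must respect access boundaries can be sketched in a few lines. This is plain Python, not the Lake Formation API, and the document IDs and tag names are hypothetical. It filters candidate documents by tag before anything is passed to a model, and it denies untagged documents by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    tags: frozenset


def authorized_documents(documents, user_tags):
    """Keep documents whose tags are all granted to the user.

    Untagged documents are excluded, so unclassified data is denied by default.
    """
    return [d for d in documents if d.tags and d.tags <= user_tags]


# Hypothetical corpus and user grants.
CORPUS = [
    Document("faq-1", "How to file a claim.", frozenset({"public"})),
    Document("claims-7", "Internal claim notes.", frozenset({"claims"})),
    Document("hr-3", "Salary bands.", frozenset({"hr", "confidential"})),
    Document("misc-9", "Unclassified upload.", frozenset()),
]

visible = authorized_documents(CORPUS, frozenset({"public", "claims"}))
print([d.doc_id for d in visible])
```

Filtering before retrieval, rather than after generation, keeps unauthorized text out of the prompt entirely.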

Amazon OpenSearch Service supports search and analytics workloads and can support the vector search patterns used in some RAG applications. A vector store is not the same as a foundation model. It stores vector representations (embeddings) that help retrieve relevant content. The practitioner should ask what data will be indexed, how often it updates, how permissions are enforced, how retrieval quality is measured, and whether a managed service such as Amazon Bedrock Knowledge Bases can reduce integration work.
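
To show what "stores representations that help retrieve relevant content" means, here is a minimal in-memory sketch of vector retrieval using cosine similarity. It is not the OpenSearch API, and the three-dimensional vectors and document names are invented for illustration; real embeddings come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, index, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Hypothetical index: document ID -> toy embedding.
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "claims-howto": [0.1, 0.9, 0.1],
    "office-map": [0.0, 0.1, 0.9],
}

results = top_k([0.8, 0.2, 0.0], INDEX, k=2)
print(results)
```

Note that nothing here generates text. The store only ranks existing content, which is why it is a separate component from the foundation model.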

Amazon QuickSight supports business intelligence dashboards and analytics experiences. It helps organizations report trends, inspect metrics, and make decisions from data. QuickSight may also appear around AI projects as the place where usage, quality, adoption, and business outcomes are reported. The practitioner should separate analytics from AI generation. A dashboard that explains sales trends is not automatically an ML model, even if it supports an AI initiative.

Data need                      | AWS service to recognize  | AI relevance
Store raw and curated objects  | Amazon S3                 | Source documents, datasets, artifacts, logs, and data lake storage.
Catalog and transform data     | AWS Glue                  | Makes data discoverable and usable for analytics or ML preparation.
Govern data lake access        | Lake Formation            | Helps align permissions with sensitive data and user access boundaries.
Search text or vectors         | Amazon OpenSearch Service | Supports retrieval, search, and vector search patterns for applications.
Visualize metrics and outcomes | Amazon QuickSight         | Reports adoption, quality, cost, and business impact around AI workflows.
Analyze warehouse or big data  | Redshift or EMR           | Supports analytics foundations that may feed AI or ML decisions.

Data quality should be reviewed before service selection. For document AI, are scans readable and document types consistent? For personalization, are user events complete and consented? For custom ML, are labels reliable and current? For RAG, are documents authoritative, deduplicated, chunked well, and kept fresh? A model choice cannot compensate for unowned or untrusted source data.
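
Some of the RAG questions above can be checked mechanically. The following sketch, written under stated assumptions, flags duplicate, stale, and unowned source documents. The field names (`id`, `text`, `updated`, `owner`), the 365-day freshness threshold, and the sample documents are all hypothetical.

```python
import hashlib
from datetime import date


def content_fingerprint(text):
    """Hash text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def review_sources(documents, today, max_age_days=365):
    """Report duplicate, stale, and unowned documents."""
    first_seen = {}
    report = {"duplicates": [], "stale": [], "unowned": []}
    for doc in documents:
        fingerprint = content_fingerprint(doc["text"])
        if fingerprint in first_seen:
            # Record which earlier document this one duplicates.
            report["duplicates"].append((doc["id"], first_seen[fingerprint]))
        else:
            first_seen[fingerprint] = doc["id"]
        if (today - doc["updated"]).days > max_age_days:
            report["stale"].append(doc["id"])
        if not doc.get("owner"):
            report["unowned"].append(doc["id"])
    return report


# Made-up sources for illustration.
SOURCES = [
    {"id": "refund-v2", "text": "Refunds are issued within 14 days.",
     "updated": date(2026, 1, 10), "owner": "claims-team"},
    {"id": "refund-copy", "text": "refunds are  issued within 14 days.",
     "updated": date(2024, 3, 1), "owner": ""},
    {"id": "travel", "text": "Book travel through the portal.",
     "updated": date(2025, 11, 5), "owner": "finance"},
]

report = review_sources(SOURCES, today=date(2026, 5, 1))
print(report)
```

Exact-match fingerprints only catch trivial duplicates. Judging whether a document is authoritative or well chunked still needs human review.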

Security and governance controls belong in the data foundation. Sensitive data in S3 might need AWS KMS encryption, Amazon Macie discovery, AWS CloudTrail logs, bucket access reviews, retention policies, and incident response plans. Data catalogs and indexes need owners. If a document is deleted or access is revoked, the AI system should not keep serving stale or unauthorized content from an index.
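
The last point, keeping an index in step with its sources, amounts to a set comparison. The sketch below assumes each document has a stable ID and that a hypothetical governance process supplies the list of revoked IDs. It works out what to remove from and add to an index.

```python
def reconcile_index(index_ids, source_ids, revoked_ids):
    """Compare an index with its source of truth.

    A document stays indexed only if it still exists in the source
    and access to it has not been revoked.
    """
    valid = source_ids - revoked_ids
    return {
        "remove": sorted(index_ids - valid),
        "add": sorted(valid - index_ids),
    }


# Hypothetical IDs: "b" was deleted at the source, "c" was revoked, "d" is new.
plan = reconcile_index(
    index_ids={"a", "b", "c"},
    source_ids={"a", "c", "d"},
    revoked_ids={"c"},
)
print(plan)
```

How often such a reconciliation runs is a freshness decision that the data owner, not the model, has to make.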

A practical data readiness checklist:

  • Identify the system of record and data owner for each source.
  • Classify sensitivity, retention, residency, and consent requirements.
  • Check data quality, freshness, duplicates, missing values, and label reliability.
  • Decide whether data will be extracted, indexed, embedded, transformed, reported, or used for training.
  • Verify IAM, encryption, audit logs, and least-privilege access before connecting AI services.
  • Define metrics that show whether the AI project improves a business outcome, not just whether it runs.
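
One way to make the checklist above trackable is to record it as a small data structure. This is a sketch only, and the field names are hypothetical shorthand for the six checklist items.

```python
from dataclasses import dataclass, fields


@dataclass
class DataReadiness:
    """One flag per checklist item; everything starts unverified."""
    owner_identified: bool = False
    sensitivity_classified: bool = False
    quality_checked: bool = False
    usage_decided: bool = False
    access_controls_verified: bool = False
    outcome_metrics_defined: bool = False

    def open_items(self):
        """Names of checklist items that are still unverified."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self):
        """True only when every item has been verified."""
        return not self.open_items()


review = DataReadiness(
    owner_identified=True,
    sensitivity_classified=True,
    quality_checked=True,
    usage_decided=True,
)
print(review.ready(), review.open_items())
```

Defaulting every flag to False means a source is treated as not ready until someone explicitly confirms each item.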

The practitioner should also ask whether analytics is enough. If leaders need a dashboard showing claim volume by category, QuickSight may solve the problem without ML. If users need searchable documents, Amazon Kendra or OpenSearch Service may be enough. If they need generated answers grounded in those documents, add a foundation model layer with controls. The best design is the one that meets the need with the least unnecessary AI risk.
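
The reasoning in this paragraph can be written down as a simple lookup. The `Need` categories and the recommendation strings are an illustrative summary of the text, not an official decision tree.

```python
from enum import Enum


class Need(Enum):
    REPORT_METRICS = "report metrics"
    FIND_DOCUMENTS = "find documents"
    GENERATE_GROUNDED_ANSWERS = "generate grounded answers"


# Each need maps to the simplest approach the text describes for it.
RECOMMENDATION = {
    Need.REPORT_METRICS: "Amazon QuickSight dashboard; no ML required",
    Need.FIND_DOCUMENTS: "Amazon Kendra or Amazon OpenSearch Service search",
    Need.GENERATE_GROUNDED_ANSWERS: "Search plus a foundation model layer with controls",
}


def recommend(need: Need) -> str:
    """Return the least complex approach that meets the stated need."""
    return RECOMMENDATION[need]


print(recommend(Need.REPORT_METRICS))
```

Starting from the need rather than from the technology keeps generative AI out of designs that only require reporting or search.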

In AWS Skill Builder practice, pair AI labs with data service labs. Notice where source files live, how permissions are granted, how indexes are built, and how results are monitored. This is the bridge between cloud fundamentals and AI fluency: AWS AI solutions are still cloud systems with storage, networking, identity, cost, and governance responsibilities.

Test Your Knowledge

A team is building a document-based AI workflow and needs durable storage for raw files, curated documents, logs, and artifacts. Which AWS service is the most common storage foundation?


A company needs to catalog and transform raw data before it can be used for analytics or ML preparation. Which AWS service is most relevant?


A RAG design needs retrieval over indexed text and vector representations. Which AWS service can support search and vector search patterns?
