A team is building a document-based AI workflow and needs durable storage for raw files, curated documents, logs, and artifacts. Which AWS service is the most common storage foundation?

Amazon S3. Amazon S3 is durable object storage commonly used for data lakes, documents, datasets, logs, and artifacts around AI and analytics workflows.

A company needs to catalog and transform raw data before it can be used for analytics or ML preparation. Which AWS service is most relevant?

AWS Glue. AWS Glue supports data integration, cataloging, and transformation. It helps move data from raw storage toward usable analytics and ML preparation.

A RAG design needs retrieval over indexed text and vector representations. Which AWS service can support search and vector search patterns?

Amazon OpenSearch Service. Amazon OpenSearch Service can support search and vector search patterns used for retrieval in some AI applications.

Data Foundation Services: S3, Glue, OpenSearch, and QuickSight

Data foundations before AI decisions

A model or managed AI service can only be as useful as the data and workflow around it. AWS AI Practitioner candidates should recognize the supporting data services that appear around AI solutions. Amazon S3, AWS Glue, AWS Lake Formation, Amazon OpenSearch Service, Amazon QuickSight, Amazon Redshift, and Amazon EMR are not all AI services, but they shape whether AI can be built, governed, searched, and measured responsibly.

Amazon S3 is often the durable storage layer. Organizations use S3 for data lakes, raw files, curated datasets, document repositories, logs, images, transcripts, model artifacts, and retrieval sources. The practitioner should ask how buckets are organized, who owns the data, whether encryption is configured, what lifecycle rules apply, whether sensitive data is present, and how access is controlled with IAM, bucket policies, and related governance tools.
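As a rough illustration of the encryption and lifecycle questions above, the sketch below builds the request payloads that could be passed to boto3's `put_bucket_encryption` and `put_bucket_lifecycle_configuration`. The prefixes (`raw/`, `logs/`), the day counts, and the KMS key alias are illustrative assumptions, not recommendations from this guide. The code only builds dictionaries and never calls AWS.

```python
import json


def encryption_config(kms_key_id: str) -> dict:
    """Payload for default SSE-KMS encryption on a bucket (hypothetical key id)."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_id,
                },
                "BucketKeyEnabled": True,
            }
        ]
    }


def lifecycle_config(raw_prefix: str = "raw/", logs_prefix: str = "logs/",
                     raw_ia_days: int = 30, logs_expire_days: int = 365) -> dict:
    """Payload with two lifecycle rules; prefixes and day counts are assumptions."""
    return {
        "Rules": [
            {
                # Move older raw files to a cheaper storage class.
                "ID": "raw-to-infrequent-access",
                "Filter": {"Prefix": raw_prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": raw_ia_days, "StorageClass": "STANDARD_IA"}],
            },
            {
                # Delete logs once the assumed retention period has passed.
                "ID": "expire-old-logs",
                "Filter": {"Prefix": logs_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": logs_expire_days},
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(encryption_config("alias/example-data-key"), indent=2))
    print(json.dumps(lifecycle_config(), indent=2))
```

With boto3 installed and credentials configured, these payloads would be passed as `ServerSideEncryptionConfiguration=` and `LifecycleConfiguration=` respectively, together with a `Bucket=` name. Who may call those APIs is still a matter for IAM and bucket policies.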

AWS Glue helps with data integration, cataloging, and transformation. The Glue Data Catalog helps teams discover tables and metadata, while crawlers and ETL jobs support analytics and ML preparation workflows. A practitioner does not need to implement Glue jobs for this exam, but should know that messy data rarely becomes useful to AI without cataloging, cleaning, and transformation. Glue is part of the path from raw storage to usable data.
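To make "cataloging, cleaning, and transformation" concrete, here is a conceptual sketch in plain Python. It does not use the Glue API. `clean` and `infer_schema` are hypothetical helpers that loosely mimic what an ETL job and a crawler do: normalize records, drop unusable rows, and record which types each column contains.

```python
def clean(rows, required):
    """Normalize column names, trim strings, and drop rows missing required fields."""
    cleaned = []
    for row in rows:
        normalized = {
            key.strip().lower(): (value.strip() if isinstance(value, str) else value)
            for key, value in row.items()
        }
        if all(normalized.get(col) not in (None, "") for col in required):
            cleaned.append(normalized)
    return cleaned


def infer_schema(rows):
    """Map each column to the sorted Python type names observed (None is ignored)."""
    observed = {}
    for row in rows:
        for col, value in row.items():
            if value is not None:
                observed.setdefault(col, set()).add(type(value).__name__)
    return {col: sorted(types) for col, types in observed.items()}


if __name__ == "__main__":
    raw = [
        {"Claim_ID ": " C-1 ", "Amount": 120.5},
        {"Claim_ID ": None, "Amount": 80},       # dropped: no claim id
        {"Claim_ID ": "C-3", "Amount": "n/a"},   # kept, but the amount is suspicious
    ]
    usable = clean(raw, required=["claim_id"])
    print(usable)
    print(infer_schema(usable))
```

A column that reports more than one type, as `amount` does here, is the kind of inconsistency a catalog can surface before the data reaches analytics or ML preparation.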

AWS Lake Formation is commonly discussed alongside data lake governance. It can help manage permissions and access to data in a lake-centered architecture. If an AI application retrieves from a data lake, the access design matters: users should not receive answers based on data they are not authorized to view. Data lake permissions, identity, tags, and auditability should be settled before AI systems are connected to broad sources.

Amazon OpenSearch Service supports search and analytics workloads and can support vector search patterns used in some RAG applications. A vector store is not the same as a foundation model. It stores representations that help retrieve relevant content. The practitioner should ask what data will be indexed, how often it updates, how permissions are enforced, how retrieval quality is measured, and whether a managed service such as Amazon Bedrock Knowledge Bases can reduce integration work.
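The difference between a vector store and a model can be shown with a toy, in-memory sketch. It is not the OpenSearch API. The three-dimensional vectors are hand-written stand-ins for embeddings that a real embedding model would produce, and `IndexedChunk`, `retrieve`, and the group names are hypothetical. The sketch filters by permission first and then ranks by cosine similarity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    doc_id: str
    text: str
    vector: tuple              # stand-in for an embedding
    allowed_groups: frozenset  # groups permitted to see this chunk


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index, query_vector, user_groups, k=2):
    """Return the top-k chunks the user may see, ranked by similarity to the query."""
    visible = [chunk for chunk in index if chunk.allowed_groups & user_groups]
    ranked = sorted(visible, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:k]


INDEX = [
    IndexedChunk("hr-001", "Parental leave policy", (0.9, 0.1, 0.0), frozenset({"hr"})),
    IndexedChunk("kb-007", "How to reset a password", (0.1, 0.9, 0.0), frozenset({"hr", "support"})),
    IndexedChunk("kb-012", "VPN setup guide", (0.0, 0.8, 0.2), frozenset({"support"})),
]

if __name__ == "__main__":
    for chunk in retrieve(INDEX, (0.0, 1.0, 0.0), frozenset({"support"})):
        print(chunk.doc_id, chunk.text)
```

Nothing here generates text. The store only returns stored content, which a foundation model could then use as grounding. Filtering before ranking means an unauthorized chunk can never reach the prompt, however similar it is to the query.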

Amazon QuickSight supports business intelligence dashboards and analytics experiences. It helps organizations report trends, inspect metrics, and make decisions from data. QuickSight may also appear around AI projects as the place where usage, quality, adoption, and business outcomes are reported. The practitioner should separate analytics from AI generation. A dashboard that explains sales trends is not automatically an ML model, even if it supports an AI initiative.

Data need	AWS service to recognize	AI relevance
Store raw and curated objects	Amazon S3	Source documents, datasets, artifacts, logs, and data lake storage.
Catalog and transform data	AWS Glue	Makes data discoverable and usable for analytics or ML preparation.
Govern data lake access	AWS Lake Formation	Helps align permissions with sensitive data and user access boundaries.
Search text or vectors	Amazon OpenSearch Service	Supports retrieval, search, and vector search patterns for applications.
Visualize metrics and outcomes	Amazon QuickSight	Reports adoption, quality, cost, and business impact around AI workflows.
Analyze warehouse or big data	Amazon Redshift or Amazon EMR	Supports analytics foundations that may feed AI or ML decisions.

Data quality should be reviewed before service selection. For document AI, are scans readable and document types consistent? For personalization, are user events complete and consented? For custom ML, are labels reliable and current? For RAG, are documents authoritative, deduplicated, chunked well, and kept fresh? A model choice cannot compensate for unowned or untrusted source data.
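For the RAG questions in particular, deduplication and chunking can be sketched in a few lines. This is a minimal illustration under simple assumptions: duplicates are detected only after whitespace and case normalization, so near-duplicates would need other techniques, and chunks are fixed-size word windows with overlap. The function names and default sizes are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first copy of each document, comparing normalized content hashes."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def chunk_words(text: str, size: int = 50, overlap: int = 10):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    docs = ["Refund policy v2", "refund   POLICY v2", "Shipping policy"]
    print(deduplicate(docs))
    print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
```

Overlap is one common way to avoid cutting a sentence's context at a chunk boundary. Whether chunks are "chunked well" still has to be judged by retrieval quality on real questions.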

Security and governance controls belong in the data foundation. Sensitive data in S3 might need AWS KMS encryption, Amazon Macie discovery, AWS CloudTrail logs, bucket access review, retention policies, and incident response plans. Data catalogs and indexes need ownership. If a document is deleted or access is revoked, the AI system should not keep serving stale or unauthorized content from an index.
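The stale-index risk can be illustrated with a small reconciliation sketch. It models the index and the system of record as plain dictionaries, which is an assumption for illustration only. `reconcile_index` is a hypothetical helper, not an AWS API. Entries whose source document is gone are dropped, and permissions are refreshed from the source.

```python
def reconcile_index(index: dict, source: dict) -> dict:
    """Return a new index that matches the current system of record.

    index:  doc_id -> {"text": str, "allowed_groups": set}
    source: doc_id -> set of groups currently allowed (absent means deleted)
    """
    reconciled = {}
    for doc_id, entry in index.items():
        if doc_id not in source:
            continue  # document was deleted at the source, so stop serving it
        # Copy the entry and overwrite permissions with the source's current view.
        reconciled[doc_id] = {**entry, "allowed_groups": set(source[doc_id])}
    return reconciled


if __name__ == "__main__":
    index = {
        "doc-1": {"text": "Old pricing sheet", "allowed_groups": {"sales"}},
        "doc-2": {"text": "Salary bands", "allowed_groups": {"hr", "managers"}},
    }
    source = {"doc-2": {"hr"}}  # doc-1 deleted; managers lost access to doc-2
    print(reconcile_index(index, source))
```

How often such a reconciliation runs determines how long revoked content can linger, so the sync interval is itself a governance decision.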

A practical data readiness checklist:

Identify the system of record and data owner for each source.
Classify sensitivity, retention, residency, and consent requirements.
Check data quality, freshness, duplicates, missing values, and label reliability.
Decide whether data will be extracted, indexed, embedded, transformed, reported, or used for training.
Verify IAM, encryption, audit logs, and least-privilege access before connecting AI services.
Define metrics that show whether the AI project improves a business outcome, not just whether it runs.
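
The checklist above can be tracked per data source with a small record. The sketch below is one possible shape, with hypothetical field names that mirror the six items. It is a bookkeeping aid, not a substitute for the reviews themselves.

```python
from dataclasses import dataclass, fields


@dataclass
class SourceReadiness:
    name: str
    owner_identified: bool = False          # system of record and data owner
    sensitivity_classified: bool = False    # sensitivity, retention, residency, consent
    quality_checked: bool = False           # quality, freshness, duplicates, labels
    usage_decided: bool = False             # extract, index, embed, transform, report, train
    access_controls_verified: bool = False  # IAM, encryption, audit logs, least privilege
    outcome_metrics_defined: bool = False   # business outcome metrics


def open_items(source: SourceReadiness) -> list:
    """List checklist fields that are still unchecked, in checklist order."""
    return [
        f.name for f in fields(source)
        if f.name != "name" and not getattr(source, f.name)
    ]


def is_ready(source: SourceReadiness) -> bool:
    """A source is ready only when every checklist item is complete."""
    return not open_items(source)


if __name__ == "__main__":
    claims = SourceReadiness("claims-documents", owner_identified=True, quality_checked=True)
    print(open_items(claims))
    print(is_ready(claims))
```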

The practitioner should also ask whether analytics is enough. If leaders need a dashboard showing claim volume by category, QuickSight may solve the problem without ML. If users need searchable documents, Amazon Kendra or Amazon OpenSearch Service may be enough. If they need generated answers grounded in those documents, add a foundation model layer with controls. The best design is the one that meets the need with the least unnecessary AI risk.

In Skill Builder practice, pair AI labs with data service labs. Notice where source files live, how permissions are granted, how indexes are built, and how results are monitored. This is the bridge between cloud fundamentals and AI fluency: AWS AI solutions are still cloud systems with storage, network, identity, cost, and governance responsibilities.

AWS AI Practitioner Study Guide

AWS Certified AI Practitioner

7.6 Data Foundation Services: S3, Glue, OpenSearch, and QuickSight
