2.3 Data Types, Labels, Structure, and Quality
Key Takeaways
- Data type and structure strongly influence which AWS AI service or ML path is practical.
- Labels must be accurate, current, and aligned with the decision the business wants to improve.
- Data quality issues such as missing values, skew, duplicates, stale records, and leakage can make a model misleading.
- Governance starts before modeling with access control, privacy review, retention rules, and clear data ownership.
Data Shape Drives the Architecture
Structured data is organized into predictable fields, such as rows and columns in a database, CSV files in Amazon S3, or tables in a warehouse. Tabular fields might include customer age, region, product type, order amount, and outcome. This data often supports classification, regression, forecasting, reporting, and segmentation. It is usually easier to validate than free-form text, but it can still contain bias, missing values, and stale definitions.
Unstructured data does not arrive in neat columns. Documents, emails, chats, images, audio files, videos, and PDFs require extraction or transformation before many analytics and ML tasks. AWS managed AI services can help: Amazon Textract for documents, Amazon Transcribe for speech, Amazon Rekognition for image and video analysis, Amazon Comprehend for text insights, and Amazon Translate for language translation.
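As a sketch of how these managed services are typically called, the snippet below sends a short piece of text to Amazon Comprehend through boto3. The region and the sample text are placeholder assumptions; substitute your own.

```python
import boto3

# Assumed region; use the region where your workload runs.
comprehend = boto3.client("comprehend", region_name="us-east-1")

# Hypothetical customer feedback text.
response = comprehend.detect_sentiment(
    Text="The replacement part arrived late and the packaging was damaged.",
    LanguageCode="en",
)

# The response includes an overall label plus per-class confidence scores.
print(response["Sentiment"])       # e.g. NEGATIVE
print(response["SentimentScore"])  # Positive / Negative / Neutral / Mixed scores
```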
Time-series data is ordered by time and is common in demand planning, sensor monitoring, capacity planning, and financial activity. The time order matters. Randomly mixing past and future records can create leakage, where the model appears accurate because it saw information that would not exist at prediction time. A practitioner should ask how training, validation, and reporting preserve the timeline.
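A minimal way to respect the timeline is to split on a cutoff date rather than randomly. The sketch below assumes a pandas DataFrame with an `order_date` column; the file name and cutoff are illustrative.

```python
import pandas as pd

# Hypothetical tabular data with a timestamp column.
df = pd.read_csv("orders.csv", parse_dates=["order_date"])
df = df.sort_values("order_date")

# Train on everything before the cutoff and validate on everything after,
# so the model never sees records from "the future" during training.
cutoff = pd.Timestamp("2024-01-01")
train = df[df["order_date"] < cutoff]
valid = df[df["order_date"] >= cutoff]
```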
| Data type | Example | Common first AWS consideration | Key question |
|---|---|---|---|
| Tabular | Orders, claims, tickets | S3, Glue, Redshift, SageMaker Canvas, SageMaker AI | Are fields consistent and labels trustworthy? |
| Text | Reviews, chats, knowledge articles | Comprehend, Bedrock, Amazon Q, OpenSearch Service | Is the text sensitive, current, and grounded? |
| Image or video | Product photos, safety footage | Rekognition or custom vision in SageMaker AI | Are labels and consent clear? |
| Document | Forms, invoices, contracts | Textract, A2I for review, S3 | What must be extracted and verified by humans? |
| Audio | Calls, dictation | Transcribe, Comprehend after transcript | What accuracy is required and what language is used? |
| Time series | Demand, usage, telemetry | Analytics plus SageMaker options | Does validation respect time order? |
Labels are the answers used for supervised learning. A label can be a class, a number, a score, or a final decision. Label quality is a business problem before it is a modeling problem. If agents disagree about ticket categories, a model trained on those categories will reproduce the confusion. If historical loan approvals reflect a policy that has since changed, old labels may no longer match the approved process.
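One inexpensive way to surface label disagreement before training is to measure how often two annotators assign the same category. The sketch below uses scikit-learn's Cohen's kappa on two hypothetical annotation columns.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical: the same eight tickets categorized by two different agents.
agent_a = ["billing", "billing", "outage", "refund", "outage", "billing", "refund", "outage"]
agent_b = ["billing", "refund", "outage", "refund", "billing", "billing", "refund", "outage"]

# Kappa near 1.0 means strong agreement; near 0 means agreement is little
# better than chance, and a model trained on these labels will reproduce
# that confusion.
print(cohen_kappa_score(agent_a, agent_b))
```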
Data structure is also about meaning. A field named status might mean open, closed, returned, charged off, or resolved depending on the system. A region field might be a sales territory in one table and a data residency area in another. AWS Glue can help catalog and prepare data, and Lake Formation can help govern access, but owners still need shared definitions. The glossary matters.
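If the tables are cataloged in AWS Glue, column names, types, and documented meanings can be pulled programmatically and compared against the shared glossary. The database and table names below are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder database and table names.
table = glue.get_table(DatabaseName="sales", Name="orders")["Table"]

# Print each cataloged column with its type and any documented meaning,
# so owners can check the definitions against the business glossary.
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"], col.get("Comment", "(no description)"))
```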
Data quality checks should happen before service selection. Missing values can hide process failures. Duplicates can overweight certain customers. Outliers can represent errors or important edge cases. Skewed data can underrepresent minority groups, rare fraud, low-volume regions, or seasonal events. Stale records can teach a model a business reality that no longer exists. Sensitive fields can create privacy and compliance exposure if copied into prompts or training sets.
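A first-pass check for several of these issues takes only a few lines of pandas. The file and column names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical dataset with a label column and a last-updated timestamp.
df = pd.read_csv("claims.csv", parse_dates=["updated_at"])

# Missing values per column can reveal broken upstream processes.
print(df.isna().mean().sort_values(ascending=False))

# Exact duplicates can overweight some customers in training.
print("duplicate rows:", df.duplicated().sum())

# Class skew: a rare positive class changes metric and sampling choices.
print(df["label"].value_counts(normalize=True))

# Staleness: how old is the most recent record?
print("newest record:", df["updated_at"].max())
```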
Use this data readiness checklist:
- The business owner can explain each important field and label.
- The dataset covers the populations, channels, regions, and time periods where the model will be used.
- Sensitive data is identified, classified, and protected with IAM, encryption, and retention rules (see the detection sketch after this list).
- The team can separate training data from evaluation data without leakage.
- The expected action from a prediction is legal, ethical, and explainable enough for the use case.
- The team knows who monitors quality drift after launch.
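For the sensitive-data item above, one option is Amazon Comprehend's PII detection, which flags spans in free text that may need masking before storage or prompting. The input text below is a placeholder.

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Placeholder text that mixes business content with PII.
text = "Customer Jane Roe at jane.roe@example.com reported a billing issue."

response = comprehend.detect_pii_entities(Text=text, LanguageCode="en")

# Each entity carries a type (NAME, EMAIL, ...), a confidence score, and
# character offsets that can drive masking before storage or prompts.
for entity in response["Entities"]:
    print(entity["Type"], entity["Score"], text[entity["BeginOffset"]:entity["EndOffset"]])
```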
Service implications follow from these checks. If documents are the main input, starting with Amazon Textract plus human review for low-confidence fields is often more practical than building a custom model. If customer support text is the input, Amazon Comprehend, Amazon Q, or a Bedrock workflow may fit depending on whether the goal is extraction, search, summarization, or guided assistance. If the prediction is company-specific, SageMaker Canvas or SageMaker AI may be appropriate.
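A sketch of that document path: call Amazon Textract, then collect any extracted line whose confidence falls below a review threshold. The bucket, key, and the 90.0 threshold are assumptions, and in practice Amazon A2I can manage the human-review loop itself.

```python
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Placeholder S3 location of a scanned form.
response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-forms-bucket", "Name": "invoice-001.png"}}
)

# Route low-confidence lines to human review instead of trusting them;
# the threshold is an illustrative assumption, not a recommendation.
needs_review = [
    block["Text"]
    for block in response["Blocks"]
    if block["BlockType"] == "LINE" and block["Confidence"] < 90.0
]
print(needs_review)
```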
A foundation model workflow has additional data concerns. Prompts can contain sensitive information, retrieved context can be stale, and generated output can sound confident even when source content is weak. Retrieval-augmented generation (RAG) can improve grounding, but only if the knowledge source is curated, access-controlled, refreshed, and logged. A practitioner should ask whether the model can access only the content the user is allowed to see.
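As an illustration of grounding, the sketch below queries an Amazon Bedrock knowledge base and surfaces each retrieved chunk with its source so an answer can be checked against approved content. The knowledge base ID and question are placeholders, and per-user access filtering still has to be enforced by the application and IAM.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholder knowledge base ID and user question.
response = agent_runtime.retrieve(
    knowledgeBaseId="KB12345678",
    retrievalQuery={"text": "What is the remote work reimbursement policy?"},
)

# Show each retrieved passage with its relevance score and source location,
# so reviewers can confirm the answer is grounded in approved content.
for result in response["retrievalResults"]:
    print(result.get("score"), result["location"], result["content"]["text"][:120])
```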
Scenario: a hospital administration group wants to summarize internal policy documents. The data is text, not tabular, and privacy controls matter. A Bedrock or Amazon Q style workflow may help, but the project needs approved sources, IAM access boundaries, logging decisions, and a human review process for policy-sensitive answers. If the user needs exact policy citations, retrieval and source display are more important than creative generation.
Scenario: a manufacturer wants to detect defects from product images. The team needs representative images of good and defective products, including lighting, camera angle, product line, and rare defect types. Amazon Rekognition can cover common image analysis tasks, while a custom model path may be needed for specialized defects. The practitioner should ask how false accepts and false rejects affect safety, cost, and customer trust.
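If the team trains a Rekognition Custom Labels model for specialized defects, inference looks roughly like the sketch below. The project version ARN, bucket, and `MinConfidence` value are placeholder assumptions, and the threshold is exactly where the false-accept versus false-reject tradeoff is set.

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Placeholder model ARN and image location.
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:111122223333:project/defects/version/v1/1234567890",
    Image={"S3Object": {"Bucket": "line-camera-images", "Name": "unit-4711.jpg"}},
    MinConfidence=80.0,  # raising this trades missed defects for fewer false alarms
)

for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])
```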
Review Questions
- A model appears highly accurate because the training data includes a field that is only known after the business decision is complete. What problem is most likely present?
- A company wants to extract fields from scanned forms and route uncertain values to employees for review. Which service pairing is most relevant?
- Which question should a practitioner ask before using historical labels to train a supervised model?