3.1 Data Collection, EDA, Preprocessing, and Feature Concepts
Key Takeaways
- Good ML work starts with a clear business question, a usable target outcome, and data that is legally and operationally appropriate for the use case.
- EDA helps teams find missing values, skew, leakage, class imbalance, outliers, bias signals, and data quality issues before training begins.
- Preprocessing and feature concepts are in scope at practitioner-awareness level: know what they do, why they matter, and which risks they introduce.
- AWS services such as Amazon S3, AWS Glue, Lake Formation, SageMaker Canvas, and SageMaker Data Wrangler support governed data preparation without requiring the practitioner to become a data engineer.
From business question to usable data
For the AWS Certified AI Practitioner scope, data preparation is not about writing feature engineering code or building a data pipeline from scratch. It is about recognizing what must be true before a team can responsibly train, buy, or use an AI/ML solution. Start with the business question. A model for fraud review, churn prediction, document classification, demand forecasting, or image quality control needs a measurable outcome, a feedback path, and data that reflects the decision being supported.
A practitioner should ask whether the target is observable. If a team wants to predict customer dissatisfaction, does it have labeled examples such as survey scores, support escalations, cancellation events, or complaint categories? If it wants to detect defective parts from images, are defect labels consistent across inspectors? If it wants to forecast demand, are stockouts, promotions, holidays, and returns recorded in a way that explains historical sales? A vague goal such as "improve operations" is not enough.
Data can be structured, semi-structured, or unstructured. Structured data often lives in tables, such as orders, claims, customers, transactions, sensor readings, or account balances. Semi-structured data, such as JSON logs or clickstream events, carries a flexible schema. Unstructured data includes text, images, audio, video, and scanned documents. AWS services help store and catalog these sources. Amazon S3 is commonly used as durable object storage for datasets. AWS Glue can catalog and prepare data. Lake Formation can help govern access to data lakes.
Amazon Textract, Transcribe, Comprehend, and Rekognition can extract or analyze signals from documents, speech, text, and images when a managed AI service fits.
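As a concrete illustration, the short sketch below lists dataset objects stored in S3 using the boto3 SDK. The bucket name and prefix are hypothetical placeholders; the call itself, list_objects_v2, is a standard S3 API.

```python
# Minimal sketch: inspecting dataset objects in Amazon S3 with boto3.
# The bucket name and prefix are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="example-ml-datasets", Prefix="claims/raw/")

# Print each object's key and size to confirm the dataset is where
# the team expects it to be and roughly how large it is.
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```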
| Data readiness question | Why it matters | Practitioner signal |
|---|---|---|
| Is the outcome defined? | Training needs a target or objective | Business owner can explain success and failure cases |
| Is the data permitted for this use? | Privacy, consent, and policy can block use | Data owner and security team approved the scope |
| Are labels reliable? | Bad labels teach the wrong pattern | Labeling rules are documented and reviewed |
| Is the sample representative? | Models fail on groups not reflected in data | Data covers regions, seasons, channels, and user groups |
| Is leakage controlled? | A model may learn future or forbidden information | Features are checked against deployment-time availability |
| Is there enough history? | Sparse data limits learning and validation | Team can explain coverage and known gaps |
EDA as risk discovery
Exploratory data analysis, often shortened to EDA, is the review phase where teams inspect distributions, relationships, missing values, duplicates, class balance, outliers, and unexpected patterns. The AI Practitioner does not need to calculate every statistic by hand. The important skill is knowing what EDA is supposed to reveal and why a project should not skip it.
Missing values may be harmless, meaningful, or damaging. A blank income field may mean the customer did not provide it, the system failed to collect it, or the value is not applicable. An outlier may be a true high-value transaction, a sensor failure, or a data entry error. Class imbalance matters when the event being predicted is rare, such as fraud, equipment failure, or safety incidents. A naive model can look accurate by predicting the majority class and still be useless for the business.
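The sketch below shows how a few of these EDA checks might look in pandas. The file, column names (amount, is_fraud), and thresholds are hypothetical; the point is the kind of question each check answers.

```python
# Minimal EDA sketch with pandas; dataset and column names are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")

# Missing values: how many blanks per column?
print(df.isna().sum().sort_values(ascending=False))

# Class balance: a rare positive class (e.g., fraud) warns against
# trusting plain accuracy as a quality metric.
print(df["is_fraud"].value_counts(normalize=True))

# Outliers: a simple interquartile-range screen on a numeric column.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} rows flagged as potential outliers")

# Duplicates: exact repeats can inflate apparent data volume.
print(f"{df.duplicated().sum()} duplicate rows")
```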
EDA also helps reveal bias and fairness concerns. If historical approval data reflects biased human decisions, a model trained on that history can reproduce the pattern. If a dataset underrepresents certain customer groups, languages, locations, or device types, the model may perform unevenly. A practitioner should ask how the team evaluated performance across important segments, not just across the whole dataset.
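A minimal sketch of a segment-level review follows, assuming a hypothetical evaluation frame with a region column: compute quality per segment rather than only overall.

```python
# Sketch: comparing model quality across segments instead of only overall.
# The evaluation frame and its column names are hypothetical.
import pandas as pd

eval_df = pd.DataFrame({
    "region":    ["north", "north", "south", "south", "south"],
    "actual":    [1, 0, 1, 1, 0],
    "predicted": [1, 0, 0, 1, 1],
})

# Per-segment accuracy; uneven numbers here are the fairness signal
# the practitioner should ask about.
per_segment = (
    eval_df.assign(correct=eval_df["actual"] == eval_df["predicted"])
           .groupby("region")["correct"]
           .mean()
)
print(per_segment)
```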
Preprocessing and feature concepts
Preprocessing converts raw data into a form that a model or managed service can use. Common steps include cleaning invalid records, handling missing values, normalizing numeric ranges, converting categories into model-friendly values, splitting text into tokens, resizing images, and separating training and validation data. Feature engineering means creating useful input signals, such as days since last order, number of failed logins in the past hour, average basket size, or a flag for weekend transactions.
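A brief sketch of how some of these steps might look in pandas and scikit-learn follows. The file, column names, and snapshot date are hypothetical; the split-before-scale ordering is the part worth noticing.

```python
# Minimal preprocessing sketch with pandas and scikit-learn.
# The file, column names, and snapshot date are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Handle a missing numeric value: fill with the column median.
df["basket_value"] = df["basket_value"].fillna(df["basket_value"].median())

# Feature engineering: signals like those named above.
snapshot = pd.Timestamp("2024-01-01")  # hypothetical "as of" date
df["days_since_last_order"] = (snapshot - df["order_date"]).dt.days
df["is_weekend"] = (df["order_date"].dt.dayofweek >= 5).astype(int)

# Convert a category into model-friendly values (one-hot encoding).
df = pd.get_dummies(df, columns=["channel"])

# Separate training and validation data before fitting anything.
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, valid_df = train_df.copy(), valid_df.copy()

# Normalize numeric ranges using statistics from the training split only,
# so nothing about the validation data influences training.
num_cols = ["basket_value", "days_since_last_order"]
scaler = StandardScaler()
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])
valid_df[num_cols] = scaler.transform(valid_df[num_cols])
```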
These tasks are out of scope as implementation work for the target candidate, but the concepts are in scope. If a model will be used in real time, a feature must be available at real-time inference. A feature based on tomorrow's cancellation status would be leakage. A feature that uses protected personal information may create compliance or fairness concerns. A feature that is expensive to compute may make the solution too slow or costly for production.
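One lightweight way to reason about leakage is to record, for each feature, when its value becomes known, then drop anything that is not available at prediction time. The catalog below is a hypothetical, hand-maintained example of that check.

```python
# Sketch: a deployment-time availability check for features.
# The feature catalog is a hypothetical, hand-maintained mapping
# of each feature to when its value becomes known.
PREDICTION_TIME = "at_transaction"  # when the model must answer

feature_availability = {
    "days_since_last_order": "at_transaction",
    "failed_logins_past_hour": "at_transaction",
    "cancelled_next_month": "after_outcome",  # future info: leakage
    "chargeback_filed": "after_outcome",      # future info: leakage
}

# Any feature whose value arrives after the prediction moment must be
# dropped before training, or the model will learn from the future.
leaky = [f for f, when in feature_availability.items() if when != PREDICTION_TIME]
print("Features to drop before training:", leaky)
```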
Checklist for a practitioner review:
- Confirm the business owner can name the prediction or classification target.
- Confirm the data owner approved the use of each source.
- Ask whether sensitive fields, personally identifiable information, and regulated data are needed or can be minimized.
- Ask how missing values, outliers, duplicates, and class imbalance were handled.
- Ask whether the train, validation, and test splits reflect the deployment situation.
- Ask whether any feature would be unavailable when the model is actually used.
- Ask whether segment-level performance was reviewed for fairness and quality.
AWS tooling can support these steps without requiring the practitioner to become a builder. SageMaker Canvas can help business analysts explore data and create no-code ML models for some use cases. SageMaker Data Wrangler can help prepare and visualize data in a more technical workflow. SageMaker Ground Truth can support data labeling. The service choice should match team skill, governance needs, and the risk of the decision.
The key judgment is simple: poor data readiness should stop or narrow an AI project. More model complexity will not rescue undefined outcomes, unapproved data, biased labels, missing deployment-time signals, or a weak cost-benefit case.
Check your understanding
- A business team wants an ML model to improve customer happiness but cannot define a measurable target or source labels. What is the best practitioner response?
- During EDA, the team finds that only 1 percent of historical transactions are fraud cases. Which risk should the practitioner recognize?
- Which example is most likely to be data leakage?