3.1 Data Collection, EDA, Preprocessing, and Feature Concepts

Key Takeaways

  • Good ML work starts with a clear business question, a usable target outcome, and data that is legally and operationally appropriate for the use case.
  • EDA helps teams find missing values, skew, leakage, class imbalance, outliers, bias signals, and data quality issues before training begins.
  • Preprocessing and feature concepts are in scope at practitioner-awareness level: know what they do, why they matter, and which risks they introduce.
  • AWS services such as Amazon S3, AWS Glue, Lake Formation, SageMaker Canvas, and SageMaker Data Wrangler support governed data preparation without making the AI Practitioner a data engineer.
Last updated: May 2026

From business question to usable data

For the AWS Certified AI Practitioner scope, data preparation is not about writing feature engineering code or building a data pipeline from scratch. It is about recognizing what must be true before a team can responsibly train, buy, or use an AI/ML solution. Start with the business question. A model for fraud review, churn prediction, document classification, demand forecasting, or image quality control needs a measurable outcome, a feedback path, and data that reflects the decision being supported.

A practitioner should ask whether the target is observable. If a team wants to predict customer dissatisfaction, does it have labeled examples such as survey scores, support escalations, cancellation events, or complaint categories? If it wants to detect defective parts from images, are defect labels consistent across inspectors? If it wants to forecast demand, are stockouts, promotions, holidays, and returns recorded in a way that explains historical sales? A vague goal such as "improve operations" is not enough.

Data can be structured, semi-structured, or unstructured. Structured data often lives in tables, such as orders, claims, customers, transactions, sensor readings, or account balances. Unstructured data includes text, images, audio, video, and scanned documents. AWS services may help store and catalog these sources. Amazon S3 is commonly used as durable object storage for datasets. AWS Glue can catalog and prepare data. Lake Formation can help govern access to data lakes.

Amazon Textract, Transcribe, Comprehend, and Rekognition can extract or analyze signals from documents, speech, text, and images when a managed AI service fits.

Data readiness question | Why it matters | Practitioner signal
Is the outcome defined? | Training needs a target or objective | Business owner can explain success and failure cases
Is the data permitted for this use? | Privacy, consent, and policy can block use | Data owner and security team approved the scope
Are labels reliable? | Bad labels teach the wrong pattern | Labeling rules are documented and reviewed
Is the sample representative? | Models fail on groups not reflected in data | Data covers regions, seasons, channels, and user groups
Is leakage controlled? | A model may learn future or forbidden information | Features are checked against deployment-time availability
Is there enough history? | Sparse data limits learning and validation | Team can explain coverage and known gaps

EDA as risk discovery

Exploratory data analysis, often shortened to EDA, is the review phase where teams inspect distributions, relationships, missing values, duplicates, class balance, outliers, and unexpected patterns. The AI Practitioner does not need to calculate every statistic by hand. The important skill is knowing what EDA is supposed to reveal and why a project should not skip it.

Missing values may be harmless, meaningful, or damaging. A blank income field may mean the customer did not provide it, the system failed to collect it, or the value is not applicable. An outlier may be a true high-value transaction, a sensor failure, or a data entry error. Class imbalance matters when the event being predicted is rare, such as fraud, equipment failure, or safety incidents. A naive model can look accurate by predicting the majority class and still be useless for the business.
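The class imbalance point can be made concrete with a toy calculation. This sketch uses made-up numbers (1,000 transactions, 1 percent fraud) and a "model" that always predicts the majority class:

```python
# Toy illustration: why accuracy misleads on imbalanced data.
# 1,000 historical transactions, of which 1% are fraud (label 1).
labels = [1] * 10 + [0] * 990

# A naive "model" that always predicts the majority class (not fraud).
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
fraud_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / sum(labels)

print(f"accuracy: {accuracy:.1%}")      # 99.0% -- looks impressive
print(f"fraud recall: {fraud_recall:.1%}")  # 0.0% -- catches no fraud at all
```

The model scores 99 percent accuracy while catching zero fraud cases, which is exactly the "accurate but useless" failure the text describes.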

EDA also helps reveal bias and fairness concerns. If historical approval data reflects biased human decisions, a model trained on that history can reproduce the pattern. If a dataset underrepresents certain customer groups, languages, locations, or device types, the model may perform unevenly. A practitioner should ask how the team evaluated performance across important segments, not just across the whole dataset.

Preprocessing and feature concepts

Preprocessing converts raw data into a form that a model or managed service can use. Common steps include cleaning invalid records, handling missing values, normalizing numeric ranges, converting categories into model-friendly values, splitting text into tokens, resizing images, and separating training and validation data. Feature engineering means creating useful input signals, such as days since last order, number of failed logins in the past hour, average basket size, or a flag for weekend transactions.
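The feature examples above can be sketched in a few lines. This is a conceptual illustration only; the field names, timestamps, and values are invented, and real pipelines would compute these from governed data sources:

```python
from datetime import datetime, timedelta

# Hypothetical raw events for one customer (all values illustrative).
now = datetime(2026, 5, 1, 12, 0)
orders = [datetime(2026, 4, 3), datetime(2026, 4, 28)]
login_failures = [now - timedelta(minutes=m) for m in (5, 40, 90, 200)]
basket_totals = [42.50, 18.00, 27.25]

# Engineered features mirroring the examples in the text.
features = {
    "days_since_last_order": (now - max(orders)).days,
    "failed_logins_past_hour": sum(t > now - timedelta(hours=1) for t in login_failures),
    "avg_basket_size": sum(basket_totals) / len(basket_totals),
    "is_weekend": now.weekday() >= 5,  # Saturday=5, Sunday=6
}
print(features)
```

Each feature is a simple aggregation or flag derived from raw records, which is the essence of feature engineering regardless of the tooling used.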

These tasks are out of scope as implementation work for the target candidate, but the concepts are in scope. If a model will be used in real time, a feature must be available at real-time inference. A feature based on tomorrow's cancellation status would be leakage. A feature that uses protected personal information may create compliance or fairness concerns. A feature that is expensive to compute may make the solution too slow or costly for production.
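The deployment-time availability check can be expressed as simple set logic. This is a hedged sketch, not a real API; the feature names are hypothetical, and `next_month_cancellation` stands in for the "tomorrow's cancellation status" leakage example:

```python
# Sketch: compare candidate training features against what is actually
# known at real-time inference. All names are illustrative.
available_at_inference = {
    "days_since_last_order",
    "failed_logins_past_hour",
    "account_age_days",
}
candidate_features = {
    "days_since_last_order",
    "failed_logins_past_hour",
    "account_age_days",
    "next_month_cancellation",  # future outcome -> leakage
}

leaky = candidate_features - available_at_inference
if leaky:
    print(f"leakage risk, remove before training: {sorted(leaky)}")
```

The practitioner-level insight is the question itself: for every feature, would this value exist at the moment the model is asked to predict?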

Checklist for a practitioner review:

  • Confirm the business owner can name the prediction or classification target.
  • Confirm the data owner approved the use of each source.
  • Ask whether sensitive fields, personally identifiable information, and regulated data are needed or can be minimized.
  • Ask how missing values, outliers, duplicates, and class imbalance were handled.
  • Ask whether the train, validation, and test splits reflect the deployment situation.
  • Ask whether any feature would be unavailable when the model is actually used.
  • Ask whether segment-level performance was reviewed for fairness and quality.
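The last checklist item, segment-level review, amounts to grouping evaluation results before averaging. This toy sketch uses fabricated per-segment outcomes to show how an overall number can hide a weak segment:

```python
from collections import defaultdict

# Hypothetical evaluation records: (segment, prediction was correct?).
results = [
    ("mobile", True), ("mobile", True), ("mobile", False), ("mobile", True),
    ("desktop", True), ("desktop", True), ("desktop", True), ("desktop", True),
    ("kiosk", False), ("kiosk", True), ("kiosk", False), ("kiosk", False),
]

by_segment = defaultdict(list)
for segment, correct in results:
    by_segment[segment].append(correct)

segment_accuracy = {
    segment: sum(outcomes) / len(outcomes)
    for segment, outcomes in by_segment.items()
}
overall = sum(c for _, c in results) / len(results)
print(f"overall: {overall:.0%}, by segment: {segment_accuracy}")
```

Here the overall figure looks acceptable while one segment performs far worse, which is the unevenness the EDA and fairness discussion warns about.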

AWS tooling can support these steps without requiring the practitioner to become a builder. SageMaker Canvas can help business analysts explore data and create no-code ML models for some use cases. SageMaker Data Wrangler can help prepare and visualize data in a more technical workflow. SageMaker Ground Truth can support data labeling. The service choice should match team skill, governance needs, and the risk of the decision.

The key judgment is simple: poor data readiness should stop or narrow an AI project. More model complexity will not rescue undefined outcomes, unapproved data, biased labels, missing deployment-time signals, or a weak cost-benefit case.

Test Your Knowledge

A business team wants an ML model to improve customer happiness but cannot define a measurable target or source labels. What is the best practitioner response?

Test Your Knowledge

During EDA, the team finds that only 1 percent of historical transactions are fraud cases. Which risk should the practitioner recognize?

Test Your Knowledge

Which example is most likely to be data leakage?
