3.1 Data Collection, EDA, Preprocessing, and Feature Concepts
Key Takeaways
- Good ML work starts with a clear business question, a usable target outcome, and data that is legally and operationally appropriate for the use case.
- EDA helps teams find missing values, skew, leakage, class imbalance, outliers, bias signals, and data quality issues before training begins.
- Preprocessing and feature concepts are in scope at practitioner-awareness level: know what they do, why they matter, and which risks they introduce.
- AWS services such as Amazon S3, AWS Glue, Lake Formation, SageMaker Canvas, and SageMaker Data Wrangler support governed data preparation without requiring the practitioner to become a data engineer.
From business question to usable data
For the AWS Certified AI Practitioner scope, data preparation is not about writing feature engineering code or building a data pipeline from scratch. It is about recognizing what must be true before a team can responsibly train, buy, or use an AI/ML solution. Start with the business question. A model for fraud review, churn prediction, document classification, demand forecasting, or image quality control needs a measurable outcome, a feedback path, and data that reflects the decision being supported.
A practitioner should ask whether the target is observable. If a team wants to predict customer dissatisfaction, does it have labeled examples such as survey scores, support escalations, cancellation events, or complaint categories? If it wants to detect defective parts from images, are defect labels consistent across inspectors? If it wants to forecast demand, are stockouts, promotions, holidays, and returns recorded in a way that explains historical sales? A vague goal such as "improve operations" is not enough.
Data can be structured, semi-structured, or unstructured. Structured data often lives in tables, such as orders, claims, customers, transactions, sensor readings, or account balances. Semi-structured data, such as JSON logs or clickstream events, carries a flexible schema. Unstructured data includes text, images, audio, video, and scanned documents. AWS services help store and catalog these sources. Amazon S3 is commonly used as durable object storage for datasets. AWS Glue can catalog and prepare data. Lake Formation can help govern access to data lakes.
Amazon Textract, Transcribe, Comprehend, and Rekognition can extract or analyze signals from documents, speech, text, and images when a managed AI service fits.
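As a concrete illustration, the short sketch below lists dataset objects stored in S3 using the boto3 SDK. The bucket name and prefix are hypothetical placeholders; the call itself, list_objects_v2, is a standard S3 API.

```python
# Minimal sketch: inspecting dataset objects in Amazon S3 with boto3.
# The bucket name and prefix are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="example-ml-datasets", Prefix="claims/raw/")

# Print each object's key and size to confirm the dataset is where
# the team expects it to be and roughly how large it is.
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```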
| Data readiness question | Why it matters | Practitioner signal |
|---|---|---|
| Is the outcome defined? | Training needs a target or objective | Business owner can explain success and failure cases |
| Is the data permitted for this use? | Privacy, consent, and policy can block use | Data owner and security team approved the scope |
| Are labels reliable? | Bad labels teach the wrong pattern | Labeling rules are documented and reviewed |
| Is the sample representative? | Models fail on groups not reflected in data | Data covers regions, seasons, channels, and user groups |
| Is leakage controlled? | A model may learn future or forbidden information | Features are checked against deployment-time availability |
| Is there enough history? | Sparse data limits learning and validation | Team can explain coverage and known gaps |
EDA as risk discovery
Exploratory data analysis, often shortened to EDA, is the review phase where teams inspect distributions, relationships, missing values, duplicates, class balance, outliers, and unexpected patterns. The AI Practitioner does not need to calculate every statistic by hand. The important skill is knowing what EDA is supposed to reveal and why a project should not skip it.
Missing values may be harmless, meaningful, or damaging. A blank income field may mean the customer did not provide it, the system failed to collect it, or the value is not applicable. An outlier may be a true high-value transaction, a sensor failure, or a data entry error. Class imbalance matters when the event being predicted is rare, such as fraud, equipment failure, or safety incidents. A naive model can look accurate by predicting the majority class and still be useless for the business.
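The sketch below shows how a few of these EDA checks might look in pandas. The file, column names (amount, is_fraud), and thresholds are hypothetical; the point is the kind of question each check answers.

```python
# Minimal EDA sketch with pandas; dataset and column names are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")

# Missing values: how many blanks per column?
print(df.isna().sum().sort_values(ascending=False))

# Class balance: a rare positive class (e.g., fraud) warns against
# trusting plain accuracy as a quality metric.
print(df["is_fraud"].value_counts(normalize=True))

# Outliers: a simple interquartile-range screen on a numeric column.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} rows flagged as potential outliers")

# Duplicates: exact repeats can inflate apparent data volume.
print(f"{df.duplicated().sum()} duplicate rows")
```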
EDA also helps reveal bias and fairness concerns. If historical approval data reflects biased human decisions, a model trained on that history can reproduce the pattern. If a dataset underrepresents certain customer groups, languages, locations, or device types, the model may perform unevenly. A practitioner should ask how the team evaluated performance across important segments, not just across the whole dataset.
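A minimal sketch of a segment-level review follows, assuming a hypothetical evaluation frame with a region column: compute quality per segment rather than only overall.

```python
# Sketch: comparing model quality across segments instead of only overall.
# The evaluation frame and its column names are hypothetical.
import pandas as pd

eval_df = pd.DataFrame({
    "region":    ["north", "north", "south", "south", "south"],
    "actual":    [1, 0, 1, 1, 0],
    "predicted": [1, 0, 0, 1, 1],
})

# Per-segment accuracy; uneven numbers here are the fairness signal
# the practitioner should ask about.
per_segment = (
    eval_df.assign(correct=eval_df["actual"] == eval_df["predicted"])
           .groupby("region")["correct"]
           .mean()
)
print(per_segment)
```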
Preprocessing and feature concepts
Preprocessing converts raw data into a form that a model or managed service can use. Common steps include cleaning invalid records, handling missing values, normalizing numeric ranges, converting categories into model-friendly values, splitting text into tokens, resizing images, and separating training and validation data. Feature engineering means creating useful input signals, such as days since last order, number of failed logins in the past hour, average basket size, or a flag for weekend transactions.
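A brief sketch of how some of these steps might look in pandas and scikit-learn follows. The file, column names, and snapshot date are hypothetical; the split-before-scale ordering is the part worth noticing.

```python
# Minimal preprocessing sketch with pandas and scikit-learn.
# The file, column names, and snapshot date are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Handle a missing numeric value: fill with the column median.
df["basket_value"] = df["basket_value"].fillna(df["basket_value"].median())

# Feature engineering: signals like those named above.
snapshot = pd.Timestamp("2024-01-01")  # hypothetical "as of" date
df["days_since_last_order"] = (snapshot - df["order_date"]).dt.days
df["is_weekend"] = (df["order_date"].dt.dayofweek >= 5).astype(int)

# Convert a category into model-friendly values (one-hot encoding).
df = pd.get_dummies(df, columns=["channel"])

# Separate training and validation data before fitting anything.
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, valid_df = train_df.copy(), valid_df.copy()

# Normalize numeric ranges using statistics from the training split only,
# so nothing about the validation data influences training.
num_cols = ["basket_value", "days_since_last_order"]
scaler = StandardScaler()
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])
valid_df[num_cols] = scaler.transform(valid_df[num_cols])
```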
These tasks are out of scope as implementation work for the target candidate, but the concepts are in scope. If a model will be used in real time, a feature must be available at real-time inference. A feature based on tomorrow's cancellation status would be leakage. A feature that uses protected personal information may create compliance or fairness concerns. A feature that is expensive to compute may make the solution too slow or costly for production.
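One lightweight way to reason about leakage is to record, for each feature, when its value becomes known, then drop anything that is not available at prediction time. The catalog below is a hypothetical, hand-maintained example of that check.

```python
# Sketch: a deployment-time availability check for features.
# The feature catalog is a hypothetical, hand-maintained mapping
# of each feature to when its value becomes known.
PREDICTION_TIME = "at_transaction"  # when the model must answer

feature_availability = {
    "days_since_last_order": "at_transaction",
    "failed_logins_past_hour": "at_transaction",
    "cancelled_next_month": "after_outcome",  # future info: leakage
    "chargeback_filed": "after_outcome",      # future info: leakage
}

# Any feature whose value arrives after the prediction moment must be
# dropped before training, or the model will learn from the future.
leaky = [f for f, when in feature_availability.items() if when != PREDICTION_TIME]
print("Features to drop before training:", leaky)
```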
Checklist for a practitioner review:
- Confirm the business owner can name the prediction or classification target.
- Confirm the data owner approved the use of each source.
- Ask whether sensitive fields, personally identifiable information, and regulated data are needed or can be minimized.
- Ask how missing values, outliers, duplicates, and class imbalance were handled.
- Ask whether the train, validation, and test splits reflect the deployment situation.
- Ask whether any feature would be unavailable when the model is actually used.
- Ask whether segment-level performance was reviewed for fairness and quality.
AWS tooling can support these steps without requiring the practitioner to become a builder. SageMaker Canvas can help business analysts explore data and create no-code ML models for some use cases. SageMaker Data Wrangler can help prepare and visualize data in a more technical workflow. SageMaker Ground Truth can support data labeling. The service choice should match team skill, governance needs, and the risk of the decision.
The key judgment is simple: poor data readiness should stop or narrow an AI project. More model complexity will not rescue undefined outcomes, unapproved data, biased labels, missing deployment-time signals, or a weak cost-benefit case.
Check your understanding
- A business team wants an ML model to improve customer happiness but cannot define a measurable target or source labels. What is the best practitioner response?
- During EDA, the team finds that only 1 percent of historical transactions are fraud cases. Which risk should the practitioner recognize?
- Which example is most likely to be data leakage?