2.3 Data Types, Labels, Structure, and Quality
Key Takeaways
- Data type and structure strongly influence which AWS AI service or ML path is practical.
- Labels must be accurate, current, and aligned with the decision the business wants to improve.
- Data quality issues such as missing values, skew, duplicates, stale records, and leakage can make a model misleading.
- Governance starts before modeling with access control, privacy review, retention rules, and clear data ownership.
Data Shape Drives the Architecture
Structured data is organized into predictable fields, such as rows and columns in a database, CSV files in Amazon S3, or tables in a warehouse. Tabular fields might include customer age, region, product type, order amount, and outcome. This data often supports classification, regression, forecasting, reporting, and segmentation. It is usually easier to validate than free-form text, but it can still contain bias, missing values, and stale definitions.
Unstructured data does not arrive in neat columns. Documents, emails, chats, images, audio files, videos, and PDFs require extraction or transformation before many analytics and ML tasks. AWS managed AI services can help: Amazon Textract for documents, Amazon Transcribe for speech, Amazon Rekognition for image and video analysis, Amazon Comprehend for text insights, and Amazon Translate for language translation.
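As a sketch of how these managed services are typically called, the snippet below sends a short piece of text to Amazon Comprehend through boto3. The region and the sample text are placeholder assumptions; substitute your own.

```python
import boto3

# Assumed region; use the region where your workload runs.
comprehend = boto3.client("comprehend", region_name="us-east-1")

# Hypothetical customer feedback text.
response = comprehend.detect_sentiment(
    Text="The replacement part arrived late and the packaging was damaged.",
    LanguageCode="en",
)

# The response includes an overall label plus per-class confidence scores.
print(response["Sentiment"])       # e.g. NEGATIVE
print(response["SentimentScore"])  # Positive / Negative / Neutral / Mixed scores
```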
Time-series data is ordered by time and is common in demand planning, sensor monitoring, capacity planning, and financial activity. The time order matters. Randomly mixing past and future records can create leakage, where the model appears accurate because it saw information that would not exist at prediction time. A practitioner should ask how training, validation, and reporting preserve the timeline.
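A minimal way to respect the timeline is to split on a cutoff date rather than randomly. The sketch below assumes a pandas DataFrame with an `order_date` column; the file name and cutoff are illustrative.

```python
import pandas as pd

# Hypothetical tabular data with a timestamp column.
df = pd.read_csv("orders.csv", parse_dates=["order_date"])
df = df.sort_values("order_date")

# Train on everything before the cutoff and validate on everything after,
# so the model never sees records from "the future" during training.
cutoff = pd.Timestamp("2024-01-01")
train = df[df["order_date"] < cutoff]
valid = df[df["order_date"] >= cutoff]
```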
| Data type | Example | Common first AWS consideration | Key question |
|---|---|---|---|
| Tabular | Orders, claims, tickets | S3, Glue, Redshift, SageMaker Canvas, SageMaker AI | Are fields consistent and labels trustworthy? |
| Text | Reviews, chats, knowledge articles | Comprehend, Bedrock, Amazon Q, OpenSearch Service | Is the text sensitive, current, and grounded? |
| Image or video | Product photos, safety footage | Rekognition or custom vision in SageMaker AI | Are labels and consent clear? |
| Document | Forms, invoices, contracts | Textract, A2I for review, S3 | What must be extracted and verified by humans? |
| Audio | Calls, dictation | Transcribe, Comprehend after transcript | What accuracy is required and what language is used? |
| Time series | Demand, usage, telemetry | Analytics plus SageMaker options | Does validation respect time order? |
Labels are the answers used for supervised learning. A label can be a class, a number, a score, or a final decision. Label quality is a business problem before it is a modeling problem. If agents disagree about ticket categories, a model trained on those categories will reproduce the confusion. If historical loan approvals reflect a policy that has since changed, old labels may no longer match the approved process.
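One inexpensive way to surface label disagreement before training is to measure how often two annotators assign the same category. The sketch below uses scikit-learn's Cohen's kappa on two hypothetical annotation columns.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical: the same eight tickets categorized by two different agents.
agent_a = ["billing", "billing", "outage", "refund", "outage", "billing", "refund", "outage"]
agent_b = ["billing", "refund", "outage", "refund", "billing", "billing", "refund", "outage"]

# Kappa near 1.0 means strong agreement; near 0 means agreement is little
# better than chance, and a model trained on these labels will reproduce
# that confusion.
print(cohen_kappa_score(agent_a, agent_b))
```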
Data structure is also about meaning. A field named status might mean open, closed, returned, charged off, or resolved depending on the system. A region field might be a sales territory in one table and a data residency area in another. AWS Glue can help catalog and prepare data, and Lake Formation can help govern access, but owners still need shared definitions. The glossary matters.
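If the tables are cataloged in AWS Glue, column names, types, and documented meanings can be pulled programmatically and compared against the shared glossary. The database and table names below are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder database and table names.
table = glue.get_table(DatabaseName="sales", Name="orders")["Table"]

# Print each cataloged column with its type and any documented meaning,
# so owners can check the definitions against the business glossary.
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"], col.get("Comment", "(no description)"))
```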
Data quality checks should happen before service selection. Missing values can hide process failures. Duplicates can overweight certain customers. Outliers can represent errors or important edge cases. Skewed data can underrepresent minority groups, rare fraud, low-volume regions, or seasonal events. Stale records can teach a model a business reality that no longer exists. Sensitive fields can create privacy and compliance exposure if copied into prompts or training sets.
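A first-pass check for several of these issues takes only a few lines of pandas. The file and column names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical dataset with a label column and a last-updated timestamp.
df = pd.read_csv("claims.csv", parse_dates=["updated_at"])

# Missing values per column can reveal broken upstream processes.
print(df.isna().mean().sort_values(ascending=False))

# Exact duplicates can overweight some customers in training.
print("duplicate rows:", df.duplicated().sum())

# Class skew: a rare positive class changes metric and sampling choices.
print(df["label"].value_counts(normalize=True))

# Staleness: how old is the most recent record?
print("newest record:", df["updated_at"].max())
```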
Use this data readiness checklist:
- The business owner can explain each important field and label.
- The dataset covers the populations, channels, regions, and time periods where the model will be used.
- Sensitive data is identified, classified, and protected with IAM, encryption, and retention rules (see the detection sketch after this list).
- The team can separate training data from evaluation data without leakage.
- The expected action from a prediction is legal, ethical, and explainable enough for the use case.
- The team knows who monitors quality drift after launch.
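For the sensitive-data item above, one option is Amazon Comprehend's PII detection, which flags spans in free text that may need masking before storage or prompting. The input text below is a placeholder.

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Placeholder text that mixes business content with PII.
text = "Customer Jane Roe at jane.roe@example.com reported a billing issue."

response = comprehend.detect_pii_entities(Text=text, LanguageCode="en")

# Each entity carries a type (NAME, EMAIL, ...), a confidence score, and
# character offsets that can drive masking before storage or prompts.
for entity in response["Entities"]:
    print(entity["Type"], entity["Score"], text[entity["BeginOffset"]:entity["EndOffset"]])
```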
Service implications follow from these checks. If documents are the main input, starting with Amazon Textract plus human review for low-confidence fields is often more practical than building a custom model. If customer support text is the input, Amazon Comprehend, Amazon Q, or a Bedrock workflow may fit depending on whether the goal is extraction, search, summarization, or guided assistance. If the prediction is company-specific, SageMaker Canvas or SageMaker AI may be appropriate.
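A sketch of that document path: call Amazon Textract, then collect any extracted line whose confidence falls below a review threshold. The bucket, key, and the 90.0 threshold are assumptions, and in practice Amazon A2I can manage the human-review loop itself.

```python
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Placeholder S3 location of a scanned form.
response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-forms-bucket", "Name": "invoice-001.png"}}
)

# Route low-confidence lines to human review instead of trusting them;
# the threshold is an illustrative assumption, not a recommendation.
needs_review = [
    block["Text"]
    for block in response["Blocks"]
    if block["BlockType"] == "LINE" and block["Confidence"] < 90.0
]
print(needs_review)
```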
A foundation model workflow has additional data concerns. Prompts can contain sensitive information, retrieved context can be stale, and generated output can sound confident even when source content is weak. Retrieval-augmented generation (RAG) can improve grounding, but only if the knowledge source is curated, access-controlled, refreshed, and logged. A practitioner should ask whether the model can access only the content the user is allowed to see.
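As an illustration of grounding, the sketch below queries an Amazon Bedrock knowledge base and surfaces each retrieved chunk with its source so an answer can be checked against approved content. The knowledge base ID and question are placeholders, and per-user access filtering still has to be enforced by the application and IAM.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholder knowledge base ID and user question.
response = agent_runtime.retrieve(
    knowledgeBaseId="KB12345678",
    retrievalQuery={"text": "What is the remote work reimbursement policy?"},
)

# Show each retrieved passage with its relevance score and source location,
# so reviewers can confirm the answer is grounded in approved content.
for result in response["retrievalResults"]:
    print(result.get("score"), result["location"], result["content"]["text"][:120])
```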
Scenario: a hospital administration group wants to summarize internal policy documents. The data is text, not tabular, and privacy controls matter. A Bedrock or Amazon Q style workflow may help, but the project needs approved sources, IAM access boundaries, logging decisions, and a human review process for policy-sensitive answers. If the user needs exact policy citations, retrieval and source display are more important than creative generation.
Scenario: a manufacturer wants to detect defects from product images. The team needs representative images of good and defective products, including lighting, camera angle, product line, and rare defect types. Amazon Rekognition can cover common image analysis tasks, while a custom model path may be needed for specialized defects. The practitioner should ask how false accepts and false rejects affect safety, cost, and customer trust.
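If the team trains a Rekognition Custom Labels model for specialized defects, inference looks roughly like the sketch below. The project version ARN, bucket, and `MinConfidence` value are placeholder assumptions, and the threshold is exactly where the false-accept versus false-reject tradeoff is set.

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Placeholder model ARN and image location.
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:111122223333:project/defects/version/v1/1234567890",
    Image={"S3Object": {"Bucket": "line-camera-images", "Name": "unit-4711.jpg"}},
    MinConfidence=80.0,  # raising this trades missed defects for fewer false alarms
)

for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])
```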
Review Questions
- A model appears highly accurate because the training data includes a field that is only known after the business decision is complete. What problem is most likely present?
- A company wants to extract fields from scanned forms and route uncertain values to employees for review. Which service pairing is most relevant?
- Which question should a practitioner ask before using historical labels to train a supervised model?