2.1 Core Machine Learning Concepts

Key Takeaways

  • Machine learning is a subset of AI where models learn patterns from data — the model improves with more data, not more explicit programming.
  • Features are the input variables used for prediction (e.g., square footage, number of bedrooms); labels are the output values the model predicts (e.g., house price).
  • Training data teaches the model patterns; validation data tunes the model; test data evaluates final performance on unseen data.
  • Supervised learning uses labeled data (features + known labels); unsupervised learning discovers patterns in unlabeled data.
  • The machine learning workflow follows: collect data → prepare data → train model → evaluate model → deploy model → monitor model.
Last updated: March 2026

Quick Answer: Machine learning trains models on data to learn patterns and make predictions without explicit programming. Key concepts include features (inputs), labels (outputs), training data (for learning), validation data (for tuning), and test data (for evaluation). Supervised learning uses labeled data; unsupervised learning finds patterns in unlabeled data.

What Is Machine Learning?

Machine learning (ML) is a subset of artificial intelligence where computer systems learn from data and improve their performance over time without being explicitly programmed for every possible scenario. Instead of writing rules like "if temperature > 100 then alert", you provide examples of normal and abnormal temperatures and let the model learn the boundary.

Traditional Programming vs. Machine Learning

Approach                | Input                   | Process               | Output
Traditional programming | Data + Rules            | Execute rules on data | Results
Machine learning        | Data + Expected Results | Learn rules from data | Model (rules)

In traditional programming, you write the rules. In machine learning, the algorithm discovers the rules by analyzing patterns in the data.
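As a rough illustration of the contrast above, the sketch below puts a hand-written temperature rule next to a threshold learned from labeled examples. The example readings and the helper names (`rule_based_alert`, `learn_threshold`, `learned_alert`) are invented for this sketch, and the learning step is a deliberately simple midpoint search rather than a production algorithm.

```python
# Traditional programming: a person writes the rule.
def rule_based_alert(temperature):
    return temperature > 100  # threshold chosen by the programmer


# Machine learning: the threshold is learned from labeled examples.
# Each example is (temperature, is_abnormal); the values are invented.
examples = [(72, False), (85, False), (93, False), (98, False),
            (104, True), (110, True), (121, True)]


def learn_threshold(examples):
    """Return the candidate threshold that misclassifies the fewest examples."""
    temps = sorted(t for t, _ in examples)
    # Candidate boundaries: midpoints between neighbouring temperatures.
    candidates = [(a + b) / 2 for a, b in zip(temps, temps[1:])]

    def errors(threshold):
        return sum((t > threshold) != abnormal for t, abnormal in examples)

    return min(candidates, key=errors)


threshold = learn_threshold(examples)


def learned_alert(temperature):
    return temperature > threshold


print(threshold)  # 101.0 for the examples above
```

With different examples the learned threshold moves, while the hand-written rule stays at 100 until someone edits the code.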

Features and Labels

Understanding features and labels is essential for the AI-900 exam:

Features (also called attributes or input variables) are the characteristics of the data that the model uses to make predictions. Think of features as the "questions" the model considers.

Labels (also called target variables or output variables) are the values the model tries to predict. Think of labels as the "answers" the model produces.

Scenario                  | Features (Inputs)                           | Label (Output)
House price prediction    | Square footage, bedrooms, location, age     | Price ($)
Email spam detection      | Subject line, sender, body text, links      | Spam / Not Spam
Customer churn prediction | Account age, usage, complaints, payments    | Churn / Stay
Medical diagnosis         | Symptoms, test results, age, medical history | Disease / No Disease
Temperature forecasting   | Date, location, humidity, wind speed        | Temperature (°F)

On the Exam: When a question describes a dataset with input columns and an output column, the input columns are features and the output column is the label. If the question says "predict the price based on size, location, and age" — size, location, and age are features; price is the label.
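To make the column view concrete, here is a minimal sketch that separates a small table of records into features and a label. The housing values and the helper `split_features_and_label` are made up for illustration; `X` and `y` are simply the conventional names for the feature rows and the label column.

```python
# Toy housing records; every value is invented for illustration.
rows = [
    {"sqft": 1400, "bedrooms": 3, "age": 20, "price": 240000},
    {"sqft": 2000, "bedrooms": 4, "age": 5, "price": 410000},
    {"sqft": 900, "bedrooms": 2, "age": 35, "price": 150000},
]

LABEL = "price"  # the column the model should predict


def split_features_and_label(rows, label):
    """Separate each row into its feature values (X) and its label (y)."""
    feature_names = [name for name in rows[0] if name != label]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[label] for row in rows]
    return feature_names, X, y


feature_names, X, y = split_features_and_label(rows, LABEL)
print(feature_names)  # ['sqft', 'bedrooms', 'age']
print(X[0], y[0])     # [1400, 3, 20] 240000
```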

Training, Validation, and Test Data

A machine learning project typically splits its data into three subsets:

Training Data (typically 60-80% of all data)

  • Used to teach the model patterns
  • The model adjusts its internal parameters based on this data
  • Larger training sets generally produce better models

Validation Data (typically 10-20% of all data)

  • Used to tune the model during training
  • Helps detect and limit overfitting by evaluating the model on data it was not trained on
  • Used to select the best model configuration (hyperparameters)

Test Data (typically 10-20% of all data)

  • Used to evaluate final model performance on completely unseen data
  • Provides an unbiased estimate of how the model will perform in production
  • Only used AFTER the model is fully trained and tuned

Total Dataset
├── Training Data (70%)   → Model learns patterns here
├── Validation Data (15%) → Model is tuned here
└── Test Data (15%)       → Final evaluation here
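A 70/15/15 split like the one in the diagram can be sketched in a few lines of standard-library Python. The function name `split_dataset`, its default fractions, and the seed value are assumptions made for this example; ML libraries ship their own splitting utilities.

```python
import random


def split_dataset(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle rows and cut them into training, validation, and test sets."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test


rows = list(range(100))  # stand-in for 100 data records
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before cutting matters when the records arrive in a meaningful order (for example, sorted by date or price), because otherwise the three subsets would not resemble each other.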

On the Exam: The key distinction: training data teaches the model, validation data tunes it, and test data provides the final unbiased evaluation. A question might ask which dataset is used to prevent overfitting (validation) or which provides an unbiased performance estimate (test).
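The three roles can be seen in a small sketch: a k-nearest-neighbour classifier "learns" from the training set, the validation set picks the hyperparameter `k`, and the test set is scored once at the end. All data points are invented (a single numeric feature with a 0/1 label, including two deliberately mislabeled training points), and the helper names `knn_predict` and `accuracy` are hypothetical.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_1 = sum(label for _, label in nearest)
    return 1 if votes_for_1 * 2 > k else 0


def accuracy(train, data, k):
    """Fraction of (x, label) pairs in data that the model predicts correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in data)
    return correct / len(data)


# Training data: label is 1 above roughly 5.5; (4, 1) and (8, 0) are noisy points.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
validation = [(3.8, 0), (4.3, 0), (7.7, 1), (8.2, 1), (2.1, 0), (9.4, 1)]
test = [(1.2, 0), (2.7, 0), (5.4, 0), (6.6, 1), (9.8, 1)]

# Tune: keep the k with the best validation accuracy (first one wins a tie).
candidates = [1, 3, 5]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Final check: the test set is used once, after tuning is finished.
print(best_k)                         # 3
print(accuracy(train, test, best_k))  # 0.8
```

With `k=1` the model copies the two noisy training points and scores poorly on validation; the larger values of `k` smooth them out, which is the sense in which validation data helps guard against overfitting.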

Overfitting and Underfitting

Problem      | Description                                                         | Symptom                                        | Solution
Overfitting  | Model memorizes training data instead of learning general patterns | Great on training data, poor on new data       | More data, simpler model, regularization
Underfitting | Model is too simple to capture patterns in the data                | Poor on both training and new data             | More complex model, more features, more training
Good fit     | Model learns general patterns that apply to new data               | Good performance on both training and new data | The goal of ML
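The symptoms in the table can be reproduced on a toy dataset. In the sketch below, a constant predictor stands in for an underfit model, a least-squares line for a good fit, and a nearest-point "memorizer" for an overfit model. The data (roughly y = x with hand-added noise of +3 or -3) and all function names are invented for illustration.

```python
# Toy data following roughly y = x, with noise of +3 or -3 added by hand.
train = [(0, 3), (3, 0), (6, 9), (9, 6), (12, 15), (15, 12), (18, 21), (21, 18)]
new = [(1, -2), (4, 7), (7, 10), (10, 7), (13, 10), (16, 19), (19, 16)]


def mse(model, data):
    """Mean squared error of a model (a function of x) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_constant(data):
    """Too simple: always predict the average training label."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_line(data):
    """Least-squares straight line through the training points."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """Too flexible: return the label of the closest training point, noise included."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


models = {"constant": fit_constant(train),
          "line": fit_line(train),
          "memorizer": fit_memorizer(train)}

for name, model in models.items():
    print(f"{name:10s} train={mse(model, train):6.2f} new={mse(model, new):6.2f}")
# Approximate output:
# constant   train= 47.25 new= 40.54   (poor on both: underfitting)
# line       train=  8.57 new=  8.88   (similar on both: good fit)
# memorizer  train=  0.00 new= 25.00   (perfect on training, poor on new: overfitting)
```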

Supervised vs. Unsupervised Learning

Supervised Learning

The model learns from labeled data — data where both features and labels are known. The "supervision" comes from the labeled examples that guide the learning process.

Types of supervised learning:

  • Regression — predict continuous numerical values (price, temperature, quantity)
  • Classification — predict discrete categories (spam/not spam, disease/no disease)
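The difference between the two types is the kind of label being predicted, which the following sketch tries to show by reusing one nearest-neighbour lookup for both. The house records, the price bands, and the function names are assumptions invented for this example.

```python
from collections import Counter

# Toy labeled data: (square footage, price in dollars, price band).
houses = [
    (900, 150000, "budget"),
    (1100, 180000, "budget"),
    (1500, 250000, "mid"),
    (1700, 280000, "mid"),
    (2600, 450000, "premium"),
    (3000, 520000, "premium"),
]


def nearest_houses(sqft, k=2):
    """Return the k labeled houses closest in square footage."""
    return sorted(houses, key=lambda h: abs(h[0] - sqft))[:k]


def predict_price(sqft, k=2):
    """Regression: the label is a number, so average the neighbours' prices."""
    neighbours = nearest_houses(sqft, k)
    return sum(price for _, price, _ in neighbours) / len(neighbours)


def predict_band(sqft, k=2):
    """Classification: the label is a category, so take the most common band."""
    neighbours = nearest_houses(sqft, k)
    bands = Counter(band for _, _, band in neighbours)
    return bands.most_common(1)[0][0]


print(predict_price(1600))  # 265000.0
print(predict_band(1600))   # mid
```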

Unsupervised Learning

The model discovers patterns in unlabeled data — data where only features are known, with no predefined labels. The model finds natural groupings or structures in the data.

Types of unsupervised learning:

  • Clustering — group similar items together (customer segments, document topics)
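Clustering can be sketched with a bare-bones one-dimensional k-means loop. Note that no labels appear anywhere in the data. The monthly spend figures, the function name `kmeans_1d`, and the evenly spaced starting centroids are assumptions chosen to keep the example deterministic; it requires `k` of at least 2.

```python
def kmeans_1d(values, k=2, max_iters=100):
    """Group numbers into k clusters by repeatedly assigning and re-averaging."""
    # Deterministic start: spread the initial centroids evenly from min to max.
    lo, hi = min(values), max(values)
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # nothing moved, so the grouping is stable
            break
        centroids = new_centroids
    return clusters, centroids


# Monthly spend per customer (invented values, no labels).
spend = [12, 15, 18, 20, 95, 102, 110]
clusters, centroids = kmeans_1d(spend, k=2)
print(clusters)  # [[12, 15, 18, 20], [95, 102, 110]]
```

The algorithm is never told which customers are "low" or "high" spenders; the two segments emerge from the distances between the values.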

Comparison

Aspect     | Supervised Learning                 | Unsupervised Learning
Data       | Labeled (features + labels)         | Unlabeled (features only)
Goal       | Predict a known output (the label)  | Discover hidden patterns
Types      | Regression, Classification          | Clustering
Example    | Predict house price from features   | Group customers by behavior
Evaluation | Compare predictions to known labels | Assess cluster quality and separation

The Machine Learning Workflow

The end-to-end ML process follows these steps:

  1. Define the problem: What business question are you trying to answer?
  2. Collect data: Gather relevant data from databases, APIs, and files
  3. Prepare data: Clean, transform, handle missing values, and select features
  4. Split data: Divide into training, validation, and test sets
  5. Choose an algorithm: Select the appropriate ML algorithm for the task
  6. Train the model: Feed training data to the algorithm
  7. Evaluate the model: Measure performance on the validation data
  8. Tune the model: Adjust hyperparameters to improve validation performance, then confirm the final model on the test data
  9. Deploy the model: Publish as an endpoint for applications to consume
  10. Monitor the model: Track performance and retrain when accuracy degrades
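The steps above can be walked through end to end on a toy problem. This is a simplified sketch under several assumptions: the records are invented, the algorithm (a least-squares line) has no hyperparameters, so the validation and tuning steps are skipped, "deployment" is just a Python function rather than an endpoint, and the retraining threshold of 25 is arbitrary.

```python
# 1. Define the problem: predict price (in $1000s) from size (in 100 sq ft).

# 2. Collect: toy records; None marks a missing value.
raw = [(8, 170), (10, 210), (None, 250), (12, 250), (14, 290), (16, None),
       (18, 370), (20, 410), (22, 450), (24, 490)]

# 3. Prepare: drop rows with missing values.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# 4. Split: hold out the last rows for testing (real projects shuffle first).
train, test = clean[:6], clean[6:]


# 5-6. Choose an algorithm and train: fit a least-squares line.
def fit_line(data):
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


model = fit_line(train)


# 7. Evaluate: mean absolute error on held-out data.
def mean_absolute_error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)


test_mae = mean_absolute_error(model, test)


# 9. Deploy: expose the trained model behind a simple function.
def predict_price(size):
    return model(size)


# 10. Monitor: flag the model for retraining when recent error grows too large.
def needs_retraining(model, recent_data, max_mae=25):
    return mean_absolute_error(model, recent_data) > max_mae


drifted = [(10, 260), (20, 510)]         # invented "later" data with higher prices
print(round(predict_price(15)))          # 310
print(needs_retraining(model, test))     # False
print(needs_retraining(model, drifted))  # True
```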
Test Your Knowledge

In a machine learning model that predicts house prices based on square footage, number of bedrooms, and location, what are the "features"?

Which dataset is used to provide an unbiased evaluation of a fully trained machine learning model?

A machine learning model performs very well on training data but poorly on new, unseen data. What problem is this called?

Which type of learning uses labeled data where both features and outcomes are known?

Test Your Knowledge: Matching

Match each machine learning concept to its definition:

  1. Features
  2. Labels
  3. Training data
  4. Validation data
  5. Test data