2.1 Core Machine Learning Concepts
Key Takeaways
- Machine learning is a subset of AI where models learn patterns from data — the model improves with more data, not more explicit programming.
- Features are the input variables used for prediction (e.g., square footage, number of bedrooms); labels are the output values the model predicts (e.g., house price).
- Training data teaches the model patterns; validation data tunes the model; test data evaluates final performance on unseen data.
- Supervised learning uses labeled data (features + known labels); unsupervised learning discovers patterns in unlabeled data.
- The machine learning workflow follows: collect data → prepare data → train model → evaluate model → deploy model → monitor model.
Core Machine Learning Concepts
Quick Answer: Machine learning trains models on data to learn patterns and make predictions without explicit programming. Key concepts include features (inputs), labels (outputs), training data (for learning), validation data (for tuning), and test data (for evaluation). Supervised learning uses labeled data; unsupervised learning finds patterns in unlabeled data.
What Is Machine Learning?
Machine learning (ML) is a subset of artificial intelligence where computer systems learn from data and improve their performance over time without being explicitly programmed for every possible scenario. Instead of writing rules like "if temperature > 100 then alert", you provide examples of normal and abnormal temperatures and let the model learn the boundary.
Traditional Programming vs. Machine Learning
| Approach | Input | Process | Output |
|---|---|---|---|
| Traditional programming | Data + Rules | Execute rules on data | Results |
| Machine learning | Data + Expected Results | Learn rules from data | Model (rules) |
In traditional programming, you write the rules. In machine learning, the algorithm discovers the rules by analyzing patterns in the data.
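The contrast above can be sketched in a few lines of Python. This is a hypothetical toy, not a real ML algorithm: instead of hand-writing "if temperature > 100 then alert", we let the program pick the alert threshold that best separates labeled examples.

```python
# Labeled training examples: (temperature, is_abnormal)
readings = [(72, False), (85, False), (98, False),
            (101, True), (110, True), (120, True)]

def learn_threshold(examples):
    """Pick the candidate threshold that classifies the training data best."""
    best_threshold, best_correct = None, -1
    for candidate, _ in examples:
        # Count how many examples the rule "temp > candidate" gets right
        correct = sum((temp > candidate) == label for temp, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

threshold = learn_threshold(readings)
print(threshold)  # → 98: the boundary is learned from data, not hand-written
```

With more (or different) labeled readings, the learned boundary shifts automatically — no one has to rewrite the rule.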
Features and Labels
Understanding features and labels is essential for the AI-900 exam:

Features (also called attributes or input variables) are the characteristics of the data that the model uses to make predictions. Think of features as the "questions" the model considers.
Labels (also called target variables or output variables) are the values the model tries to predict. Think of labels as the "answers" the model produces.
| Scenario | Features (Inputs) | Label (Output) |
|---|---|---|
| House price prediction | Square footage, bedrooms, location, age | Price ($) |
| Email spam detection | Subject line, sender, body text, links | Spam / Not Spam |
| Customer churn prediction | Account age, usage, complaints, payments | Churn / Stay |
| Medical diagnosis | Symptoms, test results, age, medical history | Disease / No Disease |
| Temperature forecasting | Date, location, humidity, wind speed | Temperature (°F) |
On the Exam: When a question describes a dataset with input columns and an output column, the input columns are features and the output column is the label. If the question says "predict the price based on size, location, and age" — size, location, and age are features; price is the label.
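The feature/label split can be made concrete with a small sketch. The column names and values here are hypothetical; the point is that every column except the one being predicted is a feature.

```python
# Hypothetical house-price rows: input columns are features, "price" is the label
rows = [
    {"sqft": 1500, "bedrooms": 3, "age": 10, "price": 300000},
    {"sqft": 2200, "bedrooms": 4, "age": 5,  "price": 450000},
]

label_column = "price"

# Everything that is not the label column is a feature
features = [{k: v for k, v in row.items() if k != label_column} for row in rows]
labels = [row[label_column] for row in rows]

print(features[0])  # → {'sqft': 1500, 'bedrooms': 3, 'age': 10}
print(labels)       # → [300000, 450000]
```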
Training, Validation, and Test Data
Machine learning projects typically split the data into three subsets:
Training Data (typically 60-80% of all data)
- Used to teach the model patterns
- The model adjusts its internal parameters based on this data
- Larger training sets generally produce better models
Validation Data (typically 10-20% of all data)
- Used to tune the model during training
- Helps prevent overfitting by evaluating the model on data it has not trained on
- Used to select the best model configuration (hyperparameters)
Test Data (typically 10-20% of all data)
- Used to evaluate final model performance on completely unseen data
- Provides an unbiased estimate of how the model will perform in production
- Only used AFTER the model is fully trained and tuned
Total Dataset
├── Training Data (70%) → Model learns patterns here
├── Validation Data (15%) → Model is tuned here
└── Test Data (15%) → Final evaluation here
On the Exam: The key distinction is that training data teaches the model, validation data tunes it, and test data provides the final unbiased evaluation. A question might ask which dataset is used to prevent overfitting (validation) or which provides an unbiased performance estimate (test).
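The 70/15/15 split above can be sketched as a short function. The fractions and seed are illustrative choices, not fixed rules — real projects pick proportions to suit their data size.

```python
import random

def split_dataset(data, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle and split data into training, validation, and test subsets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # → 70 15 15
```

Shuffling before splitting matters: if the data is sorted (say, by date or price), an unshuffled split would give the model a biased view of the problem.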
Overfitting and Underfitting
| Problem | Description | Symptom | Solution |
|---|---|---|---|
| Overfitting | Model memorizes training data instead of learning general patterns | Great on training data, poor on new data | More data, simpler model, regularization |
| Underfitting | Model is too simple to capture patterns in the data | Poor on both training and new data | More complex model, more features, more training |
| Good fit | Model learns general patterns that apply to new data | Good performance on both training and new data | The goal of ML |
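Overfitting can be caricatured with a deliberately extreme toy: a "model" that memorizes its training pairs exactly versus one that learns a single general rule. The data (y = 2x) is made up for illustration.

```python
train_data = [(1, 2), (2, 4), (3, 6)]
test_data = [(4, 8), (5, 10)]

# "Overfit" model: pure memorization — perfect on training data only
memorized = dict(train_data)
overfit = lambda x: memorized.get(x)  # returns None for unseen inputs

# Simpler model: learn one slope from the training pairs
slope = sum(y / x for x, y in train_data) / len(train_data)  # 2.0
general = lambda x: slope * x

print([overfit(x) for x, _ in test_data])  # → [None, None]: memorization fails on new data
print([general(x) for x, _ in test_data])  # → [8.0, 10.0]: the general pattern transfers
```

Real overfitting is subtler than a lookup table, but the symptom in the table above is the same: excellent training performance that does not carry over to new data.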
Supervised vs. Unsupervised Learning
Supervised Learning
The model learns from labeled data — data where both features and labels are known. The "supervision" comes from the labeled examples that guide the learning process.
Types of supervised learning:
- Regression — predict continuous numerical values (price, temperature, quantity)
- Classification — predict discrete categories (spam/not spam, disease/no disease)
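The regression/classification distinction is about the label type, not the algorithm. As a hedged sketch, the same nearest-neighbor idea (a real but simplified technique — here 1-nearest-neighbor on a single made-up feature) serves both: regression returns a number, classification returns a category.

```python
def nearest(train, x):
    """Return the label of the training example whose feature is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Regression: the label is a continuous value (hypothetical price in $1000s, by sqft)
price_data = [(1000, 200), (1500, 300), (2000, 410)]
print(nearest(price_data, 1600))  # → 300, a numeric prediction

# Classification: the label is a discrete category (hypothetical link count → spam)
spam_data = [(0, "not spam"), (5, "spam"), (8, "spam")]
print(nearest(spam_data, 6))  # → 'spam', a category prediction
```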
Unsupervised Learning
The model discovers patterns in unlabeled data — data where only features are known, with no predefined labels. The model finds natural groupings or structures in the data.
Types of unsupervised learning:
- Clustering — group similar items together (customer segments, document topics)
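Clustering can be sketched with a minimal 1-D k-means loop (k-means is one common clustering algorithm, chosen here for brevity; the starting centers and spending values are made up). Note that no labels appear anywhere — the algorithm discovers the groups on its own.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then recenter."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            closest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[closest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled monthly spending amounts: two customer segments are hidden in the data
spend = [10, 12, 11, 95, 100, 98]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(sorted(round(c, 1) for c in centers))  # → [11.0, 97.7]: two segments found
```

Because there are no known labels, evaluation asks whether the clusters are compact and well separated — the "Evaluation" row in the comparison table below.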
Comparison
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data | Labeled (features + labels) | Unlabeled (features only) |
| Goal | Predict known output categories | Discover hidden patterns |
| Types | Regression, Classification | Clustering |
| Example | Predict house price from features | Group customers by behavior |
| Evaluation | Compare predictions to known labels | Assess cluster quality and separation |
The Machine Learning Workflow
The end-to-end ML process follows these steps:
1. Define the problem — What business question are you trying to answer?
2. Collect data — Gather relevant data from databases, APIs, files
3. Prepare data — Clean, transform, handle missing values, select features
4. Split data — Divide into training, validation, and test sets
5. Choose an algorithm — Select the appropriate ML algorithm for the task
6. Train the model — Feed training data to the algorithm
7. Evaluate the model — Test performance on validation and test data
8. Tune the model — Adjust hyperparameters to improve performance
9. Deploy the model — Publish as an endpoint for applications to consume
10. Monitor the model — Track performance and retrain when accuracy degrades
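The early steps of this workflow can be walked through end to end with a deliberately tiny model — a line through the origin fit by least squares. The data is synthetic (an assumption for illustration), and validation, tuning, deployment, and monitoring are noted but omitted for brevity.

```python
import random

# Define the problem and collect data: predict y from x (true pattern: y ≈ 2x plus noise)
data = [(x, 2 * x + random.Random(x).uniform(-1, 1)) for x in range(100)]

# Prepare and split the data: 80% training, 20% held-out test
random.Random(0).shuffle(data)
train, test = data[:80], data[80:]

# Choose an algorithm and train it: least-squares slope through the origin
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# Evaluate on the unseen test data: mean absolute error
mae = sum(abs(y - slope * x) for x, y in test) / len(test)
print(round(slope, 2))  # the learned slope should land close to 2

# In a real pipeline, tuning (on validation data), deployment as an endpoint,
# and ongoing monitoring/retraining would follow.
```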
In a machine learning model that predicts house prices based on square footage, number of bedrooms, and location, what are the "features"?
Which dataset is used to provide an unbiased evaluation of a fully trained machine learning model?
A machine learning model performs very well on training data but poorly on new, unseen data. What problem is this called?
Which type of learning uses labeled data where both features and outcomes are known?
Match each machine learning concept to its definition: