2.4 Clustering Models
Key Takeaways
- Clustering is an unsupervised learning technique that groups similar data points together WITHOUT predefined labels.
- K-Means clustering partitions data into K groups by minimizing the total squared distance between each data point and the center (centroid) of its assigned cluster.
- Unlike classification, clustering discovers natural groupings in data — the groups are not known in advance.
- Common use cases include customer segmentation, document grouping, anomaly detection, and market research.
- Clustering evaluation focuses on how well-separated and cohesive the clusters are, not on matching predefined labels.
Quick Answer: Clustering is an unsupervised learning technique that groups similar data points together without predefined labels. K-Means is the most common clustering algorithm. Unlike classification (which assigns to known categories), clustering discovers new groups in the data.
What Is Clustering?
Clustering is an unsupervised machine learning technique that identifies natural groups (clusters) in data. The model has NO labeled data — it discovers patterns and groupings on its own.
Key Characteristics of Clustering
- No labels required — only features (inputs), no predefined categories
- Discovers natural groups — the algorithm finds the groupings
- Number of groups may be unknown — you may need to experiment to find the right number
- Similar items are grouped together — items within a cluster are more similar to each other than to items in other clusters
K-Means Clustering
K-Means is the most widely used clustering algorithm and the one most commonly referenced on the AI-900 exam.
How K-Means Works
1. Choose K: Decide how many clusters you want (K = number of clusters)
2. Initialize centroids: Place K random points as initial cluster centers
3. Assign points: Assign each data point to the nearest centroid
4. Update centroids: Move each centroid to the center (mean) of its assigned points
5. Repeat: Continue assigning and updating until the centroids stop moving (convergence)
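The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `k_means`, the `max_iters` and `seed` parameters, and the sample points are all assumptions made for this example. It initializes the centroids by picking K of the data points at random, which is one common choice among several.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialize centroids: pick k distinct data points at random.
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iters):
        # Assign points: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Convergence: if no assignment changed, the centroids will not move again.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update centroids: move each one to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))

    return centroids, assignments


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # one cluster number per point
```

The cluster numbers themselves are arbitrary. Which group is called 0 and which is called 1 depends on the random initialization, so only the grouping is meaningful.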
K-Means Example: Customer Segmentation
A retail company has purchase data for 10,000 customers (features: annual spending, purchase frequency, average order value). They run K-Means with K=3 and discover:
| Cluster | Characteristics | Label (assigned after clustering) |
|---|---|---|
| Cluster 1 | High spending, frequent purchases, high order value | "Premium Customers" |
| Cluster 2 | Medium spending, moderate frequency | "Regular Customers" |
| Cluster 3 | Low spending, infrequent purchases, low order value | "Casual Shoppers" |
Note: The labels "Premium", "Regular", and "Casual" are NOT part of the algorithm — they are human interpretations applied after clustering.
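To make the point about human-assigned labels concrete, here is a small sketch of the step that follows clustering: summarizing each cluster so that a person can decide what to call it. The customer rows, the cluster numbers, and the helper `cluster_profiles` are invented for illustration. They are not the retail data set described above.

```python
from collections import defaultdict

# Hypothetical rows: (annual_spending, purchases_per_year, avg_order_value, cluster_id).
# The cluster_id column is assumed to come from an earlier K-Means run with K=3.
customers = [
    (5200.0, 50, 104.0, 0),
    (4800.0, 40, 120.0, 0),
    (1200.0, 20, 60.0, 1),
    (1000.0, 16, 62.5, 1),
    (150.0, 3, 50.0, 2),
    (90.0, 2, 45.0, 2),
]


def cluster_profiles(rows):
    """Return the mean of every feature for each cluster."""
    grouped = defaultdict(list)
    for *features, cluster_id in rows:
        grouped[cluster_id].append(features)
    return {
        cid: tuple(sum(col) / len(members) for col in zip(*members))
        for cid, members in grouped.items()
    }


profiles = cluster_profiles(customers)

# A person reads the profiles and then chooses names.
# The algorithm only ever produced the numbers 0, 1, and 2.
segment_names = {0: "Premium Customers", 1: "Regular Customers", 2: "Casual Shoppers"}

for cid, (spend, freq, order_value) in sorted(profiles.items()):
    print(f"Cluster {cid} ({segment_names[cid]}): "
          f"spend={spend:.0f}, purchases={freq:.1f}, order value={order_value:.2f}")
```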
Clustering vs. Classification
This is one of the most commonly tested distinctions on the AI-900 exam:
| Aspect | Classification | Clustering |
|---|---|---|
| Learning type | Supervised | Unsupervised |
| Labels | Required (known categories) | Not used |
| Goal | Assign to known categories | Discover unknown groups |
| Categories | Predefined before training | Discovered during training |
| Example | "Is this email spam?" | "What groups exist in my customer data?" |
| Output | Known class label | Cluster assignment (group number) |
On the Exam: The key differentiator is whether categories are KNOWN in advance. If the question says "categorize into billing, technical, or general" — that is classification (categories are predefined). If the question says "find natural groups in customer data" — that is clustering (groups are discovered).
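The difference also shows up in what each technique takes as input and returns as output. The sketch below is a toy contrast under stated assumptions: `fit_classifier`, `classify`, and `cluster` are hypothetical helpers written for this example, the classifier is a simple nearest-centroid model, and the clustering routine uses a naive initialization (the first K rows) that would not be suitable for real data.

```python
import math


def mean(points):
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def fit_classifier(features, labels):
    """Supervised: needs a known label for every training row."""
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {label: mean(rows) for label, rows in by_label.items()}


def classify(model, x):
    """Returns one of the labels that were supplied during training."""
    return min(model, key=lambda label: math.dist(x, model[label]))


def cluster(features, k, iters=10):
    """Unsupervised: sees only features and returns group numbers."""
    centroids = list(features[:k])  # naive initialization, for illustration only
    for _ in range(iters):
        ids = [min(range(k), key=lambda c: math.dist(x, centroids[c]))
               for x in features]
        for c in range(k):
            members = [x for x, i in zip(features, ids) if i == c]
            if members:
                centroids[c] = mean(members)
    return ids


# Hypothetical 2-D features for four emails.
features = [(0.0, 1.0), (9.0, 8.0), (1.0, 0.0), (8.0, 9.0)]
labels = ["not spam", "spam", "not spam", "spam"]  # only the classifier uses these

model = fit_classifier(features, labels)
predicted = classify(model, (8.0, 8.0))  # a known category name
cluster_ids = cluster(features, k=2)     # group numbers with no inherent meaning
print(predicted, cluster_ids)
```

Note that `cluster` never receives `labels`. It can find that the rows fall into two groups, but it cannot know that one of them corresponds to spam.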
Common Clustering Use Cases
| Use Case | Data | Discovered Clusters |
|---|---|---|
| Customer segmentation | Purchase history, demographics | Customer types (e.g., premium, budget, seasonal) |
| Document grouping | Text features | Topic groups (e.g., sports, politics, technology) |
| Image grouping | Image features | Visual similarity groups |
| Anomaly detection | Any features | Normal clusters + outliers |
| Gene expression | Gene activity levels | Groups of co-regulated genes |
| Market research | Survey responses | Consumer preference segments |
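As one illustration of the anomaly detection row, a point can be flagged as an outlier when it sits far from every cluster center. This is only a sketch of one possible approach: the function `find_outliers`, the centroids, and the threshold value are all assumptions chosen for the example.

```python
import math


def find_outliers(points, centroids, threshold):
    """Return the points whose nearest centroid is farther than `threshold`."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centroids) > threshold
    ]


# Centroids are assumed to come from a previous clustering run.
centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.2, 0.9), (7.8, 8.3), (4.5, 15.0)]

outliers = find_outliers(points, centroids, threshold=3.0)
print(outliers)  # only the point far from both centroids is flagged
```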
Evaluating Clustering Models
Since there are no labels to compare against, clustering evaluation uses different approaches:
| Method | What It Measures |
|---|---|
| Silhouette score | How similar each item is to its own cluster compared with the nearest other cluster (ranges from -1 to 1; higher is better) |
| Within-cluster distance | How tightly packed items are within each cluster (lower is better) |
| Between-cluster distance | How separated clusters are from each other (higher is better) |
| Visual inspection | Plot the clusters and assess if groupings make intuitive sense |
On the Exam: You do not need to calculate clustering metrics. Know that good clustering produces well-separated, cohesive groups where items within a cluster are similar to each other and different from items in other clusters.
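Although the exam does not require calculating these metrics, seeing them computed can make "cohesive" and "well-separated" more concrete. The sketch below implements the silhouette score and a within-cluster measure (sum of squared distances to each cluster's centroid) in plain Python. The function names and sample data are assumptions for this example, and the silhouette function expects at least two clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster.
        dists = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        if own not in dists:  # p is alone in its cluster: score is defined as 0
            scores.append(0.0)
            continue
        a = sum(dists[own]) / len(dists[own])  # cohesion: mean distance to own cluster
        b = min(sum(d) / len(d)                # separation: nearest other cluster
                for lab, d in dists.items() if lab != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centroid = tuple(sum(dim) / len(members) for dim in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


# Hypothetical data: the same points under a sensible and a poor grouping.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
good = [0, 0, 0, 1, 1, 1]
poor = [0, 1, 0, 1, 0, 1]

print(silhouette_score(points, good), silhouette_score(points, poor))
print(within_cluster_ss(points, good), within_cluster_ss(points, poor))
```

The sensible grouping should score a higher silhouette and a lower within-cluster total than the poor one, which matches the "higher is better" and "lower is better" directions in the table.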
Which type of machine learning does clustering use?
A marketing team wants to find natural groups in their customer data based on purchasing behavior, without any predefined categories. Which technique should they use?
What is the key difference between classification and clustering?
In K-Means clustering, what does "K" represent?