2.4 Clustering Models

Key Takeaways

  • Clustering is an unsupervised learning technique that groups similar data points together WITHOUT predefined labels.
  • K-Means clustering partitions data into K groups by minimizing the total squared distance between each data point and the center (centroid) of its assigned cluster.
  • Unlike classification, clustering discovers natural groupings in data — the groups are not known in advance.
  • Common use cases include customer segmentation, document grouping, anomaly detection, and market research.
  • Clustering evaluation focuses on how well-separated and cohesive the clusters are, not on matching predefined labels.
Last updated: March 2026

Quick Answer: Clustering is an unsupervised learning technique that groups similar data points together without predefined labels. K-Means is the most common clustering algorithm. Unlike classification (which assigns to known categories), clustering discovers new groups in the data.

What Is Clustering?

Clustering is an unsupervised machine learning technique that identifies natural groups (clusters) in data. The model has NO labeled data — it discovers patterns and groupings on its own.

Key Characteristics of Clustering

  • No labels required — only features (inputs), no predefined categories
  • Discovers natural groups — the algorithm finds the groupings
  • Number of groups may be unknown — you may need to experiment to find the right number
  • Similar items are grouped together — items within a cluster are more similar to each other than to items in other clusters

K-Means Clustering

K-Means is the most widely used clustering algorithm and the one most commonly referenced on the AI-900 exam.

How K-Means Works

  1. Choose K — Decide how many clusters you want (K = number of clusters)
  2. Initialize centroids — Place K random points as initial cluster centers
  3. Assign points — Assign each data point to the nearest centroid
  4. Update centroids — Move each centroid to the center of its assigned points
  5. Repeat — Continue assigning and updating until centroids stop moving (convergence)
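
The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `k_means`, the `max_iters` safety cap, the `seed` parameter, and the sample points are all assumptions made for this example. It also initializes centroids by picking K of the data points at random, which is one common way to "place K random points".

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(dim) / len(members) for dim in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Step 1: choose K. The six hypothetical points form two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The cluster numbers themselves (0 or 1) carry no meaning; only the fact that points share a number matters.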

K-Means Example: Customer Segmentation

A retail company has purchase data for 10,000 customers (features: annual spending, purchase frequency, average order value). They run K-Means with K=3 and discover:

Cluster | Characteristics | Label (assigned after clustering)
Cluster 1 | High spending, frequent purchases, high order value | "Premium Customers"
Cluster 2 | Medium spending, moderate frequency | "Regular Customers"
Cluster 3 | Low spending, infrequent purchases, low order value | "Casual Shoppers"

Note: The labels "Premium", "Regular", and "Casual" are NOT part of the algorithm — they are human interpretations applied after clustering.
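
A scaled-down version of this scenario might look like the sketch below. Everything in it is hypothetical: the nine customer records, the helper names, the min-max scaling step (added so that annual spending does not dominate the distance), and the deterministic "farthest-first" initialization (used here only to keep the example reproducible). The naming step at the end is the human interpretation, applied after the algorithm has finished.

```python
import math

# Hypothetical customers: (annual spending, purchases per year, average order value).
customers = [
    (5200, 48, 110), (2000, 30, 67), (200, 6, 33),
    (4800, 52, 95), (1800, 28, 64), (150, 5, 30),
    (5500, 45, 120), (2200, 32, 69), (250, 7, 36),
]


def min_max_scale(rows):
    """Rescale every feature to the range 0..1."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def farthest_first(points, k):
    """Start at the first point, then keep adding the point that is
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iters=100):
    """Return one cluster number per point."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(max_iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


labels = k_means(min_max_scale(customers), k=3)


# Human interpretation step (not part of K-Means): rank the clusters by
# mean annual spending and give each one a business-friendly name.
def mean_spending(cluster_id):
    spend = [c[0] for c, lab in zip(customers, labels) if lab == cluster_id]
    return sum(spend) / len(spend)


ranked = sorted(set(labels), key=mean_spending, reverse=True)
names = dict(zip(ranked, ["Premium Customers", "Regular Customers", "Casual Shoppers"]))

for customer, lab in zip(customers, labels):
    print(customer, "-> cluster", lab, "->", names[lab])
```

K-Means only returns the cluster numbers; the `names` dictionary is built afterward by a person (or a rule a person wrote) looking at what each cluster contains.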

Clustering vs. Classification

This is one of the most commonly tested distinctions on the AI-900 exam:

Aspect | Classification | Clustering
Learning type | Supervised | Unsupervised
Labels | Required (known categories) | Not used
Goal | Assign to known categories | Discover unknown groups
Categories | Predefined before training | Discovered during training
Example | "Is this email spam?" | "What groups exist in my customer data?"
Output | Known class label | Cluster assignment (group number)

On the Exam: The key differentiator is whether categories are KNOWN in advance. If the question says "categorize into billing, technical, or general" — that is classification (categories are predefined). If the question says "find natural groups in customer data" — that is clustering (groups are discovered).
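
The difference in inputs and outputs can be made concrete with a small sketch. The feature values, the label names, and the helpers `classify` and `two_means` are hypothetical, and the nearest-centroid classifier is just one simple stand-in for a supervised model. The point to notice is that the classifier is given labels and returns one of them, while the clustering function sees only features and returns group numbers.

```python
import math

# Hypothetical 2-D feature vectors (for example, message length and link count).
features = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]


def mean(points):
    return tuple(sum(d) / len(points) for d in zip(*points))


# Classification (supervised): known labels come with the training data.
train_labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]
class_centroids = {
    label: mean([f for f, l in zip(features, train_labels) if l == label])
    for label in set(train_labels)
}


def classify(point):
    """Return the known class whose centroid is nearest to the point."""
    return min(
        class_centroids,
        key=lambda label: math.dist(point, class_centroids[label]),
    )


# Clustering (unsupervised): only features are supplied, no labels.
def two_means(points, iters=10):
    """Tiny K-Means with K=2, started from the first point and the
    point farthest from it. Returns a group number for each point."""
    centroids = [points[0], max(points, key=lambda p: math.dist(p, points[0]))]
    groups = []
    for _ in range(iters):
        groups = [
            min((0, 1), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        centroids = [
            mean([p for p, g in zip(points, groups) if g == c]) for c in (0, 1)
        ]
    return groups


print(classify((4.7, 5.1)))  # a predefined category name
print(two_means(features))   # group numbers with no built-in meaning
```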

Common Clustering Use Cases

Use Case | Data | Discovered Clusters
Customer segmentation | Purchase history, demographics | Customer types (e.g., premium, budget, seasonal)
Document grouping | Text features | Topic groups (e.g., sports, politics, technology)
Image grouping | Image features | Visual similarity groups
Anomaly detection | Any features | Normal clusters + outliers
Gene expression | Gene activity levels | Groups of co-regulated genes
Market research | Survey responses | Consumer preference segments

Evaluating Clustering Models

Since there are no labels to compare against, clustering evaluation uses different approaches:

Method | What It Measures
Silhouette score | How similar items are to their own cluster vs. other clusters (-1 to 1, higher is better)
Within-cluster distance | How tightly packed items are within each cluster (lower is better)
Between-cluster distance | How separated clusters are from each other (higher is better)
Visual inspection | Plot the clusters and assess if groupings make intuitive sense

On the Exam: You do not need to calculate clustering metrics. Know that good clustering produces well-separated, cohesive groups where items within a cluster are similar to each other and different from items in other clusters.
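
For readers who want to see what these metrics look like in practice, here is an optional sketch (not required for the exam). The function names, the sample points, and the two candidate labelings are assumptions for illustration. The within-cluster measure shown is the mean distance from each point to its own cluster's centroid, which is one of several ways to quantify cohesion. The silhouette function assumes at least two clusters and points that are not all identical.

```python
import math


def within_cluster_distance(points, labels):
    """Mean distance from each point to its own cluster's centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(d) / len(members) for d in zip(*members)]
        total += sum(math.dist(p, centroid) for p in members)
    return total / len(points)


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    scores = []
    for i, p in enumerate(points):
        own = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            scores.append(0.0)  # convention: a single-point cluster scores 0
            continue
        # a: cohesion, the mean distance to the other points in the same cluster.
        a = sum(own) / len(own)
        # b: separation, the mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in set(labels)
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good = [0, 0, 0, 1, 1, 1]  # matches the two visible groups
bad = [0, 1, 0, 1, 0, 1]   # mixes the two groups together

print("silhouette:", silhouette_score(points, good), silhouette_score(points, bad))
print("within-cluster:", within_cluster_distance(points, good), within_cluster_distance(points, bad))
```

The labeling that matches the visible groups scores a higher silhouette and a lower within-cluster distance than the mixed labeling, which is the pattern the table above describes.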

Test Your Knowledge

Which type of machine learning does clustering use?


A marketing team wants to find natural groups in their customer data based on purchasing behavior, without any predefined categories. Which technique should they use?


What is the key difference between classification and clustering?


In K-Means clustering, what does "K" represent?
