Unsupervised LearningIntermediate8 min09 / 13

k-Means Clustering

Learn how k-Means groups unlabeled data into clusters by repeatedly assigning points to their nearest centroid and moving centroids to the center of their group.

Imagine you're a biologist who has just measured the petal length and width of 150 flowers — but nobody told you which species they belong to. You look at the scatter of points on your graph and notice they seem to clump into a few groups. Can an algorithm find those groups automatically, without any labels?

That's exactly what k-Means Clustering does. It's one of the most popular algorithms in all of machine learning, used everywhere from customer segmentation and image compression to anomaly detection and genetics. And unlike the supervised methods you may have seen before, k-Means works with unlabeled data — it discovers structure on its own.

#Supervised vs. Unsupervised Learning

In supervised learning, every training example has a label — you know the right answer and you train the model to predict it. In unsupervised learning, there are no labels. The algorithm finds patterns in raw data by itself.

Clustering is the most common form of unsupervised learning. The goal: divide data points into clusters so that points inside the same cluster are more similar to each other than to points in other clusters. k-Means is the classic algorithm for this — and it's surprisingly simple once you see the loop.

Think of it like

Sorting a Bag of Mixed Candy

Picture tipping a big bag of mixed candy onto a table. You don't have instructions — you just start sorting by similarity: gummy bears in one pile, hard candies in another, chocolates in a third. You keep adjusting the piles until everything feels right. k-Means does the same thing with numbers, using distance as its measure of similarity.

#The k-Means Algorithm, Step by Step

The algorithm revolves around special points called centroids — one per cluster. A centroid is the center point of a cluster. Here's the full loop:

  1. Choose k — decide how many clusters you want.
  2. Initialize — place k centroids at random positions (or pick k random data points).
  3. Assign — for every data point, find the centroid that is closest to it and assign the point to that cluster.
  4. Move — for each cluster, compute the mean (average position) of all its points. Move the centroid there.
  5. Repeat steps 3 and 4 until the centroids stop moving.

k-Means uses Euclidean distance — the straight-line distance between two points. For points (x1, y1) and (x2, y2), that's sqrt((x2-x1)**2 + (y2-y1)**2). The algorithm works in any number of dimensions, not just 2D. The animated visualizer on this page shows you the centroids moving with each iteration — it really clicks once you watch it happen.

One full round: assign each point to the nearest centroid, then move the centroid to the mean of its points.
import math

def distance(p1, p2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

def assign_clusters(points, centroids):
    assignments = []
    for point in points:
        dists = [distance(point, c) for c in centroids]
        assignments.append(dists.index(min(dists)))
    return assignments

def move_centroids(points, assignments, k):
    new_centroids = []
    for cid in range(k):
        pts = [p for p, a in zip(points, assignments) if a == cid]
        dims = len(pts[0])
        mean = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dims))
        new_centroids.append(mean)
    return new_centroids

points    = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5)]
centroids = [(1, 1), (5, 5)]

assigned = assign_clusters(points, centroids)
print("Assignments:", assigned)

new_c = move_centroids(points, assigned, k=2)
for i, c in enumerate(new_c):
    print(f"New centroid {i}: ({c[0]:.2f}, {c[1]:.2f})")

#The Full Loop

The complete k-Means loop: assign, move, repeat until nothing changes.
def kmeans(points, k, max_iters=10):
    centroids = points[:k]  # pick first k points as starting centroids
    for i in range(max_iters):
        assignments   = assign_clusters(points, centroids)
        new_centroids = move_centroids(points, assignments, k)
        if new_centroids == centroids:
            print(f"Converged after {i + 1} iteration(s).")
            break
        centroids = new_centroids
    return assignments, centroids

assignments, final_c = kmeans(points, k=2)
print("Final assignments:", assignments)
for i, c in enumerate(final_c):
    print(f"Final centroid {i}: ({c[0]:.2f}, {c[1]:.2f})")

#Choosing k: The Elbow Method

k-Means requires you to choose k before running. But how do you know the right number of clusters?

The elbow method: run k-Means for k = 1, 2, 3, … and each time measure the within-cluster sum of squared distances (WCSS) — how far each point is from its centroid, squared, then summed. As k increases, WCSS always shrinks (more clusters = closer centers). But at some point the improvement drops sharply, creating an elbow in the plot. That elbow is usually the best k.

In practice, scikit-learn's sklearn.cluster.KMeans handles all of this, but knowing the from-scratch mechanics means you'll never treat it as a black box.

Common mistake

k-Means Can Get Stuck in Bad Solutions

Because initial centroids are placed randomly, k-Means can converge to a local optimum — clusters that look okay but aren't the global best. The fix: run k-Means multiple times with different random starts and keep the best result. Scikit-learn does this automatically (n_init=10 by default). The k-Means++ initialization strategy also helps by spreading starting centroids out deliberately, giving the algorithm a smarter head start.

Quick check

After the assignment step, what does k-Means do to update each centroid?

Tip

When to Use k-Means (and When Not To)

k-Means works great when clusters are roughly spherical and similarly sized. It struggles with elongated, nested, or very differently sized clusters. For those cases, DBSCAN (density-based) or Gaussian Mixture Models are better choices. Always visualize your data first when you can!

Key takeaways

  • k-Means is an unsupervised algorithm — it finds structure in unlabeled data by grouping points into k clusters.
  • The algorithm loops: assign each point to the nearest centroid, then move each centroid to the mean of its assigned points, until nothing changes.
  • Distance is Euclidean (straight-line), and the algorithm works in any number of dimensions.
  • The elbow method helps pick k by finding where adding more clusters stops giving big WCSS improvements.
  • Random initialization can trap k-Means in suboptimal solutions; run it multiple times or use k-Means++ initialization.
Try it yourself · Clustering in motion
Assign, move, repeat — watch centroids settle into clusters.
clusters k =
① assign points to nearest centroid

k-means repeats two steps until stable: assign each point to its nearest star (centroid), then move each star to the middle of its points.

step 1 / 6
Practice challenges
Test yourself · earn XP
0/5
Predict the output#1

Using the lesson's distance helper, what does this snippet print? It measures which centroid the point (3, 4) is closest to.

predict-output
import math

def distance(p1, p2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

point     = (3, 4)
centroids = [(1, 1), (5, 5)]

dists = [distance(point, c) for c in centroids]
print(dists.index(min(dists)))
Predict the output#2

In the assignment step, each point joins its nearest centroid. Given these points and centroids, what does this print?

predict-output
import math

def distance(p1, p2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

def assign_clusters(points, centroids):
    out = []
    for point in points:
        dists = [distance(point, c) for c in centroids]
        out.append(dists.index(min(dists)))
    return out

points    = [(1, 1), (1.5, 2), (5, 7)]
centroids = [(1, 1), (5, 5)]
print(assign_clusters(points, centroids))
Fill in the blank#3

Complete the centroid update (the 'move' step). Each new centroid is the ____ position of all points assigned to that cluster. Fill in the operation that turns a sum of coordinates into that value.

def move_centroids(points, assignments, k):
    new_centroids = []
    for cid in range(k):
        pts = [p for p, a in zip(points, assignments) if a == cid]
        dims = len(pts[0])
        mean = tuple(sum(p[d] for p in pts)  len(pts) for d in range(dims))
        new_centroids.append(mean)
    return new_centroids
Reorder the lines#4

Put the body of the k-Means loop in the correct order for one iteration, including the convergence check that stops when centroids stop moving.

1
centroids = new_centroids
2
    break
3
new_centroids = move_centroids(points, assignments, k)
4
if new_centroids == centroids:
5
assignments   = assign_clusters(points, centroids)
Fix the bug#5

This code is meant to compute the within-cluster sum of squared distances (WCSS) for the elbow method. It has a bug — what's wrong?

fix-bug
import math

def wcss(points, assignments, centroids):
    total = 0
    for point, cid in zip(points, assignments):
        c = centroids[cid]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c)))
    return total
Your turn
Practice exercise

Complete the wcss function below. It should compute the within-cluster sum of squared distances — for every point, calculate its squared distance to its assigned centroid and add it to the total. Run it on the provided data and print the result.

Try it live — edit the code and hit Run to execute real Python:

solution.py · editable