k-Means Clustering
Learn how k-Means groups unlabeled data into clusters by repeatedly assigning points to their nearest centroid and moving centroids to the center of their group.
Imagine you're a biologist who has just measured the petal length and width of 150 flowers — but nobody told you which species they belong to. You look at the scatter of points on your graph and notice they seem to clump into a few groups. Can an algorithm find those groups automatically, without any labels?
That's exactly what k-Means Clustering does. It's one of the most popular algorithms in all of machine learning, used everywhere from customer segmentation and image compression to anomaly detection and genetics. And unlike the supervised methods you may have seen before, k-Means works with unlabeled data — it discovers structure on its own.
#Supervised vs. Unsupervised Learning
In supervised learning, every training example has a label — you know the right answer and you train the model to predict it. In unsupervised learning, there are no labels. The algorithm finds patterns in raw data by itself.
Clustering is the most common form of unsupervised learning. The goal: divide data points into clusters so that points inside the same cluster are more similar to each other than to points in other clusters. k-Means is the classic algorithm for this — and it's surprisingly simple once you see the loop.
Sorting a Bag of Mixed Candy
Picture tipping a big bag of mixed candy onto a table. You don't have instructions — you just start sorting by similarity: gummy bears in one pile, hard candies in another, chocolates in a third. You keep adjusting the piles until everything feels right. k-Means does the same thing with numbers, using distance as its measure of similarity.
#The k-Means Algorithm, Step by Step
The algorithm revolves around special points called centroids — one per cluster. A centroid is the center point of a cluster. Here's the full loop:
- Choose k — decide how many clusters you want.
- Initialize — place k centroids at random positions (or pick k random data points).
- Assign — for every data point, find the centroid that is closest to it and assign the point to that cluster.
- Move — for each cluster, compute the mean (average position) of all its points. Move the centroid there.
- Repeat steps 3 and 4 until the centroids stop moving.
k-Means uses Euclidean distance — the straight-line distance between two points. For points (x1, y1) and (x2, y2), that's sqrt((x2-x1)**2 + (y2-y1)**2). The algorithm works in any number of dimensions, not just 2D. The animated visualizer on this page shows you the centroids moving with each iteration — it really clicks once you watch it happen.
import math
def distance(p1, p2):
return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
def assign_clusters(points, centroids):
assignments = []
for point in points:
dists = [distance(point, c) for c in centroids]
assignments.append(dists.index(min(dists)))
return assignments
def move_centroids(points, assignments, k):
new_centroids = []
for cid in range(k):
pts = [p for p, a in zip(points, assignments) if a == cid]
dims = len(pts[0])
mean = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dims))
new_centroids.append(mean)
return new_centroids
points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5)]
centroids = [(1, 1), (5, 5)]
assigned = assign_clusters(points, centroids)
print("Assignments:", assigned)
new_c = move_centroids(points, assigned, k=2)
for i, c in enumerate(new_c):
print(f"New centroid {i}: ({c[0]:.2f}, {c[1]:.2f})")#The Full Loop
def kmeans(points, k, max_iters=10):
centroids = points[:k] # pick first k points as starting centroids
for i in range(max_iters):
assignments = assign_clusters(points, centroids)
new_centroids = move_centroids(points, assignments, k)
if new_centroids == centroids:
print(f"Converged after {i + 1} iteration(s).")
break
centroids = new_centroids
return assignments, centroids
assignments, final_c = kmeans(points, k=2)
print("Final assignments:", assignments)
for i, c in enumerate(final_c):
print(f"Final centroid {i}: ({c[0]:.2f}, {c[1]:.2f})")#Choosing k: The Elbow Method
k-Means requires you to choose k before running. But how do you know the right number of clusters?
The elbow method: run k-Means for k = 1, 2, 3, … and each time measure the within-cluster sum of squared distances (WCSS) — how far each point is from its centroid, squared, then summed. As k increases, WCSS always shrinks (more clusters = closer centers). But at some point the improvement drops sharply, creating an elbow in the plot. That elbow is usually the best k.
In practice, scikit-learn's sklearn.cluster.KMeans handles all of this, but knowing the from-scratch mechanics means you'll never treat it as a black box.
k-Means Can Get Stuck in Bad Solutions
Because initial centroids are placed randomly, k-Means can converge to a local optimum — clusters that look okay but aren't the global best. The fix: run k-Means multiple times with different random starts and keep the best result. Scikit-learn does this automatically (n_init=10 by default). The k-Means++ initialization strategy also helps by spreading starting centroids out deliberately, giving the algorithm a smarter head start.
After the assignment step, what does k-Means do to update each centroid?
When to Use k-Means (and When Not To)
k-Means works great when clusters are roughly spherical and similarly sized. It struggles with elongated, nested, or very differently sized clusters. For those cases, DBSCAN (density-based) or Gaussian Mixture Models are better choices. Always visualize your data first when you can!
Key takeaways
- k-Means is an unsupervised algorithm — it finds structure in unlabeled data by grouping points into k clusters.
- The algorithm loops: assign each point to the nearest centroid, then move each centroid to the mean of its assigned points, until nothing changes.
- Distance is Euclidean (straight-line), and the algorithm works in any number of dimensions.
- The elbow method helps pick k by finding where adding more clusters stops giving big WCSS improvements.
- Random initialization can trap k-Means in suboptimal solutions; run it multiple times or use k-Means++ initialization.
k-means repeats two steps until stable: assign each point to its nearest star (centroid), then move each star to the middle of its points.
Using the lesson's distance helper, what does this snippet print? It measures which centroid the point (3, 4) is closest to.
import math
def distance(p1, p2):
return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
point = (3, 4)
centroids = [(1, 1), (5, 5)]
dists = [distance(point, c) for c in centroids]
print(dists.index(min(dists)))In the assignment step, each point joins its nearest centroid. Given these points and centroids, what does this print?
import math
def distance(p1, p2):
return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
def assign_clusters(points, centroids):
out = []
for point in points:
dists = [distance(point, c) for c in centroids]
out.append(dists.index(min(dists)))
return out
points = [(1, 1), (1.5, 2), (5, 7)]
centroids = [(1, 1), (5, 5)]
print(assign_clusters(points, centroids))Complete the centroid update (the 'move' step). Each new centroid is the ____ position of all points assigned to that cluster. Fill in the operation that turns a sum of coordinates into that value.
def move_centroids(points, assignments, k): new_centroids = [] for cid in range(k): pts = [p for p, a in zip(points, assignments) if a == cid] dims = len(pts[0]) mean = tuple(sum(p[d] for p in pts) len(pts) for d in range(dims)) new_centroids.append(mean) return new_centroids
Put the body of the k-Means loop in the correct order for one iteration, including the convergence check that stops when centroids stop moving.
centroids = new_centroids
break
new_centroids = move_centroids(points, assignments, k)
if new_centroids == centroids:
assignments = assign_clusters(points, centroids)
This code is meant to compute the within-cluster sum of squared distances (WCSS) for the elbow method. It has a bug — what's wrong?
import math
def wcss(points, assignments, centroids):
total = 0
for point, cid in zip(points, assignments):
c = centroids[cid]
total += math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c)))
return totalComplete the wcss function below. It should compute the within-cluster sum of squared distances — for every point, calculate its squared distance to its assigned centroid and add it to the total. Run it on the provided data and print the result.
Try it live — edit the code and hit Run to execute real Python: