Supervised LearningBeginner⏱ 8 min07 / 13

k-Nearest Neighbors

Classify anything by asking its neighbors — learn how k-NN works by computing distances and taking a majority vote, all from scratch.

Imagine you just moved to a new city and you're deciding where to eat dinner. You don't know the area, so you look around: the three nearest restaurants all have long lines of happy customers. "If the locals nearby love it, I probably will too." That gut feeling — judge something by what surrounds it — is the entire idea behind k-Nearest Neighbors.

k-NN is one of the most intuitive algorithms in all of machine learning. It classifies a new data point by looking at the k closest labeled points in the training set and letting them vote. Whoever gets the most votes wins. No equations to solve, no mysterious training phase — just memory and a measuring tape.

#The Core Idea: Majority Vote Among Neighbors

Here is k-NN in three sentences:

Store all your labeled training examples — every point keeps its label.
When a new, unlabeled point arrives, measure its distance to every stored point.
Look at the k closest ones and predict whichever label appears most often.

That's it. There is no model, no weights, no optimization loop. k-NN is sometimes called a lazy learner because it does zero work at training time — all the work happens at prediction time.

Think of it like

The Neighborhood Election

Think of each labeled training point as a house with a political sign in the yard — red or blue. When a stranger arrives at an unknown address, they look at the k nearest houses. If 3 out of 5 nearby houses are red, the newcomer is classified as red.

Increase k and you consider a wider neighborhood — opinions become more stable but may blur boundaries. Decrease k and you zoom in tight — opinions are razor-sharp but one noisy neighbor can swing the vote.

#Measuring Distance: The Euclidean Formula

Before we can find the nearest neighbors, we need to define nearness. The most common choice is Euclidean distance — the straight-line distance you'd measure with a ruler.

For two points A = (x1, y1) and B = (x2, y2), the formula is:

`` distance = sqrt( (x2 - x1)² + (y2 - y1)² ) ``

In plain words: subtract each coordinate, square the difference (so negatives become positive), add them up, then take the square root to get back to real-world units. It's just the Pythagorean theorem — the distance formula from geometry class.

A clean Euclidean distance helper — works for 2D, 3D, or any number of features.

1
2
3
4
5
6
7
8import math

def euclidean_distance(a, b):
    """Distance between two points (tuples of any length)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance((1, 2), (4, 6)))  # classic 3-4-5 triangle
print(euclidean_distance((0, 0), (1, 1)))

#Building k-NN from Scratch

Let's build a complete k-NN classifier in pure Python — no external libraries. We'll classify fruits based on two features: sweetness (1–10) and crunchiness (1–10). Each fruit in our training set is labeled "apple" or "orange".

Under the hood: sort by distance, slice the top k, tally votes. That's the whole algorithm.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39import math

# Training data: (sweetness, crunchiness, label)
training = [
    (7, 9, "apple"),
    (6, 8, "apple"),
    (8, 7, "apple"),
    (9, 2, "orange"),
    (8, 3, "orange"),
    (7, 1, "orange"),
]

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(query, data, k=3):
    # Step 1: compute distance from query to every training point
    distances = []
    for *features, label in data:
        d = euclidean_distance(query, features)
        distances.append((d, label))

    # Step 2: sort by distance, keep the k closest
    distances.sort(key=lambda x: x[0])
    neighbors = distances[:k]

    # Step 3: majority vote
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1

    return max(votes, key=votes.get), neighbors

# Classify a mystery fruit: sweetness=7, crunchiness=8
prediction, neighbors = knn_predict((7, 8), training, k=3)
print("Prediction:", prediction)
print("Nearest neighbors:")
for dist, label in neighbors:
    print(f"  {label}  (distance: {dist:.2f})")

Tip

The Interactive Visualizer

The animation on this page lets you click to place a new point and watch the circles expand outward until they touch k neighbors. Try dragging k up and down to see how the decision boundary changes — it's the fastest way to build intuition for why k matters.

#The Effect of k: Small vs Large

The choice of k is the most important decision in k-NN:

Small k (e.g., k=1): The model listens only to the single nearest neighbor. It draws very jagged, irregular boundaries. One mislabeled or noisy training point can flip a prediction. The model has high variance — it wiggles a lot.
Large k (e.g., k=20): The model considers a wide vote. Boundaries become smooth and stable. But if k is too large, it starts ignoring real local patterns and can underfit — it just predicts the most common class everywhere.
Sweet spot: Try odd values (to avoid ties) in the middle — often somewhere between 3 and 15. Use cross-validation to tune it.

The same query point gets a different label as k grows. More neighbors = a wider view of the neighborhood.

1
2
3
4
5# Same data, different k — watch the prediction flip or hold
for k in [1, 3, 5]:
    pred, nbrs = knn_predict((8, 4), training, k=k)
    labels = [lbl for _, lbl in nbrs]
    print(f"k={k}: neighbors={labels}  -> predicted: {pred}")

Common mistake

Feature Scaling is Critical

If one feature is measured in kilometers and another in millimeters, the large-scale feature will dominate the distance calculation and drown out the small-scale one. Always normalize or standardize your features before using k-NN so that each dimension contributes fairly to the distance. This is the #1 mistake beginners make with k-NN.

#Why 'Lazy' Is Actually a Superpower (and a Weakness)

Lazy learning means k-NN skips the training phase entirely — it just stores the data. This has real advantages:

Instant updates: Add new training examples any time. The model is always current.
No assumptions: k-NN doesn't assume the data is linear, normally distributed, or any particular shape. It adapts to whatever structure the data has.
Works out of the box on surprisingly complex decision boundaries.

But there's a cost:

Slow at prediction time: You must compute distances to every training point for every new query. With millions of points, this gets expensive. (Libraries like scikit-learn use clever data structures like KD-trees to speed this up.)
Memory hungry: The entire training set must live in memory.

Note

k-NN in the Real World

k-NN powers recommendation systems ("customers who bought items similar to yours also bought..."), image recognition, anomaly detection, and medical diagnosis. In scikit-learn, it's just from sklearn.neighbors import KNeighborsClassifier — but now you know exactly what's happening inside that one line.

Quick check

You train a k-NN classifier with k=1 and get perfect accuracy on the training data, but poor accuracy on new test data. What is most likely happening?

Key takeaways

k-NN classifies a new point by finding the k closest training examples and taking a majority vote — no math beyond counting and distance.
Euclidean distance is the measuring tape: subtract coordinates, square, sum, square-root.
Small k = high variance (noisy, jagged boundaries); large k = high bias (smooth but possibly wrong). Tune k with cross-validation.
k-NN is a lazy learner: zero training time, but prediction scans the whole dataset — great for small data, slow for huge datasets.
Always normalize your features before using k-NN, or a single large-scale feature will dominate all distance calculations.

Try it yourself · Nearest neighbors

Move the new point and see how its k neighbors vote on its class.

k =

prediction

Class A

The square is a new point. It's classified by a majority vote of its 3 nearest neighbors. Click anywhere to move it.

Practice challenges

Test yourself · earn XP

0/4

Predict the output#1

This is the Euclidean distance helper from the lesson. What does it print?

predict-output

1
2
3
4
5
6import math

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance((0, 0), (6, 8)))

Fix the bug#2

This k-NN classifier is supposed to keep the k CLOSEST neighbors, but it predicts the wrong label. What's wrong?

fix-bug

1
2
3
4
5
6
7
8
9
10
11
12
13def knn_predict(query, data, k=3):
    distances = []
    for *features, label in data:
        d = euclidean_distance(query, features)
        distances.append((d, label))

    distances.sort(key=lambda x: x[0], reverse=True)
    neighbors = distances[:k]

    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

Fill in the blank#3

Complete the two core steps of k-NN prediction: after computing distances, sort them ascending, then keep only the k nearest neighbors.

def knn_predict(query, data, k=3):
    distances = []
    for *features, label in data:
        d = euclidean_distance(query, features)
        distances.append((d, label))

    distances.sort(key=lambda x: x[0])
    neighbors = distances[:]

    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

Reorder the lines#4

Put these lines in the correct order to make a k-NN prediction: compute every distance, sort by nearest, keep the k closest, tally the votes, and return the winning label.

distances.sort(key=lambda x: x[0])

return max(votes, key=votes.get)

distances = [(euclidean_distance(query, feat), lbl) for *feat, lbl in data]

votes = {}

neighbors = distances[:k]

for _, label in neighbors:
    votes[label] = votes.get(label, 0) + 1

Your turn

Practice exercise

You have a small dataset of animals described by two features: weight (kg) and speed (km/h). Each animal is labeled "predator" or "prey".

Complete the knn_classify function below so that it: 1. Computes the Euclidean distance from the query point to every point in data. 2. Sorts the results and picks the k nearest neighbors. 3. Returns the majority-vote label.

Then call it to classify a mystery animal with weight=15 and speed=40, using k=3.

Try it live — edit the code and hit Run to execute real Python:

solution.py · editable

import math

data = [
    (5,  60, "prey"),
    (3,  70, "prey"),
    (8,  50, "prey"),
    (80, 30, "predator"),
    (60, 25, "predator"),
    (90, 20, "predator"),
]

def knn_classify(query, data, k=3):
    # 1. Compute distances
    distances = []
    for row in data:
        features = row[:-1]   # all but last element
        label    = row[-1]
        # YOUR CODE: compute euclidean distance and append (dist, label)
        pass

    # 2. Sort and pick top k
    # YOUR CODE

    # 3. Majority vote and return the winning label
    # YOUR CODE
    pass

result = knn_classify((15, 40), data, k=3)
print("Classification:", result)