Supervised LearningBeginner8 min07 / 13

k-Nearest Neighbors

Classify anything by asking its neighbors — learn how k-NN works by computing distances and taking a majority vote, all from scratch.

Imagine you just moved to a new city and you're deciding where to eat dinner. You don't know the area, so you look around: the three nearest restaurants all have long lines of happy customers. "If the locals nearby love it, I probably will too." That gut feeling — judge something by what surrounds it — is the entire idea behind k-Nearest Neighbors.

k-NN is one of the most intuitive algorithms in all of machine learning. It classifies a new data point by looking at the k closest labeled points in the training set and letting them vote. Whoever gets the most votes wins. No equations to solve, no mysterious training phase — just memory and a measuring tape.

#The Core Idea: Majority Vote Among Neighbors

Here is k-NN in three sentences:

  1. Store all your labeled training examples — every point keeps its label.
  2. When a new, unlabeled point arrives, measure its distance to every stored point.
  3. Look at the k closest ones and predict whichever label appears most often.

That's it. There is no model, no weights, no optimization loop. k-NN is sometimes called a lazy learner because it does zero work at training time — all the work happens at prediction time.

Think of it like

The Neighborhood Election

Think of each labeled training point as a house with a political sign in the yard — red or blue. When a stranger arrives at an unknown address, they look at the k nearest houses. If 3 out of 5 nearby houses are red, the newcomer is classified as red.

Increase k and you consider a wider neighborhood — opinions become more stable but may blur boundaries. Decrease k and you zoom in tight — opinions are razor-sharp but one noisy neighbor can swing the vote.

#Measuring Distance: The Euclidean Formula

Before we can find the nearest neighbors, we need to define nearness. The most common choice is Euclidean distance — the straight-line distance you'd measure with a ruler.

For two points A = (x1, y1) and B = (x2, y2), the formula is:

`` distance = sqrt( (x2 - x1)² + (y2 - y1)² ) ``

In plain words: subtract each coordinate, square the difference (so negatives become positive), add them up, then take the square root to get back to real-world units. It's just the Pythagorean theorem — the distance formula from geometry class.

A clean Euclidean distance helper — works for 2D, 3D, or any number of features.
import math

def euclidean_distance(a, b):
    """Distance between two points (tuples of any length)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance((1, 2), (4, 6)))  # classic 3-4-5 triangle
print(euclidean_distance((0, 0), (1, 1)))

#Building k-NN from Scratch

Let's build a complete k-NN classifier in pure Python — no external libraries. We'll classify fruits based on two features: sweetness (1–10) and crunchiness (1–10). Each fruit in our training set is labeled "apple" or "orange".

Under the hood: sort by distance, slice the top k, tally votes. That's the whole algorithm.
import math

# Training data: (sweetness, crunchiness, label)
training = [
    (7, 9, "apple"),
    (6, 8, "apple"),
    (8, 7, "apple"),
    (9, 2, "orange"),
    (8, 3, "orange"),
    (7, 1, "orange"),
]

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(query, data, k=3):
    # Step 1: compute distance from query to every training point
    distances = []
    for *features, label in data:
        d = euclidean_distance(query, features)
        distances.append((d, label))

    # Step 2: sort by distance, keep the k closest
    distances.sort(key=lambda x: x[0])
    neighbors = distances[:k]

    # Step 3: majority vote
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1

    return max(votes, key=votes.get), neighbors

# Classify a mystery fruit: sweetness=7, crunchiness=8
prediction, neighbors = knn_predict((7, 8), training, k=3)
print("Prediction:", prediction)
print("Nearest neighbors:")
for dist, label in neighbors:
    print(f"  {label}  (distance: {dist:.2f})")
Tip

The Interactive Visualizer

The animation on this page lets you click to place a new point and watch the circles expand outward until they touch k neighbors. Try dragging k up and down to see how the decision boundary changes — it's the fastest way to build intuition for why k matters.

#The Effect of k: Small vs Large

The choice of k is the most important decision in k-NN:

  • Small k (e.g., k=1): The model listens only to the single nearest neighbor. It draws very jagged, irregular boundaries. One mislabeled or noisy training point can flip a prediction. The model has high variance — it wiggles a lot.
  • Large k (e.g., k=20): The model considers a wide vote. Boundaries become smooth and stable. But if k is too large, it starts ignoring real local patterns and can underfit — it just predicts the most common class everywhere.
  • Sweet spot: Try odd values (to avoid ties) in the middle — often somewhere between 3 and 15. Use cross-validation to tune it.
The same query point gets a different label as k grows. More neighbors = a wider view of the neighborhood.
# Same data, different k — watch the prediction flip or hold
for k in [1, 3, 5]:
    pred, nbrs = knn_predict((8, 4), training, k=k)
    labels = [lbl for _, lbl in nbrs]
    print(f"k={k}: neighbors={labels}  -> predicted: {pred}")
Common mistake

Feature Scaling is Critical

If one feature is measured in kilometers and another in millimeters, the large-scale feature will dominate the distance calculation and drown out the small-scale one. Always normalize or standardize your features before using k-NN so that each dimension contributes fairly to the distance. This is the #1 mistake beginners make with k-NN.

#Why 'Lazy' Is Actually a Superpower (and a Weakness)

Lazy learning means k-NN skips the training phase entirely — it just stores the data. This has real advantages:

  • Instant updates: Add new training examples any time. The model is always current.
  • No assumptions: k-NN doesn't assume the data is linear, normally distributed, or any particular shape. It adapts to whatever structure the data has.
  • Works out of the box on surprisingly complex decision boundaries.

But there's a cost:

  • Slow at prediction time: You must compute distances to every training point for every new query. With millions of points, this gets expensive. (Libraries like scikit-learn use clever data structures like KD-trees to speed this up.)
  • Memory hungry: The entire training set must live in memory.
Note

k-NN in the Real World

k-NN powers recommendation systems ("customers who bought items similar to yours also bought..."), image recognition, anomaly detection, and medical diagnosis. In scikit-learn, it's just from sklearn.neighbors import KNeighborsClassifier — but now you know exactly what's happening inside that one line.

Quick check

You train a k-NN classifier with k=1 and get perfect accuracy on the training data, but poor accuracy on new test data. What is most likely happening?

Key takeaways

  • k-NN classifies a new point by finding the k closest training examples and taking a majority vote — no math beyond counting and distance.
  • Euclidean distance is the measuring tape: subtract coordinates, square, sum, square-root.
  • Small k = high variance (noisy, jagged boundaries); large k = high bias (smooth but possibly wrong). Tune k with cross-validation.
  • k-NN is a lazy learner: zero training time, but prediction scans the whole dataset — great for small data, slow for huge datasets.
  • Always normalize your features before using k-NN, or a single large-scale feature will dominate all distance calculations.
Try it yourself · Nearest neighbors
Move the new point and see how its k neighbors vote on its class.
k =
prediction
Class A

The square is a new point. It's classified by a majority vote of its 3 nearest neighbors. Click anywhere to move it.

Practice challenges
Test yourself · earn XP
0/4
Predict the output#1

This is the Euclidean distance helper from the lesson. What does it print?

predict-output
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance((0, 0), (6, 8)))
Fix the bug#2

This k-NN classifier is supposed to keep the k CLOSEST neighbors, but it predicts the wrong label. What's wrong?

fix-bug
def knn_predict(query, data, k=3):
    distances = []
    for *features, label in data:
        d = euclidean_distance(query, features)
        distances.append((d, label))

    distances.sort(key=lambda x: x[0], reverse=True)
    neighbors = distances[:k]

    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
Fill in the blank#3

Complete the two core steps of k-NN prediction: after computing distances, sort them ascending, then keep only the k nearest neighbors.

def knn_predict(query, data, k=3):
    distances = []
    for *features, label in data:
        d = euclidean_distance(query, features)
        distances.append((d, label))

    distances.sort(key=lambda x: x[0])
    neighbors = distances[:]

    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
Reorder the lines#4

Put these lines in the correct order to make a k-NN prediction: compute every distance, sort by nearest, keep the k closest, tally the votes, and return the winning label.

1
distances.sort(key=lambda x: x[0])
2
return max(votes, key=votes.get)
3
distances = [(euclidean_distance(query, feat), lbl) for *feat, lbl in data]
4
votes = {}
5
neighbors = distances[:k]
6
for _, label in neighbors:
    votes[label] = votes.get(label, 0) + 1
Your turn
Practice exercise

You have a small dataset of animals described by two features: weight (kg) and speed (km/h). Each animal is labeled "predator" or "prey".

Complete the knn_classify function below so that it: 1. Computes the Euclidean distance from the query point to every point in data. 2. Sorts the results and picks the k nearest neighbors. 3. Returns the majority-vote label.

Then call it to classify a mystery animal with weight=15 and speed=40, using k=3.

Try it live — edit the code and hit Run to execute real Python:

solution.py · editable