k-Nearest Neighbors
Classify anything by asking its neighbors — learn how k-NN works by computing distances and taking a majority vote, all from scratch.
Imagine you just moved to a new city and you're deciding where to eat dinner. You don't know the area, so you look around: the three nearest restaurants all have long lines of happy customers. "If the locals nearby love it, I probably will too." That gut feeling — judge something by what surrounds it — is the entire idea behind k-Nearest Neighbors.
k-NN is one of the most intuitive algorithms in all of machine learning. It classifies a new data point by looking at the k closest labeled points in the training set and letting them vote. Whoever gets the most votes wins. No equations to solve, no mysterious training phase — just memory and a measuring tape.
#The Core Idea: Majority Vote Among Neighbors
Here is k-NN in three sentences:
- Store all your labeled training examples — every point keeps its label.
- When a new, unlabeled point arrives, measure its distance to every stored point.
- Look at the k closest ones and predict whichever label appears most often.
That's it. There is no model, no weights, no optimization loop. k-NN is sometimes called a lazy learner because it does zero work at training time — all the work happens at prediction time.
The Neighborhood Election
Think of each labeled training point as a house with a political sign in the yard — red or blue. When a stranger arrives at an unknown address, they look at the k nearest houses. If 3 out of 5 nearby houses are red, the newcomer is classified as red.
Increase k and you consider a wider neighborhood — opinions become more stable but may blur boundaries. Decrease k and you zoom in tight — opinions are razor-sharp but one noisy neighbor can swing the vote.
#Measuring Distance: The Euclidean Formula
Before we can find the nearest neighbors, we need to define nearness. The most common choice is Euclidean distance — the straight-line distance you'd measure with a ruler.
For two points A = (x1, y1) and B = (x2, y2), the formula is:
`` distance = sqrt( (x2 - x1)² + (y2 - y1)² ) ``
In plain words: subtract each coordinate, square the difference (so negatives become positive), add them up, then take the square root to get back to real-world units. It's just the Pythagorean theorem — the distance formula from geometry class.
import math
def euclidean_distance(a, b):
"""Distance between two points (tuples of any length)."""
return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
print(euclidean_distance((1, 2), (4, 6))) # classic 3-4-5 triangle
print(euclidean_distance((0, 0), (1, 1)))#Building k-NN from Scratch
Let's build a complete k-NN classifier in pure Python — no external libraries. We'll classify fruits based on two features: sweetness (1–10) and crunchiness (1–10). Each fruit in our training set is labeled "apple" or "orange".
import math
# Training data: (sweetness, crunchiness, label)
training = [
(7, 9, "apple"),
(6, 8, "apple"),
(8, 7, "apple"),
(9, 2, "orange"),
(8, 3, "orange"),
(7, 1, "orange"),
]
def euclidean_distance(a, b):
return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
def knn_predict(query, data, k=3):
# Step 1: compute distance from query to every training point
distances = []
for *features, label in data:
d = euclidean_distance(query, features)
distances.append((d, label))
# Step 2: sort by distance, keep the k closest
distances.sort(key=lambda x: x[0])
neighbors = distances[:k]
# Step 3: majority vote
votes = {}
for _, label in neighbors:
votes[label] = votes.get(label, 0) + 1
return max(votes, key=votes.get), neighbors
# Classify a mystery fruit: sweetness=7, crunchiness=8
prediction, neighbors = knn_predict((7, 8), training, k=3)
print("Prediction:", prediction)
print("Nearest neighbors:")
for dist, label in neighbors:
print(f" {label} (distance: {dist:.2f})")The Interactive Visualizer
The animation on this page lets you click to place a new point and watch the circles expand outward until they touch k neighbors. Try dragging k up and down to see how the decision boundary changes — it's the fastest way to build intuition for why k matters.
#The Effect of k: Small vs Large
The choice of k is the most important decision in k-NN:
- Small k (e.g., k=1): The model listens only to the single nearest neighbor. It draws very jagged, irregular boundaries. One mislabeled or noisy training point can flip a prediction. The model has high variance — it wiggles a lot.
- Large k (e.g., k=20): The model considers a wide vote. Boundaries become smooth and stable. But if k is too large, it starts ignoring real local patterns and can underfit — it just predicts the most common class everywhere.
- Sweet spot: Try odd values (to avoid ties) in the middle — often somewhere between 3 and 15. Use cross-validation to tune it.
# Same data, different k — watch the prediction flip or hold
for k in [1, 3, 5]:
pred, nbrs = knn_predict((8, 4), training, k=k)
labels = [lbl for _, lbl in nbrs]
print(f"k={k}: neighbors={labels} -> predicted: {pred}")Feature Scaling is Critical
If one feature is measured in kilometers and another in millimeters, the large-scale feature will dominate the distance calculation and drown out the small-scale one. Always normalize or standardize your features before using k-NN so that each dimension contributes fairly to the distance. This is the #1 mistake beginners make with k-NN.
#Why 'Lazy' Is Actually a Superpower (and a Weakness)
Lazy learning means k-NN skips the training phase entirely — it just stores the data. This has real advantages:
- Instant updates: Add new training examples any time. The model is always current.
- No assumptions: k-NN doesn't assume the data is linear, normally distributed, or any particular shape. It adapts to whatever structure the data has.
- Works out of the box on surprisingly complex decision boundaries.
But there's a cost:
- Slow at prediction time: You must compute distances to every training point for every new query. With millions of points, this gets expensive. (Libraries like scikit-learn use clever data structures like KD-trees to speed this up.)
- Memory hungry: The entire training set must live in memory.
k-NN in the Real World
k-NN powers recommendation systems ("customers who bought items similar to yours also bought..."), image recognition, anomaly detection, and medical diagnosis. In scikit-learn, it's just from sklearn.neighbors import KNeighborsClassifier — but now you know exactly what's happening inside that one line.
You train a k-NN classifier with k=1 and get perfect accuracy on the training data, but poor accuracy on new test data. What is most likely happening?
Key takeaways
- k-NN classifies a new point by finding the k closest training examples and taking a majority vote — no math beyond counting and distance.
- Euclidean distance is the measuring tape: subtract coordinates, square, sum, square-root.
- Small k = high variance (noisy, jagged boundaries); large k = high bias (smooth but possibly wrong). Tune k with cross-validation.
- k-NN is a lazy learner: zero training time, but prediction scans the whole dataset — great for small data, slow for huge datasets.
- Always normalize your features before using k-NN, or a single large-scale feature will dominate all distance calculations.
The square is a new point. It's classified by a majority vote of its 3 nearest neighbors. Click anywhere to move it.
This is the Euclidean distance helper from the lesson. What does it print?
import math
def euclidean_distance(a, b):
return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
print(euclidean_distance((0, 0), (6, 8)))This k-NN classifier is supposed to keep the k CLOSEST neighbors, but it predicts the wrong label. What's wrong?
def knn_predict(query, data, k=3):
distances = []
for *features, label in data:
d = euclidean_distance(query, features)
distances.append((d, label))
distances.sort(key=lambda x: x[0], reverse=True)
neighbors = distances[:k]
votes = {}
for _, label in neighbors:
votes[label] = votes.get(label, 0) + 1
return max(votes, key=votes.get)Complete the two core steps of k-NN prediction: after computing distances, sort them ascending, then keep only the k nearest neighbors.
def knn_predict(query, data, k=3): distances = [] for *features, label in data: d = euclidean_distance(query, features) distances.append((d, label)) distances.sort(key=lambda x: x[0]) neighbors = distances[:] votes = {} for _, label in neighbors: votes[label] = votes.get(label, 0) + 1 return max(votes, key=votes.get)
Put these lines in the correct order to make a k-NN prediction: compute every distance, sort by nearest, keep the k closest, tally the votes, and return the winning label.
distances.sort(key=lambda x: x[0])
return max(votes, key=votes.get)
distances = [(euclidean_distance(query, feat), lbl) for *feat, lbl in data]
votes = {}neighbors = distances[:k]
for _, label in neighbors:
votes[label] = votes.get(label, 0) + 1You have a small dataset of animals described by two features: weight (kg) and speed (km/h). Each animal is labeled "predator" or "prey".
Complete the knn_classify function below so that it: 1. Computes the Euclidean distance from the query point to every point in data. 2. Sorts the results and picks the k nearest neighbors. 3. Returns the majority-vote label.
Then call it to classify a mystery animal with weight=15 and speed=40, using k=3.
Try it live — edit the code and hit Run to execute real Python: