Overfitting & Underfitting
Discover why a model that aces its practice tests can still fail the real exam — and how to find the sweet spot that actually generalizes.
Picture three students preparing for a history exam. The first barely studied: all they remember is "wars happen and things change", so they get nearly everything wrong (underfitting). The second memorized every question from last year's practice paper, word for word. They ace the practice test, but the real exam phrases things slightly differently and they are lost (overfitting). The third learned the actual patterns (causes, key players, consequences) and can answer questions they have never seen before. That is the goal of every machine learning model: generalization.
#Two Ways a Model Can Fail
When we train a model, we want it to learn the underlying pattern — not just the specific examples it was shown. There are two opposite failure modes:
- Underfitting (high bias) — the model is too simple. It misses real patterns that exist in the data. Think of describing a curvy mountain road using only a perfectly straight line. No matter how much data you give it, a too-simple model can't represent the real structure.
- Overfitting (high variance) — the model is too complex. It learns the training data too well, including all the random noise and quirks that won't appear again in new data. Change a few training points and the model changes dramatically.
Both failures hurt real-world usefulness. The sweet spot in between — low bias and low variance — is called good generalization. The tension between these goals is the bias-variance tradeoff, one of the most fundamental ideas in machine learning.
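To make bias and variance concrete, here is a minimal sketch rather than a definitive experiment. It assumes a true pattern of y = x^2 with Gaussian noise, redraws the training set 200 times, and compares two deliberately extreme models at a single query point. The helper names (`fit_constant`, `fit_memorizer`, `bias_and_variance`) and every constant are illustrative assumptions, not part of any library.

```python
import random
import statistics

def true_fn(x):
    # Assumed ground truth for this illustration: y = x^2
    return x ** 2

def sample_training_set(rng, noise_sd=4.0):
    # Same x grid on every draw; only the noise changes
    return [(x, true_fn(x) + rng.gauss(0, noise_sd)) for x in range(11)]

def fit_constant(train):
    # Too simple: ignores x and always predicts the training mean
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_memorizer(train):
    # Too flexible: returns the stored y of the nearest training x
    def predict(x):
        return min(train, key=lambda point: abs(point[0] - x))[1]
    return predict

def bias_and_variance(fit, x0, n_draws=200, seed=0):
    # Refit on fresh training sets and study the predictions at x0
    rng = random.Random(seed)
    preds = [fit(sample_training_set(rng))(x0) for _ in range(n_draws)]
    bias = statistics.mean(preds) - true_fn(x0)
    variance = statistics.pvariance(preds)
    return bias, variance

const_bias, const_var = bias_and_variance(fit_constant, x0=9)
memo_bias, memo_var = bias_and_variance(fit_memorizer, x0=9)
print(f"constant : bias={const_bias:8.2f}  variance={const_var:6.2f}")
print(f"memorizer: bias={memo_bias:8.2f}  variance={memo_var:6.2f}")
```

Under these assumptions the constant model should show a large bias and a small variance, while the memorizer shows the reverse: it is right on average but swings with every redraw of the noise.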
The Tailor Analogy
An underfitter makes one coat in a single "medium" size — it fits nobody well. An overfitter measures every wrinkle on one specific customer and makes a coat that fits only that person perfectly. A good tailor captures the handful of measurements that matter for any customer — shoulders, chest, waist, length. Not too few, not too many.
#The Training Error vs. Test Error Curve
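One of the most useful pictures in model evaluation plots error against model complexity. As complexity increases:
- Training error keeps falling, or at least never rises when each more complex model can represent everything the simpler one could. A complex enough model can memorize anything.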
- Test error (data the model never saw) follows a U-shape: starts high (underfitting), drops to a sweet spot, then rises again (overfitting).
The gap between training error and test error is the generalization gap. When that gap is large, your model has memorized rather than learned.
# Simulating the classic training vs. test error pattern
complexity = [1, 2, 3, 4, 5, 6, 7, 8]
train_err = [4.8, 1.2, 0.6, 0.3, 0.15, 0.08, 0.04, 0.01]
test_err = [5.1, 1.4, 1.0, 1.1, 1.3, 1.8, 2.5, 4.2]

print(f"{'Complexity':<12} {'Train Err':<12} {'Test Err':<12} {'Gap':<10} {'Status'}")
print("-" * 60)
for c, tr, te in zip(complexity, train_err, test_err):
    gap = te - tr
    if c <= 2:
        status = "Underfitting"
    elif 2 < c <= 4:
        status = "Sweet spot!"
    else:
        status = "Overfitting"
    print(f"{c:<12} {tr:<12.2f} {te:<12.2f} {gap:<10.2f} {status}")

A Perfect Training Score Is a Red Flag
If your model scores 100% on training data, do not celebrate — be suspicious. Real data always has noise. A model that fits it perfectly has fit the noise, not the signal. The honest score is always on data the model has never seen. Training accuracy alone tells you almost nothing.
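The red flag can be reproduced with a model that memorizes by construction. The sketch below is an illustration under stated assumptions, not a benchmark: it uses a from-scratch k-nearest-neighbors regressor (a swapped-in technique, chosen because smaller `k` means a more flexible model), synthetic y = x^2 data with noise, and a hypothetical 150/50 split. The names `knn_predict` and `knn_mse` are made up for this example.

```python
import random

def knn_predict(train, x, k):
    # Average the y values of the k training points closest to x
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def knn_mse(data, train, k):
    # Mean squared error of the k-NN model (built on `train`) over `data`
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

# Assumed synthetic data: y = x^2 plus Gaussian noise, x from 0 to 9.95
rng = random.Random(3)
points = [(i * 0.05, (i * 0.05) ** 2 + rng.gauss(0, 4)) for i in range(200)]
rng.shuffle(points)
train, test = points[:150], points[150:]

errors = {}
for k in (150, 25, 5, 1):  # smaller k = more flexible model
    errors[k] = (knn_mse(train, train, k), knn_mse(test, train, k))
    train_mse, test_mse = errors[k]
    print(f"k={k:<4} train MSE={train_mse:8.2f}  test MSE={test_mse:8.2f}")
```

With `k=1` every training point is its own nearest neighbor, so the training MSE is exactly zero, yet the test MSE is not. With `k=150` the model predicts the training mean everywhere, the underfit extreme. Where the best test error falls in between depends on the seed and the noise level, so treat the printed numbers as one sample rather than a rule.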
#How to Fix Each Problem
Fighting underfitting (model too simple):
- Use a more expressive model (deeper tree, higher polynomial degree, neural network)
- Engineer better or more informative features
- Train longer if the algorithm is iterative
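As one possible illustration of the feature-engineering fix, the sketch below fits the same straight-line model twice on assumed y = x^2 data: once on the raw input and once on a squared feature. The helpers `fit_line` and `train_mse` are hypothetical names for this example; the fit uses the standard one-variable least-squares formulas.

```python
import random

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b with a single input column
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def train_mse(feature, data):
    # Transform x with `feature`, fit a line, and report the training MSE
    xs = [feature(x) for x, _ in data]
    ys = [y for _, y in data]
    a, b = fit_line(xs, ys)
    return sum((y - (a * f + b)) ** 2 for f, y in zip(xs, ys)) / len(ys)

# Assumed data: true pattern y = x^2 with Gaussian noise
rng = random.Random(7)
data = [(x, x ** 2 + rng.gauss(0, 4)) for x in range(1, 25)]

raw_mse = train_mse(lambda x: x, data)            # straight line in x
squared_mse = train_mse(lambda x: x ** 2, data)   # straight line in x^2
print(f"Line on raw x:     train MSE = {raw_mse:.1f}")
print(f"Line on x squared: train MSE = {squared_mse:.1f}")
```

The model class is unchanged in both runs; only the feature differs. A line in x cannot bend to follow the curve, while a line in x^2 can, which is why a high training error that refuses to drop often points to missing features rather than missing data.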
Fighting overfitting (model too complex):
- Get more training data: often the most effective fix, because more data makes noise harder to memorize
- Simplify the model: fewer parameters, shallower tree, lower polynomial degree
- Regularization: add a penalty to the loss function for large model weights (L1 / L2), discouraging wildly complex behavior
- Cross-validation: rotate which held-out chunk is the test set and average the scores for a more reliable picture of true performance. This detects overfitting rather than curing it, and it helps you choose among the other fixes
- Dropout (neural networks): randomly disable neurons during training so no single path gets over-relied on
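The regularization idea from the list above can be sketched for the simplest case, a one-feature linear model with an L2 penalty on the slope. This is a hand-rolled illustration, not a library implementation: `fit_ridge_line`, the penalty strength `lam`, and the four data points are all made up for the example, and leaving the intercept unpenalized is a common convention assumed here.

```python
def fit_ridge_line(data, lam):
    # Minimizes sum((y - (a*x + b))**2) + lam * a**2.
    # The intercept b is not penalized (an assumption of this sketch).
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / (sxx + lam)   # the penalty enlarges the denominator, shrinking the slope
    b = mean_y - a * mean_x
    return a, b

# Tiny made-up training set, small enough that noise could dominate the fit
data = [(1, 2.9), (2, 7.4), (3, 8.1), (4, 12.6)]

for lam in (0.0, 1.0, 10.0):
    a, b = fit_ridge_line(data, lam)
    print(f"lam={lam:5.1f}  slope={a:.3f}  intercept={b:.3f}")
```

With `lam=0` this reduces to ordinary least squares. As `lam` grows, the slope is pulled toward zero, trading a little bias for lower variance. In practice `lam` is usually chosen by comparing held-out scores rather than fixed in advance.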
Cross-Validation in One Line
Instead of one fixed train/test split, k-fold cross-validation divides your data into k chunks (say 5), trains on 4 and tests on the remaining 1, repeats this 5 times so that each chunk serves as the test set exactly once, and averages the scores. In scikit-learn: `cross_val_score(model, X, y, cv=5)`. The averaged score is more trustworthy than any single split because every part of your data gets used for testing.
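To show what the fold rotation does, here is a from-scratch sketch in plain Python. It is not scikit-learn's implementation: `k_fold_scores`, `fit_mean`, and `mse` are hypothetical helpers, the data is assumed to be a list of `(x, y)` pairs, and the score is MSE (lower is better), which may differ from the default score a library reports.

```python
import random

def k_fold_scores(data, fit, score, k=5, seed=0):
    # Shuffle once, then deal the points into k roughly equal folds
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        model = fit(train)
        scores.append(score(model, held_out))
    return scores

def fit_mean(train):
    # Simplest possible model: always predict the training mean
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def mse(model, data):
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

# Assumed data: noisy quadratic, 24 points
rng = random.Random(7)
dataset = [(x, x ** 2 + rng.gauss(0, 4)) for x in range(1, 25)]

scores = k_fold_scores(dataset, fit_mean, mse, k=5)
print("Per-fold MSE:", [round(s, 1) for s in scores])
print(f"Mean MSE over folds: {sum(scores) / len(scores):.1f}")
```

The spread between the per-fold scores is informative in itself: if they differ widely, a single train/test split could have landed on a lucky or unlucky chunk.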
#Seeing It in Code: An Honest Train/Test Split
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    random.seed(seed)
    shuffled = data[:]
    random.shuffle(shuffled)
    split = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:split], shuffled[split:]

def mean_squared_error(actuals, preds):
    return sum((a - p)**2 for a, p in zip(actuals, preds)) / len(actuals)

# Noisy quadratic data: true pattern is y = x^2
random.seed(7)
dataset = [(x, x**2 + random.gauss(0, 4)) for x in range(1, 25)]
train, test = train_test_split(dataset, test_fraction=0.25)

# Most underfit model: predict the training mean for everyone
mean_y = sum(y for _, y in train) / len(train)
train_mse = mean_squared_error([y for _, y in train], [mean_y]*len(train))
test_mse = mean_squared_error([y for _, y in test], [mean_y]*len(test))

print(f"Training points: {len(train)}, Test points: {len(test)}")
print(f"Constant-prediction train MSE: {train_mse:.1f}")
print(f"Constant-prediction test MSE: {test_mse:.1f}")
print(f"Generalization gap: {test_mse - train_mse:.1f}")

A model scores 97% accuracy on training data but only 59% on test data. What is most likely happening, and what is the best first fix to try?
Key takeaways
- Underfitting (high bias) means the model is too simple — it misses real patterns even in training data.
- Overfitting (high variance) means the model memorized training noise — it fails on new data despite a high training score.
- The generalization gap (test error minus training error) is your most honest signal: a large gap means overfitting.
- More data, simpler models, and regularization are the primary tools for fighting overfitting.
- Always evaluate on held-out data the model never saw during training — training accuracy alone tells you almost nothing.
A too-simple model misses the pattern; a too-complex one wiggles through every noisy point and won't generalize. The sweet spot is in the middle.
Using the lesson's train-vs-test framing, this prints the generalization gap for each model. What does it print?
train_err = [0.60, 0.15, 0.01]
test_err = [1.00, 1.30, 4.20]
for tr, te in zip(train_err, test_err):
    gap = te - tr
    print(f"gap: {gap:.2f}")

This code has a bug. What's wrong?
train, test = train_test_split(dataset, test_fraction=0.25)
model = fit(train)
# Report how good the model is
preds = [model.predict(x) for x, _ in train]
score = mean_squared_error([y for _, y in train], preds)
print(f"Model MSE: {score:.2f}")

Complete the scikit-learn call from the lesson so it runs 5-fold cross-validation. Fill in the argument value that sets the number of folds.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=___)
print(scores.mean())
Put these lines in the correct order to honestly measure the generalization gap, following the lesson's from-scratch approach.
model = fit(train)
train_mse = mean_squared_error([y for _, y in train], [model(x) for x, _ in train])
train, test = train_test_split(dataset, test_fraction=0.25)
print(f"Generalization gap: {test_mse - train_mse:.1f}")
test_mse = mean_squared_error([y for _, y in test], [model(x) for x, _ in test])
Implement two model functions and an evaluation harness. Write `fit_constant(train)` that returns the mean y of the training set (the most underfit model possible). Write `fit_linear(train)` that fits `y = ax + b` using the least-squares formula. Write `evaluate(dataset, model_fn, label)` that splits 80/20, fits the model on train, predicts on both train and test, and prints train MSE, test MSE, and the gap. Run both models on a dataset where `y = 2x + 3 + noise` and observe how the linear model dramatically reduces both errors and the gap.