
Overfitting & Underfitting

Discover why a model that aces its practice tests can still fail the real exam — and how to find the sweet spot that actually generalizes.

Picture three students preparing for a history exam. The first barely studied and only remembers that "wars happen and things change", so they get nearly everything wrong (underfitting). The second memorized every question from last year's practice paper, word for word. They ace the practice test, but the real exam asks things slightly differently and they're lost (overfitting). The third learned the actual patterns (causes, key players, consequences) and can answer questions they've never seen before. That's the goal of every machine learning model: generalization.

#Two Ways a Model Can Fail

When we train a model, we want it to learn the underlying pattern — not just the specific examples it was shown. There are two opposite failure modes:

  • Underfitting (high bias) — the model is too simple. It misses real patterns that exist in the data. Think of describing a curvy mountain road using only a perfectly straight line. No matter how much data you give it, a too-simple model can't represent the real structure.
  • Overfitting (high variance) — the model is too complex. It learns the training data too well, including all the random noise and quirks that won't appear again in new data. Change a few training points and the model changes dramatically.

Both failures hurt real-world usefulness. The sweet spot in between — low bias and low variance — is called good generalization. The tension between these goals is the bias-variance tradeoff, one of the most fundamental ideas in machine learning.
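
To make "high variance" concrete, here is a minimal sketch that trains two extreme models on two noisy samples of the same pattern and measures how much their predictions shift. The data, the function names, and the nearest-point "memorizer" are assumptions chosen for illustration, not part of any library.

```python
import random

def true_fn(x):
    return x ** 2  # assumed underlying pattern for this illustration

def noisy_sample(seed, xs, noise_sd=4.0):
    """Draw one noisy training set from the same underlying pattern."""
    rng = random.Random(seed)
    return [(x, true_fn(x) + rng.gauss(0, noise_sd)) for x in xs]

def fit_mean(train):
    """Too simple: ignore x and always predict the average y (high bias)."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_memorizer(train):
    """Too flexible: return the y of the closest training x (high variance)."""
    def predict(x):
        _, nearest_y = min(train, key=lambda point: abs(point[0] - x))
        return nearest_y
    return predict

def avg_prediction_shift(fit, sample_a, sample_b, xs):
    """Average change in predictions when the training sample is swapped."""
    model_a, model_b = fit(sample_a), fit(sample_b)
    return sum(abs(model_a(x) - model_b(x)) for x in xs) / len(xs)

xs = list(range(1, 25))
sample_a = noisy_sample(1, xs)  # same pattern, different noise
sample_b = noisy_sample(2, xs)

mean_shift = avg_prediction_shift(fit_mean, sample_a, sample_b, xs)
memo_shift = avg_prediction_shift(fit_memorizer, sample_a, sample_b, xs)

print(f"Mean predictor shift: {mean_shift:.2f}")
print(f"Memorizer shift:      {memo_shift:.2f}")
```

The mean predictor barely moves between samples, but it also ignores the curve entirely (bias). The memorizer reproduces every noisy point, so its predictions move whenever the noise does (variance).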

Think of it like

The Tailor Analogy

An underfitter makes one coat in a single "medium" size — it fits nobody well. An overfitter measures every wrinkle on one specific customer and makes a coat that fits only that person perfectly. A good tailor captures the handful of measurements that matter for any customer — shoulders, chest, waist, length. Not too few, not too many.

#The Training Error vs. Test Error Curve

Here is the most important chart in all of model evaluation. As model complexity increases:

  • Training error generally keeps going down. A complex enough model can memorize its entire training set.
  • Test error (data the model never saw) follows a U-shape: starts high (underfitting), drops to a sweet spot, then rises again (overfitting).

The gap between training error and test error is the generalization gap. When that gap is large, your model has memorized rather than learned.
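
As a rough illustration of a model that has "memorized rather than learned", the sketch below builds a lookup table over the training points and measures the generalization gap. The dataset (y = 3x plus noise), the 30/10 split, and the nearest-neighbor fallback for unseen inputs are all assumptions made for this example.

```python
import random

rng = random.Random(0)
# Hypothetical noisy data: y = 3x + noise
points = [(x, 3 * x + rng.gauss(0, 2)) for x in range(40)]
rng.shuffle(points)
train, test = points[:30], points[30:]

def mse(pairs, predict):
    """Mean squared error of `predict` over (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in pairs) / len(pairs)

# A lookup table that stores every training point exactly, noise included
memory = dict(train)

def memorizer(x):
    if x in memory:
        return memory[x]  # seen before: repeat the stored answer
    # Unseen x: fall back to the answer stored for the nearest seen x
    nearest_x = min(memory, key=lambda seen: abs(seen - x))
    return memory[nearest_x]

train_mse = mse(train, memorizer)
test_mse = mse(test, memorizer)
gap = test_mse - train_mse

print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE:  {test_mse:.2f}")
print(f"Gap:       {gap:.2f}")
```

The training error is exactly zero because every training answer is stored verbatim, so the whole test error shows up as gap.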

In the illustrative numbers below, training error keeps falling as complexity grows. Test error bottoms out around complexity 3-4, then rises as the model starts memorizing noise.

# Hand-picked numbers illustrating the classic training vs. test error pattern
complexity = [1,   2,    3,    4,    5,    6,    7,    8]
train_err  = [4.8, 1.2,  0.6,  0.3,  0.15, 0.08, 0.04, 0.01]
test_err   = [5.1, 1.4,  1.0,  1.1,  1.3,  1.8,  2.5,  4.2]

print(f"{'Complexity':<12} {'Train Err':<12} {'Test Err':<12} {'Gap':<10} {'Status'}")
print("-" * 60)
for c, tr, te in zip(complexity, train_err, test_err):
    gap = te - tr
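    # Labels are assigned by hand from the complexity level for illustration;
    # they are not computed from the errors.
    if c <= 2:
        status = "Underfitting"
    elif c <= 4:
        status = "Sweet spot!"
    else:
        status = "Overfitting"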
    print(f"{c:<12} {tr:<12.2f} {te:<12.2f} {gap:<10.2f} {status}")

Common mistake

A Perfect Training Score Is a Red Flag

If your model scores 100% on training data, do not celebrate; be suspicious. Real data almost always contains noise, so a model that fits it perfectly has probably fit the noise along with the signal. The honest score is the one measured on data the model has never seen. Training accuracy alone tells you very little about real-world performance.

#How to Fix Each Problem

Fighting underfitting (model too simple):

  • Use a more expressive model (deeper tree, higher polynomial degree, neural network)
  • Engineer better or more informative features
  • Train longer if the algorithm is iterative

Fighting overfitting (model too complex):

  • Get more training data: often the most effective fix, because more data makes noise harder to memorize
  • Simplify the model: fewer parameters, shallower tree, lower polynomial degree
  • Regularization: add a penalty to the loss function for large model weights (L1 / L2), discouraging wildly complex behavior
  • Cross-validation: rotate which held-out chunk is the test set and average the scores for a more reliable picture of true performance. This detects overfitting rather than curing it, so use it to check whether the other fixes are working
  • Dropout (neural networks): randomly disable neurons during training so no single path gets over-relied on
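
To show what an L2 penalty does to a weight, here is a deliberately tiny sketch: a one-weight model y = w·x with no intercept, fitted by minimizing squared error plus lam · w². For this special case the minimizer has the closed form w = Σxy / (Σx² + lam). The data and the name fit_ridge_slope are assumptions for illustration.

```python
import random

rng = random.Random(3)
# Hypothetical data: y = 2x + noise, on a handful of points
data = [(x, 2 * x + rng.gauss(0, 1)) for x in [1, 2, 3, 4, 5]]

def fit_ridge_slope(data, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)  # lam = 0 gives plain least squares

lams = [0.0, 1.0, 10.0, 100.0]
weights = [fit_ridge_slope(data, lam) for lam in lams]

for lam, w in zip(lams, weights):
    print(f"lambda = {lam:>6.1f} -> w = {w:.3f}")
```

A larger penalty pulls the weight toward zero. A little shrinkage can damp the effect of noise, while too much pushes the model back toward underfitting.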

Tip

Cross-Validation in One Line

Instead of one fixed train/test split, k-fold cross-validation divides your data into k chunks (say 5), trains on 4 and tests on the remaining 1, then repeats until each chunk has served as the test set once (5 rounds in total) and averages the scores. In scikit-learn: cross_val_score(model, X, y, cv=5). The averaged score is generally more trustworthy than any single split because every part of your data gets used for testing exactly once.
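
To see the mechanics behind that one-liner, here is a from-scratch sketch of k-fold cross-validation in plain Python. It is a conceptual illustration, not scikit-learn's implementation: the helper names, the shuffling, and the choice of MSE as the score are assumptions, and scikit-learn's defaults may differ.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_mse(data, fit, k=5):
    """Return one test MSE per fold; `fit` maps training pairs to a predict function."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [p for i, p in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        predict = fit(train)  # the model never sees the held-out fold
        fold_mse = sum((y - predict(x)) ** 2 for x, y in test) / len(test)
        scores.append(fold_mse)
    return scores

def fit_mean(train):
    """Baseline model: always predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

rng = random.Random(7)
data = [(x, x ** 2 + rng.gauss(0, 4)) for x in range(1, 26)]

scores = cross_val_mse(data, fit_mean, k=5)
print("Per-fold MSE:", [round(s, 1) for s in scores])
print(f"Average MSE:  {sum(scores) / len(scores):.1f}")
```

The spread of the per-fold scores is informative too: if they vary wildly, a single train/test split could easily have given a misleading number.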

#Seeing It in Code: An Honest Train/Test Split

A constant prediction is maximally underfit: both errors are high. Even here the train and test errors usually differ somewhat, and the gap typically grows much larger when a complex model memorizes noise.

import random

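def train_test_split(data, test_fraction=0.2, seed=42):
    # Use a local generator so splitting does not reseed the global RNG
    rng = random.Random(seed)
    shuffled = data[:]  # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    split = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:split], shuffled[split:]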

def mean_squared_error(actuals, preds):
    return sum((a - p)**2 for a, p in zip(actuals, preds)) / len(actuals)

# Noisy quadratic data — true pattern is y = x^2
random.seed(7)
dataset = [(x, x**2 + random.gauss(0, 4)) for x in range(1, 25)]
train, test = train_test_split(dataset, test_fraction=0.25)

# Most underfit model: predict the training mean for everyone
mean_y = sum(y for _, y in train) / len(train)
train_mse = mean_squared_error([y for _, y in train], [mean_y]*len(train))
test_mse  = mean_squared_error([y for _, y in test],  [mean_y]*len(test))

print(f"Training points: {len(train)}, Test points: {len(test)}")
print(f"Constant-prediction train MSE: {train_mse:.1f}")
print(f"Constant-prediction test  MSE: {test_mse:.1f}")
print(f"Generalization gap:            {test_mse - train_mse:.1f}")

Quick check

A model scores 97% accuracy on training data but only 59% on test data. What is most likely happening, and what is the best first fix to try?

Key takeaways

  • Underfitting (high bias) means the model is too simple — it misses real patterns even in training data.
  • Overfitting (high variance) means the model memorized training noise — it fails on new data despite a high training score.
  • The generalization gap (test error minus training error) is your most honest signal: a large gap means overfitting.
  • More data, simpler models, and regularization are the primary tools for fighting overfitting.
  • Always evaluate on held-out data the model never saw during training — training accuracy alone tells you almost nothing.

A too-simple model misses the pattern; a too-complex one wiggles through every noisy point and won't generalize. The sweet spot is in the middle.

Practice challenges
Predict the output #1

Using the lesson's train-vs-test framing, this prints the generalization gap for each model. What does it print?

train_err = [0.60, 0.15, 0.01]
test_err  = [1.00, 1.30, 4.20]

for tr, te in zip(train_err, test_err):
    gap = te - tr
    print(f"gap: {gap:.2f}")

Fix the bug #2

This code has a bug — what's wrong?

train, test = train_test_split(dataset, test_fraction=0.25)

model = fit(train)

# Report how good the model is
preds = [model.predict(x) for x, _ in train]
score = mean_squared_error([y for _, y in train], preds)
print(f"Model MSE: {score:.2f}")

Fill in the blank #3

Complete the scikit-learn call from the lesson so it runs 5-fold cross-validation. Fill in the argument value that sets the number of folds.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=____)
print(scores.mean())

Reorder the lines #4

Put these lines in the correct order to honestly measure the generalization gap, following the lesson's from-scratch approach.

1. model = fit(train)
2. train_mse = mean_squared_error([y for _, y in train], [model(x) for x, _ in train])
3. train, test = train_test_split(dataset, test_fraction=0.25)
4. print(f"Generalization gap: {test_mse - train_mse:.1f}")
5. test_mse  = mean_squared_error([y for _, y in test],  [model(x) for x, _ in test])

Your turn
Practice exercise

Implement two model functions and an evaluation harness. Write fit_constant(train) that returns the mean y of the training set (the most underfit model possible). Write fit_linear(train) that fits y = ax + b using the least-squares formula. Write evaluate(dataset, model_fn, label) that splits 80/20, fits the model on train, predicts on both train and test, and prints train MSE, test MSE, and the gap. Run both models on a dataset where y = 2x + 3 + noise and observe how the linear model sharply reduces both errors, then compare the two gaps.
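
For reference, the lesson does not spell out the least-squares formula. The standard closed-form solution for a line y = ax + b over n training points (x_i, y_i), writing x̄ and ȳ for the means of the training x and y values, is:

```latex
a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
```

Turning this into fit_linear is left as part of the exercise. Note that the denominator is zero if every training x is identical, a case your code may want to guard against.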
