Model Quality · Intermediate · 10 min · 11 / 13

Evaluating Models

Learn how to tell whether a machine learning model is actually good — and why a naive 99% accuracy score can be a complete lie.

Imagine you build a model to detect credit card fraud. You train it, run it on test data, and it scores 99% accuracy. Champagne time, right?

Not so fast. If only 1% of transactions are actually fraudulent, a model that labels every single transaction as "not fraud" would also score 99% — and it would catch zero fraudsters. That model is completely useless, yet the number looks great.

This is one of the most important lessons in machine learning: the metric you use to judge a model matters enormously. Accuracy is the simplest metric (the fraction of predictions that were correct), and it works fine when classes are roughly balanced. But when one class is much rarer (a class imbalance), accuracy becomes misleading, because a model can score highly just by always predicting the majority class. Fraud detection, disease diagnosis, and spam filtering are all classic imbalanced problems that need better tools.
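
To make the fraud scenario concrete, here is a minimal sketch using made-up data: 1,000 hypothetical transactions, 10 of them fraudulent, and a "model" that always answers "not fraud". The numbers are illustrative assumptions, not real data.

```python
# Hypothetical data: 1,000 transactions, 10 of them fraudulent (1% positive).
true_labels = [1] * 10 + [0] * 990

# A "model" that labels every transaction as not fraud (0).
predictions = [0] * len(true_labels)

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)

# How many real frauds did it actually flag?
frauds_caught = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels))

print(f"Accuracy: {accuracy:.1%}")        # prints: Accuracy: 99.0%
print(f"Frauds caught: {frauds_caught}")  # prints: Frauds caught: 0
```

The score looks excellent even though the model never flags anything.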

A simple loop is all you need to compute accuracy yourself.
# Computing accuracy from scratch — no libraries needed
predictions = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
true_labels  = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)
print(f"Accuracy: {accuracy:.1%}")  # prints: Accuracy: 80.0%

# The Confusion Matrix: Seeing What Really Happened

To see past accuracy, we break predictions into four buckets:

  • True Positive (TP) — Model said YES, and it really was YES. (Caught a real fraudster!)
  • True Negative (TN) — Model said NO, and it really was NO. (Correctly cleared an innocent customer.)
  • False Positive (FP) — Model said YES, but it was actually NO. (Wrongly flagged a good transaction.)
  • False Negative (FN) — Model said NO, but it was actually YES. (Missed a real fraud — often the worst mistake.)

Arranging these four numbers in a grid gives you the confusion matrix — a complete picture of exactly where your model succeeds and fails.

Think of it like

The Smoke Alarm Analogy

Think of a smoke alarm. A False Positive is a burnt-toast alarm — annoying, but safe. A False Negative is a real fire with no alarm — potentially catastrophic.

Depending on the context, one type of error is far worse than the other. Cancer screening should almost never produce False Negatives (missed cancers). A spam filter should avoid too many False Positives (deleting real emails). The confusion matrix shows you which errors you're making, not just how many.

Four counts summarize every prediction your binary classifier made.
def confusion_matrix(y_true, y_pred):
    TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return TP, TN, FP, FN

y_true = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
TP, TN, FP, FN = confusion_matrix(y_true, y_pred)
print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")  # prints: TP=3, TN=5, FP=1, FN=1
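
As a rough sketch of the grid layout, the helper below arranges the four counts with actual classes as rows and predicted classes as columns. The name `format_confusion_matrix` is hypothetical (it is not from any library), and the counts are the ones the example above produces.

```python
def format_confusion_matrix(tp, tn, fp, fn):
    """Return the four counts as a small text grid (rows: actual, columns: predicted)."""
    rows = [
        ("", "Predicted 1", "Predicted 0"),
        ("Actual 1", f"TP={tp}", f"FN={fn}"),
        ("Actual 0", f"FP={fp}", f"TN={tn}"),
    ]
    # Pad each cell to a fixed width so the columns line up.
    return "\n".join(f"{a:<10}{b:<13}{c:<13}".rstrip() for a, b, c in rows)

grid = format_confusion_matrix(tp=3, tn=5, fp=1, fn=1)
print(grid)
```

Reading along a row shows what happened to each actual class; reading down a column shows how trustworthy each kind of prediction was.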

# Precision and Recall: Two Different Questions

Precision asks: "Of everything flagged as positive, how many really were?" It is computed as TP / (TP + FP). High precision means fewer false alarms.

Recall asks: "Of all real positives, how many did the model catch?" It is computed as TP / (TP + FN). High recall means fewer missed cases.

These two metrics usually trade off against each other. Making a model more aggressive (flagging more cases as positive) tends to raise recall but lower precision. Being conservative does the opposite. The right balance depends on your problem. When you want one balanced number, the F1 score combines both as their harmonic mean: F1 = 2 * P * R / (P + R), where P is precision and R is recall.
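
The F1 formula is the harmonic mean of precision and recall, which stays low unless both values are reasonably high. The sketch below compares it with a plain average for a hypothetical, very lopsided model (the precision and recall values are made up for illustration).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical model: every flag is correct, but it catches only 10% of positives.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # prints: Arithmetic mean: 0.55
print(f"F1: {f1:.2f}")                            # prints: F1: 0.18
```

A plain average would hide the poor recall; F1 does not.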

Common mistake

You Cannot Maximize Both at Once

It is tempting to chase both precision = 1.0 and recall = 1.0 simultaneously. In practice, improving one almost always hurts the other. If you lower your model's decision threshold (flag more things), recall goes up but precision goes down — and vice versa. The F1 score helps you find a good middle ground, but always inspect which errors your model makes.
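
The threshold effect can be seen in a small sketch. It assumes the model outputs a score between 0 and 1 for each example and that anything at or above the threshold is flagged as positive; the scores, labels, and the helper name `precision_recall_at` are all invented for illustration.

```python
# Hypothetical model scores (higher means "more likely positive") and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
y_true = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

def precision_recall_at(threshold, scores, y_true):
    """Flag every score >= threshold as positive, then compute precision and recall."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Lowering the threshold flags more examples.
for threshold in (0.8, 0.5, 0.35):
    p, r = precision_recall_at(threshold, scores, y_true)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# prints:
# threshold=0.80  precision=1.00  recall=0.50
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.35  precision=0.67  recall=1.00
```

On this toy data, each step down in threshold buys recall at the cost of precision.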

Precision, recall, and F1 — three complementary views of classifier quality.
def metrics(TP, FP, FN):
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall    = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1

p, r, f1 = metrics(TP=3, FP=1, FN=1)
print(f"Precision: {p:.2f}  Recall: {r:.2f}  F1: {f1:.2f}")  # prints: Precision: 0.75  Recall: 0.75  F1: 0.75

# Train/Test Split, Cross-Validation, and Regression Metrics

Even perfect metrics are useless if you measure them on the training data. A model could memorize training examples and score 100% — then fail on anything new. This is overfitting.
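
A toy illustration of memorization, under stated assumptions: `MemorizingModel` is a made-up class that simply stores the training pairs and guesses 0 for anything it has not seen. It is not a real learning algorithm, only a sketch of why training-set scores can mislead.

```python
class MemorizingModel:
    """Toy model: looks up stored answers and guesses 0 for unseen inputs."""

    def __init__(self):
        self.memory = {}

    def fit(self, X, y):
        self.memory = dict(zip(X, y))

    def predict(self, X):
        return [self.memory.get(x, 0) for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical task: the label is 1 for odd numbers and 0 for even numbers.
train_X = list(range(0, 10))
train_y = [x % 2 for x in train_X]
test_X = list(range(10, 20))   # inputs the model has never seen
test_y = [x % 2 for x in test_X]

model = MemorizingModel()
model.fit(train_X, train_y)

train_acc = accuracy(train_y, model.predict(train_X))
test_acc = accuracy(test_y, model.predict(test_X))
print(f"Train accuracy: {train_acc:.0%}")  # prints: Train accuracy: 100%
print(f"Test accuracy:  {test_acc:.0%}")   # prints: Test accuracy:  50%
```

The model is perfect on data it has seen and no better than a coin flip on new data.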

The fix: a train/test split. Before training, set aside ~20% of your data as a held-out test set, and evaluate only on that. But if you keep tweaking the model to improve test-set scores, you are leaking information from the test set too. k-fold cross-validation reduces this problem: it splits the data into k parts (folds), trains k times with a different fold held out each time, and averages the k scores. Every example is used for both training and evaluation, so the estimate is more reliable than a single split and no data is wasted.
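
Both ideas can be sketched in plain Python. The functions below, `train_test_split` and `k_fold_indices`, are simplified illustrations written for this lesson (the names, the `seed` parameter, and the toy data are assumptions), not a library's implementation.

```python
import random

def train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle the examples, then hold out a fraction of them as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_X = [X[i] for i in train_idx]
    test_X = [X[i] for i in test_idx]
    train_y = [y[i] for i in train_idx]
    test_y = [y[i] for i in test_idx]
    return train_X, test_X, train_y, test_y

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    start = 0
    for fold in range(k):
        # Spread any remainder over the first few folds.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Toy data: ten examples.
X = list(range(10))
y = [x % 2 for x in X]

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
print(len(train_X), len(test_X))  # prints: 8 2

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("held out:", test_idx)
```

In a full cross-validation run you would train a fresh model on each `train_idx`, score it on the matching `test_idx`, and average the k scores. Shuffling the data before folding is usually sensible when the rows are stored in a meaningful order.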

For regression (predicting numbers, not categories), use:

  • MAE (Mean Absolute Error): the average absolute difference; easy to interpret
  • RMSE (Root Mean Squared Error): punishes large errors more harshly

Use MAE when outliers are common; use RMSE when big errors are especially costly.
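
A minimal sketch of both regression metrics, using made-up predictions chosen so that the total absolute error is the same in both cases. It shows how RMSE reacts when that error is concentrated in one large miss.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [10, 20, 30, 40]
small_errors = [12, 18, 32, 38]    # every prediction is off by 2
one_big_error = [10, 20, 30, 48]   # three perfect predictions, one off by 8

print(mae(y_true, small_errors), rmse(y_true, small_errors))    # prints: 2.0 2.0
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))  # prints: 2.0 4.0
```

MAE rates the two sets of predictions as equally good, while RMSE doubles for the set with the single large miss.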

Quick check

A model classifies emails as spam (1) or not spam (0). It achieves 98% accuracy, but inspection reveals it labels almost every email as not-spam. Which metric best exposes this failure?

Key takeaways

  • Accuracy lies on imbalanced data — a model predicting only the majority class can score 99% while being completely useless.
  • The confusion matrix (TP, TN, FP, FN) gives a full picture of classifier errors beyond a single number.
  • Precision asks 'how many flagged positives were real?'; recall asks 'how many real positives were caught?' — they trade off, and the right balance depends on the problem.
  • Always evaluate on a held-out test set or use cross-validation — training-set scores tell you nothing about real-world performance.
  • For regression, use MAE for interpretability and RMSE when large errors deserve extra punishment.

Practice challenges

Predict the output (#1)

This snippet computes accuracy from scratch, just like in the lesson. What does it print?

predictions = [1, 1, 0, 1, 0]
true_labels  = [1, 0, 0, 1, 0]

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)
print(f"Accuracy: {accuracy:.1%}")

Fix the bug (#2)

This code has a bug — what's wrong?

def confusion_matrix(y_true, y_pred):
    TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    FP = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    FN = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return TP, TN, FP, FN

Fill in the blank (#3)

Fill in the blank so this function correctly computes recall — 'of all real positives, how many did the model catch?'

def recall(TP, FN):
    return TP / (TP + ____) if (TP + FN) > 0 else 0

print(recall(TP=3, FN=1))  # 0.75

Reorder the lines (#4)

Put these lines in the correct order to properly evaluate a model without overfitting: split the data, train only on training data, then measure on the held-out test set.

  (1) model.fit(train_X, train_y)
  (2) print(accuracy(test_y, predictions))
  (3) predictions = model.predict(test_X)
  (4) train_X, test_X, train_y, test_y = split(X, y, test_size=0.2)

Your turn
Practice exercise

You are given two lists: y_true (the real labels) and y_pred (model predictions), both containing 0s and 1s.

Write a function evaluate(y_true, y_pred) that prints:

  1. Accuracy (as a percentage, e.g. 73.33%)
  2. Precision (2 decimal places)
  3. Recall (2 decimal places)
  4. F1 Score (2 decimal places)

Test it on the lists provided.
