Gradient Descent

Discover how machine learning models find their best settings by repeatedly nudging in the direction that shrinks their mistakes.

Every time you ask a recommendation engine "what should I watch next?" or let your phone autocorrect your typo, a trained model is doing the work. But how did that model get so good? It didn't start out smart — it started out guessing randomly. Then, over thousands of tiny adjustments, it learned.

The engine behind almost all of that learning is called gradient descent. It's the workhorse of machine learning, and once you understand it, you'll see it everywhere.

#The Problem: Finding the Best Parameters

A machine learning model has parameters — numbers it uses to make predictions. A simple linear model might have just a slope and an intercept; a neural network might have millions. The question is: what values should those parameters have?

We measure how wrong a model is using a loss function (also called a cost or error function). A common one is the squared difference between the model's prediction and the true answer. A loss of zero means the predictions match the true answers exactly on the data being measured. Our goal is to find the parameters that make the loss as small as possible.
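To make the squared-error loss concrete, here is a minimal sketch for a linear model with a slope and an intercept. The data points and the function names `predict` and `mse_loss` are made up for illustration, and averaging the squared errors (mean squared error) is one common convention rather than the only one.

```python
def predict(slope, intercept, x):
    """Prediction of a simple linear model."""
    return slope * x + intercept

def mse_loss(slope, intercept, xs, ys):
    """Mean of the squared differences between predictions and true answers."""
    errors = [(predict(slope, intercept, x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Hypothetical data generated by y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(mse_loss(2.0, 1.0, xs, ys))  # the parameters that generated the data: 0.0
print(mse_loss(1.0, 0.0, xs, ys))  # worse parameters: 7.5
```

The second call scores worse because every prediction misses; gradient descent is the procedure that moves parameters from the second situation toward the first.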

Think of it like

The Blindfolded Hiker

Imagine you're blindfolded on a hilly landscape, and your goal is to reach the lowest valley. You can't see the whole terrain — you can only feel which direction the ground slopes beneath your feet. So you take a step in the downhill direction, feel the slope again, take another step... and keep going until the ground feels flat. That's gradient descent. The landscape is your loss function. The valley is the minimum loss. Your steps are parameter updates.

#What Is a Gradient?

The word "gradient" sounds intimidating, but it's just a fancy word for slope. For a function of one variable, the gradient at a point tells you: "if I increase x a tiny bit, does the function go up or down, and by how much?"

If the gradient is positive, the function is rising to the right — so we should step left (decrease x) to go downhill.
If the gradient is negative, the function is falling to the right — so we should step right (increase x) to go downhill.

In both cases, we move in the opposite direction of the gradient — hence "descent".
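One way to convince yourself that the gradient really is a slope is to estimate it numerically: nudge x by a tiny amount and measure how much the function changes. The sketch below uses a central difference. The example function `f`, the helper `numerical_gradient`, and the nudge size `h` are illustrative choices made here.

```python
def f(x):
    return (x - 3) ** 2   # example function with its minimum at x = 3

def numerical_gradient(func, x, h=1e-5):
    """Estimate the slope of func at x with a central difference."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in [8.0, 3.0, -1.0]:
    slope = numerical_gradient(f, x)
    if slope > 1e-6:
        direction = "step left (decrease x)"
    elif slope < -1e-6:
        direction = "step right (increase x)"
    else:
        direction = "stay put (flat)"
    print(f"x = {x:5.1f}: slope ~ {slope:7.3f} -> {direction}")
```

A positive slope at x = 8 says "go left", a negative slope at x = -1 says "go right", and a slope near zero at x = 3 says you are at the bottom.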

#A Concrete Example: Minimizing f(x) = (x - 3)²

Let's use the simplest possible loss function: f(x) = (x - 3)**2. This is a parabola (U-shape) with its lowest point — its minimum — at x = 3 (where f(3) = 0).

The gradient (slope) of this function is f'(x) = 2 * (x - 3). If x is currently 8, the gradient is 2 * (8 - 3) = 10 — large and positive, telling us we're to the right of the minimum and should move left.

x starts at 8 and creeps toward the true minimum at 3. Each step, the loss shrinks.

def loss(x):
    return (x - 3) ** 2

def gradient(x):
    return 2 * (x - 3)

x = 8.0          # start far from the answer
lr = 0.1         # learning rate

for step in range(10):
    grad = gradient(x)
    x = x - lr * grad   # step downhill
    print(f"Step {step+1}: x = {x:.4f}, loss = {loss(x):.4f}")

#The Learning Rate: Step Size Matters

The learning rate (often written as lr or α) controls how big each step is. It's one of the most important choices in machine learning.

Too large: you overshoot the minimum and bounce back and forth, or even diverge (loss goes up instead of down).
Too small: convergence is painfully slow — you'll get there eventually, but it might take millions of steps.
Just right: you converge smoothly in a reasonable number of steps.

Typical learning rates are small numbers like 0.01, 0.001, or 0.1. Finding a good one is part art, part science.
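For the lesson's parabola f(x) = (x - 3)², these three regimes can be worked out exactly. Writing α for the learning rate and x_k for the value of x after k steps (notation assumed here for the derivation), the update rule gives:

```latex
\begin{aligned}
x_{k+1} &= x_k - \alpha f'(x_k) = x_k - 2\alpha\,(x_k - 3) \\
x_{k+1} - 3 &= (1 - 2\alpha)\,(x_k - 3) \\
x_k - 3 &= (1 - 2\alpha)^k\,(x_0 - 3)
\end{aligned}
```

So the distance to the minimum is multiplied by (1 - 2α) on every step. It shrinks when |1 - 2α| < 1, which means 0 < α < 1. It flips sign on each step (overshooting) when α > 0.5, and it grows without bound when α > 1. These thresholds are specific to this function: steeper loss curves need smaller learning rates.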

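With lr=0.9 every step overshoots the minimum, so x bounces from one side of 3 to the other, yet it still converges because this parabola is forgiving. In fact lr=0.9 and lr=0.1 finish at the same x here, since both shrink the distance to the minimum by a factor of 0.8 per step. With lr=0.01, 20 iterations only move x from 8 to about 6.34.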

def loss(x): return (x - 3) ** 2
def gradient(x): return 2 * (x - 3)

for lr, label in [(0.9, 'too big'), (0.1, 'just right'), (0.01, 'too small')]:
    x = 8.0
    for _ in range(20):
        x -= lr * gradient(x)
    print(f"lr={lr} ({label:10s}): final x = {x:.4f}")
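None of the three rates above actually diverges on this parabola. As a sketch of what divergence looks like, the run below uses lr = 1.1, a value picked here purely for illustration: each step overshoots by more than it corrects, so the loss grows.

```python
def loss(x):
    return (x - 3) ** 2

def gradient(x):
    return 2 * (x - 3)

x = 8.0
lr = 1.1   # illustrative value, deliberately too large for this function
losses = []
for step in range(1, 6):
    x -= lr * gradient(x)   # same update rule, oversized step
    losses.append(loss(x))
    print(f"Step {step}: x = {x:9.4f}, loss = {loss(x):9.4f}")
```

x lands further from 3 after every step, alternating sides, and the loss climbs instead of falling. If you see this pattern in training, lowering the learning rate is the first thing to try.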

Common mistake

Gradient Descent Doesn't Always Find the Global Minimum

Our parabola has only one valley, so gradient descent always finds the true answer. But real-world loss functions — especially for deep neural networks — are bumpy landscapes with many valleys (local minima). Gradient descent can get stuck in a small dip instead of finding the deepest one. Techniques like momentum, random restarts, and careful initialization help, but this remains an open challenge in ML research.
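To see the effect of local minima in miniature, the sketch below runs the same update rule on a curve with two valleys. The function `f`, its derivative `grad_f`, and the helper `descend` are invented for this illustration; only the starting point differs between the two runs.

```python
def f(x):
    # Illustrative non-convex function with two valleys (not from the lesson)
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=500):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

left = descend(-2.0)   # starts on the left slope
right = descend(2.0)   # starts on the right slope

print(f"start -2.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
print(f"start  2.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
```

Both runs stop where the slope is zero, but only the left start reaches the deeper valley. The right start settles in the shallower one and has no way of knowing a better answer exists.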

#Iterations and Epochs

Each individual update to the parameters is called an iteration or step. When training on a full dataset, one pass through all the training data is called an epoch. You typically run many epochs — 10, 50, 100, or more — watching the loss decrease each time.

In practice, deep learning libraries like PyTorch and TensorFlow compute the gradients for you using a technique called automatic differentiation, while scikit-learn's built-in models come with their optimization routines already implemented. Either way, the core idea is the loop you've seen here: compute the gradient, update the parameters, repeat.

Tip

Variants of Gradient Descent

Batch gradient descent uses the whole dataset to compute each gradient update — accurate but slow on large datasets.

Stochastic gradient descent (SGD) uses one random sample per update — noisy but fast.

Mini-batch gradient descent uses a small batch (e.g., 32 or 64 samples) per update — the sweet spot used in most modern deep learning.
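The three variants differ only in how many samples feed each update, so one training function with a `batch_size` parameter can sketch all of them. Everything here is an assumption made for the example: the synthetic dataset (y = 2x + 1 plus noise), the function name `train`, and the learning rate and epoch count.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical dataset: y = 2x + 1 plus a little noise
data = []
for _ in range(200):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)))

def train(data, batch_size, lr=0.1, epochs=100):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    updates = 0
    for _ in range(epochs):
        shuffled = random.sample(data, len(data))  # new random order each epoch
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # Average gradient of (w*x + b - y)**2 over the batch
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates

results = {}
for name, size in [("batch", len(data)), ("SGD", 1), ("mini-batch", 32)]:
    results[name] = train(data, size)
    w, b, updates = results[name]
    print(f"{name:10s} (batch_size={size:3d}): w = {w:.3f}, b = {b:.3f}, updates = {updates}")
```

With 200 samples, batch gradient descent makes 1 update per epoch, SGD makes 200, and a batch size of 32 makes 7. All three should land near w = 2 and b = 1; what changes is how many updates they take and how noisy each one is.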

#Putting It All Together

Here's the full gradient descent recipe in five steps. Notice how mechanically simple it is. The magic is entirely in the gradient telling us which direction to go:

Start with a random (or guessed) parameter value.
Compute the loss — how wrong are we right now?
Compute the gradient — which direction does loss increase?
Step in the opposite direction — move downhill by lr * gradient.
Repeat until the loss is small enough or we've run enough iterations.

By epoch 30, x is about 0.006 from the true answer (3.0). More epochs → more precision. (With a single number and no dataset, each epoch here is just one update step.)

# Full gradient descent: converge from x=8 to x=3
def loss(x):     return (x - 3) ** 2
def gradient(x): return 2 * (x - 3)

x, lr, epochs = 8.0, 0.1, 30
for epoch in range(1, epochs + 1):
    x -= lr * gradient(x)
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d} | x = {x:.6f} | loss = {loss(x):.6f}")
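The loop above runs for a fixed number of iterations. Step 5 of the recipe also allows stopping once the loss is small enough, and one way to sketch that is shown below. The function name `gradient_descent` and the parameters `tol` and `max_steps` are assumptions made for this example.

```python
def loss(x):
    return (x - 3) ** 2

def gradient(x):
    return 2 * (x - 3)

def gradient_descent(x, lr=0.1, tol=1e-6, max_steps=1000):
    """Step downhill until the loss drops below tol or max_steps is reached."""
    for step in range(1, max_steps + 1):
        x -= lr * gradient(x)
        if loss(x) < tol:
            return x, step       # converged early
    return x, max_steps          # ran out of steps

x_final, steps_used = gradient_descent(8.0)
print(f"stopped after {steps_used} steps: x = {x_final:.6f}, loss = {loss(x_final):.2e}")
```

Keeping `max_steps` as a safety cap matters: with a learning rate that is too large the loss may never fall below `tol`, and the loop would otherwise run forever.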

Quick check

Your gradient descent loop is running, but the loss keeps jumping up and down wildly and never settles. What is the most likely cause?

Note

From One Parameter to Millions

We worked with a single number x, but the same idea scales to any number of parameters. With multiple parameters, the gradient becomes a vector (a list of slopes, one per parameter), and we subtract the whole vector at once. Neural networks with millions of parameters do exactly this — the math is identical, just higher-dimensional.
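As a small sketch of the vector case, here is the same loop with two parameters instead of one. The bowl-shaped loss below, with its minimum at a = 1 and b = -2, is made up for illustration; the update line is the only part that changes from the single-parameter version.

```python
def loss(params):
    a, b = params
    return (a - 1) ** 2 + 2 * (b + 2) ** 2   # illustrative two-parameter bowl

def gradient(params):
    a, b = params
    return [2 * (a - 1), 4 * (b + 2)]        # one slope per parameter

params = [5.0, 3.0]
lr = 0.1
for _ in range(100):
    grad = gradient(params)
    # Subtract the whole gradient vector at once
    params = [p - lr * g for p, g in zip(params, grad)]

print(f"a = {params[0]:.4f}, b = {params[1]:.4f}, loss = {loss(params):.2e}")
```

Each parameter is nudged by its own slope, so a steep direction (b here) gets corrected faster than a shallow one (a).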

Key takeaways

Gradient descent finds the best model parameters by repeatedly stepping in the direction that reduces the loss function.
The gradient is just the slope of the loss — always step in the **opposite** direction of the gradient to go downhill.
The learning rate controls step size: too large overshoots, too small converges painfully slowly.
One full pass through the training data is called an epoch; you typically run many epochs.
Real loss landscapes can have local minima, so gradient descent isn't guaranteed to find the global best — but it works remarkably well in practice.

Practice challenges

Predict the output #1

This runs a single gradient descent step on the lesson's loss f(x) = (x - 3)**2, starting from x = 5.0. What does it print?

def loss(x):
    return (x - 3) ** 2

def gradient(x):
    return 2 * (x - 3)

x = 5.0
lr = 0.1
x = x - lr * gradient(x)
print(f"x = {x:.1f}, loss = {loss(x):.2f}")

Fix the bug #2

This code has a bug — what's wrong?

def loss(x):
    return (x - 3) ** 2

def gradient(x):
    return 2 * (x - 3)

x = 8.0
lr = 0.1
for _ in range(30):
    x = x + lr * gradient(x)   # step
print(f"final x = {x:.2f}")

Fill in the blank #3

Complete the gradient descent update rule so x moves downhill toward the minimum of the loss.

def gradient(x):
    return 2 * (x - 3)

x = 8.0
lr = 0.1
for _ in range(30):
    x = x ___ lr * gradient(x)

Reorder the lines #4

Put these lines in the correct order to build one full gradient descent step inside the loop (from the lesson's 5-step recipe).

current_loss = loss(x)

print(f"loss = {current_loss:.4f}")

grad = gradient(x)

x = x - lr * grad

Your turn

Practice exercise

Implement gradient descent to minimize f(x) = (x - 7) ** 2 starting from x = 0.0. Use a learning rate of 0.15 and run for 25 iterations. Print the value of x and the loss every 5 steps. The minimum is at x = 7 with loss = 0.

Starter code (solution.py):

def loss(x):
    return (x - 7) ** 2

def gradient(x):
    # TODO: return the derivative of (x - 7)**2
    return 0.0

x = 0.0
lr = 0.15

for step in range(1, 26):
    x -= lr * gradient(x)
    if step % 5 == 0:
        print(f"Step {step:2d} | x = {x:.4f} | loss = {loss(x):.4f}")