How Networks Learn
Watch a neural network go from wild random guesses to confident predictions — by measuring its mistakes, tracing blame backwards, and nudging every weight downhill, thousands of times.
When AlphaGo beat Lee Sedol, one of the world's top Go players, in 2016, nobody had hand-coded its strategy. There was no rule saying "prefer the upper-left corner" or "sacrifice this stone to gain that one". Instead, the network learned: first by studying games played by human experts, then by playing millions of games against itself, making errors, and adjusting.
How does a program actually learn from mistakes? The answer involves three ideas working in a loop: measure the error, trace blame backwards, take one tiny downhill step. Repeat that loop tens of thousands of times and a tangle of random numbers becomes a system that can beat world champions. Let's take it apart.
#Step 1 — The Forward Pass: Making a Guess
You already know how a neural network makes a prediction: data flows in through the input layer, gets multiplied by weights, passes through activations, and emerges at the output layer as a number (or a list of numbers). This is called the forward pass — data flows forward, left to right.
At the start of training, every weight is a small random number. So the network's first prediction is basically a guess. For a problem like "is this email spam or not?", the network might spit out 0.43 — nearly a coin flip. We need to know how wrong that is, which brings us to the loss.
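To make the "first guess" concrete, here is a minimal sketch of a forward pass through a single neuron. The three email features, the weight range, and the sigmoid activation are assumptions chosen for illustration; a real spam filter would use many more inputs and layers.

```python
import math
import random

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def forward_pass(weights, bias, features):
    """Weighted sum of the inputs, followed by an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(weighted_sum)

random.seed(0)  # fixed seed so the "random" guess is repeatable

# Hypothetical email features: [number of links, exclamation marks, known sender?]
features = [3.0, 5.0, 0.0]

# Untrained network: every weight is a small random number
weights = [random.uniform(-0.1, 0.1) for _ in features]
bias = 0.0

guess = forward_pass(weights, bias, features)
print(f"Spam probability before training: {guess:.2f}")
```

Because the weights are tiny and random, the output lands somewhere near 0.5, which is the "coin flip" behaviour described above.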
#Step 2 — The Loss: Putting a Number on "How Wrong"
A loss function (sometimes called a cost function) converts the gap between the network's guess and the correct answer into a single number. Small loss = close guess. Large loss = terrible guess.
The most common loss for regression problems is Mean Squared Error (MSE): square the difference between each prediction and its true value, then average across all examples. Squaring does two things — it makes all errors positive, and it punishes big mistakes far more than small ones.
For a single example: loss = (prediction - true_value) ** 2
For a batch of examples, just average those squared errors. The goal of training is simple: drive this number as close to zero as possible.
def mse_loss(predictions, targets):
    """Mean Squared Error: average of squared differences."""
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return total / len(predictions)

# The network predicted these values; correct answers follow
predictions = [0.43, 0.91, 0.12]
targets = [1.0, 1.0, 0.0]
loss = mse_loss(predictions, targets)
print(f"Loss: {loss:.4f}")

# After some training, predictions improve:
better_preds = [0.82, 0.95, 0.07]
print(f"Loss after training: {mse_loss(better_preds, targets):.4f}")

#Step 3 — Backpropagation: Assigning Blame
The Restaurant Analogy
Imagine a restaurant where every dish is prepared by a chain of cooks: Cook A preps the ingredients, hands them to Cook B who seasons, who hands to Cook C who plates. The dish goes out wrong. The manager needs to figure out who is most responsible for the mistake — was it A's chopping, B's seasoning, or C's plating?
Backpropagation does exactly this for a neural network. The loss is "the dish went wrong". The manager walks backwards through the kitchen — starting at Cook C, then B, then A — and assigns each cook a blame score (a gradient) that says how much their actions contributed to the final error. Cooks who contributed a lot get a bigger nudge to change their technique.
Technically, backpropagation uses the chain rule of calculus — but you don't need to know calculus to understand the idea. The chain rule just says: the total blame for an early weight equals its direct effect on the next layer, times that layer's effect on the next, times... all the way to the loss. Blame multiplies along the path.
For each weight in the network, backpropagation computes a gradient, a number that answers: "if I nudge this weight up by a tiny amount, does the loss go up or down, and by how much?" A positive gradient means raising the weight would raise the loss, so the weight should come down; a negative gradient means raising the weight would lower the loss, so the weight should go up.
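To see blame multiplying along a path, here is a small sketch with two weights in a row, like two cooks in the chain. The values of w1, w2, x, and target are made up for illustration. The chain-rule gradient for the early weight is compared against a literal "nudge test".

```python
def chain_loss(w1, w2, x, target):
    """Two stages in a row: x -> (times w1) -> hidden -> (times w2) -> output."""
    hidden = w1 * x
    output = w2 * hidden
    return (output - target) ** 2

w1, w2, x, target = 0.5, -0.4, 2.0, 1.0  # illustrative values

# Chain rule: multiply the effect of each link on the path from w1 to the loss
hidden = w1 * x
output = w2 * hidden
d_loss_d_output = 2 * (output - target)  # how the loss reacts to the output
d_output_d_hidden = w2                   # how the output reacts to the hidden value
d_hidden_d_w1 = x                        # how the hidden value reacts to w1
grad_w1 = d_loss_d_output * d_output_d_hidden * d_hidden_d_w1

# Nudge test: move w1 slightly in each direction and watch the loss
nudge = 1e-6
numeric_grad_w1 = (chain_loss(w1 + nudge, w2, x, target)
                   - chain_loss(w1 - nudge, w2, x, target)) / (2 * nudge)

print(f"Chain-rule gradient: {grad_w1:.4f}")
print(f"Nudge-test gradient: {numeric_grad_w1:.4f}")
```

Both numbers come out at about 2.24. The gradient is positive, so raising w1 would raise the loss, and the sensible move is to lower it.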
# A minimal 1-neuron example: compute gradient by hand
# Network: output = weight * input
# Loss: (output - target) ** 2
def forward(w, x):
    return w * x

def loss(output, target):
    return (output - target) ** 2

def gradient_of_loss_wrt_weight(w, x, target):
    # d(loss)/d(w) = 2 * (output - target) * x
    output = forward(w, x)
    return 2 * (output - target) * x

w = 0.3       # initial random weight
x = 1.5       # input feature
target = 1.0

output = forward(w, x)
print(f"Output: {output:.3f}, Loss: {loss(output, target):.3f}")
grad = gradient_of_loss_wrt_weight(w, x, target)
print(f"Gradient: {grad:.3f} (negative = weight is too low, raise it)")

#Step 4 — Gradient Descent: Taking the Downhill Step
Now that we know each weight's gradient, we update every weight using one simple rule:
new_weight = old_weight - learning_rate * gradient
The learning rate (often written as lr or α) controls how big a step we take. Think of it as the step size:
- Too large: we overshoot the valley and bounce around, never settling.
- Too small: training takes forever.
- Just right: we slide steadily downhill.
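A single update can be worked through by hand. The sketch below reuses the numbers from the one-neuron example (w = 0.3, x = 1.5, target = 1.0); the learning rate of 0.1 and the helper name gradient_descent_step are assumptions for illustration.

```python
def gradient_descent_step(weight, gradient, learning_rate):
    """Apply the update rule: new_weight = old_weight - learning_rate * gradient."""
    return weight - learning_rate * gradient

w, x, target = 0.3, 1.5, 1.0
learning_rate = 0.1  # assumed value for illustration

output = w * x
loss_before = (output - target) ** 2
gradient = 2 * (output - target) * x  # negative here: the weight is too low

new_w = gradient_descent_step(w, gradient, learning_rate)
loss_after = (new_w * x - target) ** 2

print(f"gradient = {gradient:.3f}")
print(f"weight:  {w:.3f} -> {new_w:.3f}")
print(f"loss:    {loss_before:.4f} -> {loss_after:.4f}")
```

Subtracting a negative gradient pushes the weight up, from 0.3 to 0.465, and the loss drops as a result. One step does not reach the bottom of the valley; it only moves in the right direction.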
This update rule is called gradient descent: we descend the loss landscape by stepping in the steepest downhill direction. One update, computed from a single example or a batch of examples, is one training iteration. One full pass over the dataset is called an epoch. Run enough epochs, often hundreds or thousands, and the weights usually settle at values that make the loss small.
# Full training loop: one weight, one input, gradient descent
def train_one_neuron(x, target, epochs=20, lr=0.1):
    w = 0.3  # start with a random weight
    for epoch in range(epochs):
        output = w * x
        current_loss = (output - target) ** 2
        grad = 2 * (output - target) * x
        w = w - lr * grad  # gradient descent step
        if epoch % 5 == 0:
            print(f"Epoch {epoch:2d} | w={w:.4f} | loss={current_loss:.4f}")
    return w

final_w = train_one_neuron(x=1.5, target=1.0)
print(f"\nFinal weight: {final_w:.4f} (ideal would be ~{1.0/1.5:.4f})")

Connecting to the Gradient Descent Visualizer
If you've used the gradient descent visualizer in this course, you've seen a loss surface — a hilly landscape where the height represents the loss. Training is the process of placing a ball on that landscape and letting it roll downhill. Backpropagation computes which direction is 'downhill' at your current position; the learning rate decides how big a step to take. Every epoch moves the ball a little further into the valley.
#Putting It All Together: The Training Loop
Real training combines all four steps into a loop that runs for many epochs:
- Forward pass — run the input through the network to get a prediction.
- Compute loss — measure how wrong the prediction is.
- Backpropagation — compute the gradient of the loss with respect to every weight.
- Gradient descent — update every weight by subtracting lr * gradient.
A real network might have millions of weights, but the loop is identical: the same four steps simply run for all of them at once. Below is a complete from-scratch implementation of a tiny network, a single sigmoid neuron, learning the AND gate.
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_deriv(s):
    return s * (1 - s)  # s is already sigmoid(x)

# Dataset: AND gate (1 only when both inputs are 1)
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

# Random-ish starting weights and bias
w = [0.1, -0.2]
b = 0.0
lr = 0.5

for epoch in range(5000):
    total_loss = 0
    for inputs, target in data:
        # Forward pass
        raw = sum(wi * xi for wi, xi in zip(w, inputs)) + b
        pred = sigmoid(raw)
        # Loss: squared error (kept simple here; cross-entropy is the usual choice for classification)
        err = pred - target
        total_loss += err ** 2
        # Backprop: gradient flows back through sigmoid
        # (the constant factor 2 from the squared error is folded into lr)
        delta = err * sigmoid_deriv(pred)
        for i in range(len(w)):
            w[i] -= lr * delta * inputs[i]
        b -= lr * delta
    if epoch % 1000 == 0:
        print(f"Epoch {epoch:4d} | loss={total_loss/4:.4f}")

print("\nFinal predictions:")
for inputs, target in data:
    raw = sum(wi * xi for wi, xi in zip(w, inputs)) + b
    print(f"  {inputs} -> {sigmoid(raw):.3f} (target {target})")

The Learning Rate Trap
Setting the learning rate too high is one of the most common beginner mistakes. If lr is too large, each gradient descent step overshoots the valley: the loss can increase instead of decrease, and in the worst case it blows up to infinity. If this happens in your own experiments, divide the learning rate by 10 and try again.
Signs of a learning rate that's too high: the loss oscillates wildly instead of decreasing smoothly, or you see nan or inf in the output.
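The trap is easy to reproduce on the one-weight model from earlier. The sketch below runs the same ten update steps with three learning rates; the specific values 0.001, 0.1, and 0.5 are assumptions that happen to be "too small", "reasonable", and "too large" for this toy problem only, and the helper name run_gradient_descent is made up for illustration.

```python
def run_gradient_descent(lr, steps=10, w=0.3, x=1.5, target=1.0):
    """Return the loss before each step for a one-weight model: output = w * x."""
    losses = []
    for _ in range(steps):
        output = w * x
        losses.append((output - target) ** 2)
        w -= lr * 2 * (output - target) * x  # gradient descent update
    return losses

# Too small, reasonable, too large (thresholds are specific to this toy problem)
for lr in (0.001, 0.1, 0.5):
    losses = run_gradient_descent(lr)
    print(f"lr={lr:<5} first loss={losses[0]:.4f}  last loss={losses[-1]:.4f}")
```

With the smallest rate the loss barely moves, with the middle rate it shrinks quickly, and with the largest rate every step overshoots by more than the last, so the loss grows instead of falling.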
#What Epochs and Batches Actually Mean
You'll often see two terms in ML code:
- Epoch: one full pass through the entire training dataset. After one epoch, every example has contributed to a weight update. Training a network well can take anywhere from a handful to thousands of epochs, depending on the problem.
- Batch size: instead of updating weights after every single example, we often accumulate gradients over a small batch (e.g. 32 or 64 examples) and then update once. This is called mini-batch gradient descent and is faster in practice because modern hardware can process many examples in parallel.
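Here is a rough sketch of how epochs and batches fit together in code. The toy dataset (target = 2 * x), the batch size of 4, the learning rate, and the helper name make_batches are all assumptions for illustration. Each batch's gradients are averaged and applied as a single update.

```python
import random

def make_batches(examples, batch_size, rng):
    """Shuffle the examples, then slice them into consecutive batches."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

# Toy dataset (assumed): target = 2 * x, so the ideal weight is 2.0
examples = [(i / 10, 2 * i / 10) for i in range(1, 11)]

rng = random.Random(0)  # fixed seed so the shuffling is repeatable
w, lr, batch_size = 0.0, 0.5, 4
epochs = 50
updates = 0

for epoch in range(epochs):                      # one epoch = one full pass
    for batch in make_batches(examples, batch_size, rng):
        # Average the per-example gradients, then update once per batch
        grads = [2 * (w * x - t) * x for x, t in batch]
        w -= lr * sum(grads) / len(grads)
        updates += 1

print(f"weight after training: {w:.3f}")
print(f"updates per epoch: {updates // epochs}")
```

Ten examples with a batch size of 4 give three batches per epoch (4, 4, and 2 examples), so the weight is updated three times per pass rather than ten.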
In real frameworks like PyTorch or TensorFlow, all of this — forward pass, loss, backprop, weight update — is handled with a few lines of code. But under the hood, it's the same four-step loop you've just seen from scratch.
During backpropagation, a weight receives a gradient of −2.4 and the learning rate is 0.1. What happens to the weight after the gradient descent update?
How Real Libraries Handle This
In PyTorch, you write loss.backward() and the library computes every gradient automatically using a technique called automatic differentiation: it records each operation performed during the forward pass, then walks that record in reverse, applying the chain rule at every step. In scikit-learn, you don't see any of this: the library hides the training loop entirely behind a model.fit(X, y) call. Now you know what's happening inside that call.
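To get a feel for what "automatic" means here, below is a toy sketch of the idea in plain Python. It is not how PyTorch is implemented; the Value class and its methods are invented for illustration. Each Value remembers which values it was computed from, and backward() walks those records in reverse to hand out gradients.

```python
class Value:
    """A number that remembers how it was computed, so blame can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        """Walk the recorded operations in reverse, applying the chain rule."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)  # parents are always listed before children

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local


# Same one-neuron setup as earlier in the lesson
w, x, target = Value(0.3), Value(1.5), Value(1.0)
error = w * x - target
loss = error * error
loss.backward()
print(f"loss = {loss.data:.4f}, d(loss)/d(w) = {w.grad:.3f}")
```

The gradient for w comes out at -1.65, the same value the hand-written formula 2 * (output - target) * x gives for these inputs, but nobody had to derive that formula.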
Key takeaways
- Training is a four-step loop: forward pass → compute loss → backpropagation → gradient descent update. Repeat for many epochs.
- The loss function converts 'how wrong' into a single number that the network is trying to minimize.
- Backpropagation assigns a gradient to every weight: a number that says how strongly, and in which direction, the loss changes when that weight is nudged up.
- Gradient descent subtracts a fraction of the gradient from each weight; that fraction is the learning rate, and picking it carefully is crucial.
- Under the hood, frameworks like PyTorch do exactly this — `loss.backward()` is backprop, and the optimizer's `step()` is gradient descent.
The ball takes steps downhill (opposite the slope) to reach the lowest loss. Too high a learning rate and it overshoots and bounces; too low and it crawls.
This is the MSE loss function from the lesson. What does it print?
def mse_loss(predictions, targets):
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return total / len(predictions)

predictions = [0.5, 0.0]
targets = [1.0, 0.0]
print(f"Loss: {mse_loss(predictions, targets):.4f}")

Complete the gradient descent update rule taught in the lesson. Fill in the operator and the two variable names so the weight steps downhill.
def update(weight, gradient, lr):
    # move the weight downhill to reduce the loss
    return weight ___ lr * ___
Put the four steps of one training iteration into the correct order, matching the loop taught in the lesson.
grad = 2 * (output - target) * x # 3. backprop: compute the gradient
w = w - lr * grad # 4. gradient descent: update the weight
output = forward(w, x) # 1. forward pass: make a prediction
current_loss = (output - target) ** 2 # 2. compute the loss
This code has a bug — what's wrong?
def train_step(w, x, target, lr):
    output = w * x
    grad = 2 * (output - target) * x
    # apply the gradient descent update
    w = w + lr * grad
    return w

Implement a full training loop from scratch for a single neuron with two inputs, a sigmoid activation, and MSE loss. Train it on the dataset below for 3000 epochs with a learning rate of 0.3, and print the loss every 500 epochs.
Dataset (inputs → target):
- [0.2, 0.8] → 1.0
- [0.9, 0.1] → 0.0
- [0.5, 0.5] → 1.0
- [0.1, 0.3] → 0.0
Your loop should:
1. Compute the output: sigmoid(w0*x0 + w1*x1 + bias)
2. Compute MSE loss over all four examples
3. Compute gradients for w0, w1, and bias using the chain rule
4. Update each parameter with gradient descent
Start with w0 = 0.5, w1 = -0.3, bias = 0.1.