
Computer Vision

Discover how machines really 'see': images are just grids of numbers, and a clever sliding filter called a convolution teaches computers to spot edges, shapes, and faces from raw pixels.

Your phone unlocks when it sees your face. A self-driving car stops at a red light. A hospital AI spots a tumour on an X-ray before a radiologist does. All of these feel like magic — but they all start with the same humble question: how does a computer even look at an image?

The surprising answer is that computers don't see pictures the way you do. They see a grid of plain numbers. Once you understand that, the entire field of computer vision starts to click.

#Images Are Just Numbers

Every digital image is a rectangle of tiny coloured squares called pixels. Each pixel is stored as one or more numbers:

  • A greyscale image stores one number per pixel: its brightness, from 0 (black) to 255 (white) in the common 8-bit format.
  • A colour image stores three numbers per pixel: one each for Red, Green, and Blue (the RGB channels).

So a tiny 4×4 greyscale image is literally a 4×4 grid of integers. A selfie, an MRI scan, a satellite photo — all the same idea, just much bigger.
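
To make the "grid of numbers" idea concrete, here is a minimal sketch of how such images could be written as plain Python lists. The pixel values and the helper names `shape` and `to_grey` are made up for illustration, and the simple channel average is only an assumption: real libraries usually weight the three channels differently.

```python
# A 4x4 greyscale image: one brightness value (0-255) per pixel.
grey = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [0, 64, 128, 255],
]

# A 2x2 colour image: each pixel is an (R, G, B) triple.
colour = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def shape(img):
    """Return (rows, cols) of a 2D list."""
    return len(img), len(img[0])


def to_grey(pixel):
    """Collapse an (R, G, B) pixel to one number with a plain average.

    Illustrative only: a simple average, not a perceptual weighting.
    """
    r, g, b = pixel
    return (r + g + b) // 3


print(shape(grey))           # (4, 4)
print(grey[0][3])            # 255, the top-right pixel is white
print(to_grey(colour[0][0])) # 85, pure red averaged over three channels
```

Indexing is `image[row][column]`, which is the same convention the convolution code later in this lesson relies on.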

Think of it like

The Mosaic Tile Analogy

Imagine a mosaic made of thousands of tiny coloured tiles. Stand far away and you see a portrait. Get very close and each tile is just a blob of colour — it means nothing on its own. A computer always sees the individual tiles (pixel numbers). Its challenge is to figure out the big picture from those raw numbers alone.

#The Key Operation: Convolution

The core trick of modern computer vision is called convolution. The idea is beautifully simple.

Imagine a small magnifying square — say 3×3 pixels — called a kernel (or filter). You slide this little square across the image, one step at a time. At each position you:

  1. Line up the kernel's 9 numbers with the 9 pixels underneath it.
  2. Multiply each kernel number by the pixel beneath it.
  3. Sum all 9 products into a single number.
  4. Write that number into an output grid at the same position.

That's convolution: slide, multiply, sum, move on. The output grid is called a feature map — it lights up wherever the kernel's pattern matches something in the image.

Different kernels detect different features. An edge-detection kernel has positive numbers on one side and negative on the other — where brightness changes sharply the values don't cancel, so the output is high. A blur kernel averages nearby pixels equally. In early computer vision, engineers hand-designed these kernels. The breakthrough of Convolutional Neural Networks (CNNs) is that the network learns the best kernels automatically from training data.

In the example below, the kernel outputs 0 where the image is flat and 210 where brightness jumps: it has found the vertical edge!
# A tiny pure-Python convolution — no libraries needed!

image = [
    [10, 10, 10, 80, 80],
    [10, 10, 10, 80, 80],
    [10, 10, 10, 80, 80],
    [10, 10, 10, 80, 80],
    [10, 10, 10, 80, 80],
]

# Vertical edge detector: negative left, zero middle, positive right
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(image, kernel):
    rows, cols, k = len(image), len(image[0]), len(kernel)
    output = []
    for r in range(rows - k + 1):
        row_out = []
        for c in range(cols - k + 1):
            total = sum(
                image[r + kr][c + kc] * kernel[kr][kc]
                for kr in range(k) for kc in range(k)
            )
            row_out.append(total)
        output.append(row_out)
    return output

result = convolve(image, kernel)
for row in result:
    print(row)
Tip

Reading the Output

Column 0 is all zeros: the kernel saw uniform dark pixels, so the right-minus-left computation cancelled out. Columns 1 and 2 output 210, which is exactly where dark meets bright (each of the kernel's three rows contributes 80 − 10 = 70). The kernel lit up at the edge. In a real CNN this high-activation region tells the next layer: "there is an edge here."
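
To see that different kernels respond to different features, the sketch below runs two kernels over the same image. The `stripes` image and the `horizontal_kernel` are hypothetical additions for illustration, and `convolve` is a compact rewrite of the lesson's function so the snippet runs on its own.

```python
def convolve(image, kernel):
    """Slide a square kernel over a 2D list; multiply and sum at each position."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + kr][c + kc] * kernel[kr][kc]
                for kr in range(k) for kc in range(k))
            for c in range(cols - k + 1)
        ]
        for r in range(rows - k + 1)
    ]


# Hypothetical image: dark rows on top, bright rows below (a horizontal edge).
stripes = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [80, 80, 80, 80, 80],
    [80, 80, 80, 80, 80],
]

# Same vertical edge detector as in the lesson.
vertical_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
# Its rotated counterpart: negative top, zero middle, positive bottom.
horizontal_kernel = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

vertical_map = convolve(stripes, vertical_kernel)
horizontal_map = convolve(stripes, horizontal_kernel)

print(vertical_map)    # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(horizontal_map)  # [[0, 0, 0], [210, 210, 210], [210, 210, 210]]
```

The vertical detector stays silent because every row of this image is uniform from left to right, while the horizontal detector lights up in the rows where dark meets bright.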

#Stacking Layers: From Edges to Objects

A single convolution layer is useful, but the real power of CNNs comes from stacking multiple layers:

  • Layer 1 learns to detect low-level features: edges, corners, colour gradients.
  • Layer 2 sees combinations of edges — textures and simple shapes like curves.
  • Layers 3 and up see combinations of shapes: parts like eyes, wheels, or letters.
  • Final layers combine everything into a prediction: "this is a cat", "this is a stop sign".

The network never needs anyone to explain what an eye is. If enough cat photos are in the training data, it discovers that eyes are a useful intermediate feature on its own. This hierarchical feature learning is why deep networks are so powerful.

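A typical CNN architecture looks like this: Conv → ReLU activation → Pooling (repeat several times) → Flatten → Fully Connected → Prediction. ReLU replaces negative values with zero, pooling shrinks each feature map by summarising small blocks (for example, keeping only the largest value in each 2×2 block), and flattening unrolls the final grids into one long list of numbers for the fully connected layer. Libraries like PyTorch and TensorFlow let you build this in a few lines, but now you know what those lines are actually doing. This same building block, a kernel sliding over a grid of numbers, powers face recognition, self-driving cars, medical imaging AI, accessibility apps that describe scenes aloud, and checkout-free stores.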
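
The steps that follow a convolution in that pipeline can be sketched in pure Python too. This is a simplified illustration, not how PyTorch or TensorFlow implement them: the function names, the 2×2 non-overlapping max pooling, and the sample feature map values are all assumptions chosen for readability.

```python
def relu(feature_map):
    """ReLU activation: replace every negative value with 0."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


def flatten(feature_map):
    """Unroll a 2D grid into a single list, row by row."""
    return [v for row in feature_map for v in row]


# Made-up 4x4 feature map, as if produced by a convolution layer.
feature_map = [
    [0, 210, 210, -5],
    [0, 210, 210, -5],
    [-3, 40, 12, 0],
    [-3, 40, 12, 7],
]

activated = relu(feature_map)  # negatives become 0
pooled = max_pool(activated)   # 4x4 shrinks to 2x2
vector = flatten(pooled)       # 2x2 becomes a list of 4 numbers

print(pooled)  # [[210, 210], [40, 12]]
print(vector)  # [210, 210, 40, 12]
```

Pooling keeps the strongest response in each block, so the later layers work with a smaller grid while still knowing roughly where the edge was found.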

Common mistake

Misconception: The Network 'Sees' Like You Do

A CNN doesn't perceive images the way humans do — it only knows numbers. This matters:

  • Adversarial examples: change just a few pixel values and a network that correctly classifies a cat may suddenly call it guacamole, with high confidence.
  • Dataset bias: a CNN trained mainly on certain demographics may perform poorly on others it rarely saw during training.

Always remember: the network learns what the data shows it, nothing more.
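
The adversarial-example idea can be illustrated with a deliberately tiny toy. The sketch below is not a CNN: it assumes a made-up "classifier" that is just a weighted sum of four pixels, with invented weights and pixel values. Real attacks follow the same spirit but use the trained network's gradients to choose the nudges.

```python
# Toy stand-in for a classifier: a weighted sum of 4 pixel values.
# A score above 0 means "cat". The weights are invented for illustration.
weights = [0.5, -0.5, 0.5, -0.5]


def score(pixels):
    """Weighted sum of the pixels."""
    return sum(w * p for w, p in zip(weights, pixels))


def label(pixels):
    return "cat" if score(pixels) > 0 else "not cat"


original = [100, 98, 100, 98]

# Nudge each pixel by 2 brightness levels (out of 255) in the direction
# that lowers the score: down where the weight is positive, up where negative.
step = 2
perturbed = [
    p - step if w > 0 else p + step
    for w, p in zip(weights, original)
]

print(label(original), score(original))    # cat 2.0
print(label(perturbed), score(perturbed))  # not cat -2.0
```

No pixel moved by more than 2 brightness levels, a change a person would not notice, yet the label flipped because the model only ever looks at the numbers.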

Quick check

After a convolution with an edge-detection kernel, a region in the output feature map has a value close to zero. What does that most likely mean?

Note

Going Further

Modern architectures push beyond classic CNNs:

  • ResNets add 'skip connections' so very deep networks (100+ layers) can still train reliably.
  • Vision Transformers (ViTs) split images into patches and use attention mechanisms instead of convolutions.
  • CLIP learns from image-text pairs so it can match photos to natural-language descriptions.

But every one of them still begins with the same insight: an image is a grid of numbers, and a model can learn to find the patterns hidden in that grid, whether by sliding filters across it or by comparing patches of it.
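
The patch-splitting step mentioned for Vision Transformers is easy to picture on a small grid. The sketch below shows only that first step, under the assumption that the image dimensions are exact multiples of the patch size. The function name `split_into_patches` is hypothetical, and the attention mechanism that follows in a real ViT is not shown.

```python
def split_into_patches(image, patch):
    """Cut a 2D list into non-overlapping patch x patch tiles, row by row.

    Assumes the image height and width are multiples of `patch`.
    """
    rows, cols = len(image), len(image[0])
    return [
        [row[c:c + patch] for row in image[r:r + patch]]
        for r in range(0, rows, patch)
        for c in range(0, cols, patch)
    ]


# A 4x4 image whose pixel values are simply 0..15, so patches are easy to check.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

patches = split_into_patches(image, 2)
# Each patch is then unrolled into a flat list of numbers.
flat_patches = [[v for row in p for v in row] for p in patches]

print(len(patches))     # 4
print(patches[0])       # [[0, 1], [4, 5]]
print(flat_patches[3])  # [10, 11, 14, 15]
```

Each flattened patch plays a role similar to a word in a sentence: the model receives a sequence of them rather than one sliding window at a time.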

Key takeaways

  • An image is just a grid of numbers (pixel values) — computers never see 'pictures', only arrays of integers.
  • Convolution slides a small kernel (filter) over the image, multiplying and summing at each position to produce a feature map that highlights specific patterns like edges.
  • Different kernels detect different features; CNNs learn the best kernels automatically from training data rather than requiring hand-design.
  • Stacking convolutional layers builds up a hierarchy: edges to textures to shapes to objects, allowing networks to recognise complex things from raw pixels.
  • CNNs see only statistical patterns in numbers — they can be fooled by tiny pixel changes and reflect biases present in their training data.
Practice challenges
Predict the output #1

A single 3x3 kernel is applied to one 3x3 patch of pixels: multiply each kernel value by the pixel beneath it, then sum all 9 products. What does this snippet print?

patch = [
    [10, 10, 80],
    [10, 10, 80],
    [10, 10, 80],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
total = sum(
    patch[r][c] * kernel[r][c]
    for r in range(3) for c in range(3)
)
print(total)
Predict the output #2

An image is a grid of numbers. This snippet reports the size of the output feature map when a 3x3 kernel slides over a 6x6 image (the kernel's top-left corner can occupy positions 0 through rows-3). What does it print?

image = [[0] * 6 for _ in range(6)]
k = 3
rows = len(image)
cols = len(image[0])
out_rows = rows - k + 1
out_cols = cols - k + 1
print(out_rows, out_cols)
Fill in the blank #3

Complete the convolution loop from the lesson. At each kernel position we multiply each kernel value by the pixel beneath it and accumulate. Fill in the operation that combines the two values.

def convolve(image, kernel):
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    output = []
    for r in range(rows - k + 1):
        row_out = []
        for c in range(cols - k + 1):
            total = sum(
                image[r + kr][c + kc] ____ kernel[kr][kc]
                for kr in range(k) for kc in range(k)
            )
            row_out.append(total)
        output.append(row_out)
    return output
Reorder the lines #4

Put the stages of a typical CNN pipeline into the correct order, from raw image to final prediction, as described in the lesson.

  1. Conv layer 2+: combine edges into shapes and object parts
  2. Input: the image as a grid of numbers
  3. Fully connected layer produces the prediction ('cat')
  4. Flatten the final feature maps into a vector
  5. Conv layer 1: detect low-level features like edges
Fix the bug #5

This code is meant to clamp a greyscale pixel value to the valid range, but it can return an invalid pixel. What's wrong?

# Greyscale pixel: brightness from 0 (black) to 255 (white)
def clamp_pixel(value):
    if value < 0:
        return 0
    if value > 256:
        return 256
    return value
Your turn
Practice exercise

Write a function apply_blur(image, kernel) that applies a 3×3 blur kernel to a given 2D list of pixel values. A blur kernel averages each pixel with its neighbours — all nine weights equal 1/9. Return the feature map as a 2D list of floats rounded to 2 decimal places.

Test it on the 5×5 image provided and print each row of the result.
