Language Models & LLMs

Discover how machines read, represent, and generate language — from splitting text into tokens all the way to the attention trick that powers ChatGPT.

Every time you type a message to an AI assistant and it replies with something coherent — sometimes brilliant, sometimes confidently wrong — one question sits underneath: how does a program that only knows numbers learn to handle words?

Language is messy. Words have context, irony, ambiguity, grammar. Yet modern Large Language Models (LLMs) like GPT-4 or Claude write poetry, debug code, and summarise legal contracts. This lesson unpacks the machinery behind that, one layer at a time. No magic, just smart engineering.

#Step 1 — Turning Text into Tokens

Before a model can touch language, it needs to convert raw text into numbers. The first step is tokenisation — slicing the text into small chunks called tokens.

A token is roughly a word, but not quite. Common words like "the" get their own token. Rare words get split: "unhappiness" might become "un" + "happiness". Spaces and punctuation are handled too. This lets the model work with a fixed vocabulary of 50,000–100,000 tokens instead of an infinite word list.

"Hello, world!" → ["Hello", ",", " world", "!"]
"tokenisation" → ["token", "isation"]
"GPT" → ["G", "PT"] (or a single token if it's common enough)

Real tokenisers (like Byte-Pair Encoding) are smarter, but the idea is the same: chop text into numbered pieces.

def simple_tokenize(text):
    """Naive whitespace + punctuation tokeniser (real ones are smarter)."""
    tokens = []
    current = ""
    for ch in text:
        if ch in " .,!?;:":
            if current:
                tokens.append(current)
                current = ""
            if ch != " ":
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

sentence = "Language models predict the next token."
tokens = simple_tokenize(sentence)
print(tokens)
print(f"Token count: {len(tokens)}")
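
Byte-Pair Encoding, mentioned above, builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols. The sketch below is a minimal illustration of that merge loop, not a production tokeniser: the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the four-word corpus are made up for this example, and real implementations work on bytes and far larger corpora.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

def learn_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` merge steps."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return words, merges

# Hypothetical toy corpus, chosen so that "low" is the most common fragment
corpus = ["low", "lower", "lowest", "low"]
words, merges = learn_bpe(corpus, num_merges=2)
print(merges)
print(words)
```

After two merges every word starts with the single symbol `low`, while the rarer endings stay split into individual characters. That is the same behaviour as the "un" + "happiness" example: frequent fragments earn their own token, rare ones are assembled from smaller pieces.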

Note

Tokens Are Not Words

When you hear "this model has a 128k token context window", that means it can process about 96,000 words at once (a rough 0.75 words-per-token ratio). A long novel is ~100,000 words. Pricing for LLM APIs is measured in tokens, not characters or words — so knowing this saves money and confusion.
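
The arithmetic in this note is easy to script. The sketch below assumes the rough 0.75 words-per-token ratio quoted above (it varies by language and tokeniser), and the price used in the last line is a hypothetical placeholder, not any provider's real rate.

```python
WORDS_PER_TOKEN = 0.75  # rough ratio from the note above; an approximation only

def tokens_to_words(tokens):
    """Estimate how many words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words):
    """Estimate how many tokens a given number of words will use."""
    return round(words / WORDS_PER_TOKEN)

def estimate_cost(words, price_per_million_tokens):
    """Estimate the cost of processing `words` words at a per-million-token price."""
    return words_to_tokens(words) / 1_000_000 * price_per_million_tokens

print(tokens_to_words(128_000))   # words that fit in a 128k-token window
print(words_to_tokens(100_000))   # tokens needed for a 100,000-word novel
print(estimate_cost(100_000, price_per_million_tokens=2.00))  # hypothetical price
```

Under this estimate a 100,000-word novel needs about 133,000 tokens, slightly more than a 128k window holds.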

#Step 2 — Words as Points in Space (Embeddings)

Once we have tokens, each one is converted to a list of numbers called an embedding — a vector. Think of it as GPS coordinates, but instead of 2D latitude/longitude, each token gets 768 to 12,288 coordinates in a high-dimensional space.

Here's the key insight: meaning lives in geometry. Tokens with similar meanings end up near each other in this space. The network learns these coordinates during training by reading enormous amounts of text and noticing which words appear in similar contexts.

A famous demo: king − man + woman ≈ queen. The arithmetic works because the embedding space captures gender and royalty as geometric directions.

Think of it like

The City Map Analogy

Imagine every word is a building placed on a city map. Similar words are placed in the same neighbourhood: dog, cat, puppy cluster together; mortgage, interest, loan form their own district across town. The model doesn't memorise sentences — it memorises where concepts live on the map and learns to navigate between them.

When you ask "What's a synonym for happy?", the model goes to the happy building and returns its nearest neighbours: joyful, elated, pleased.

In the toy example below, queen scores highest: king − man + woman lands exactly on queen's vector, so the arithmetic works even in this 4-dimensional space.

import math

def dot(a, b): return sum(x*y for x,y in zip(a,b))
def norm(a): return math.sqrt(dot(a, a))
def cosine_similarity(a, b): return dot(a,b) / (norm(a) * norm(b))

# Tiny toy embeddings (real ones have 768+ dims, learned from data)
embeddings = {
    "king":   [0.9, 0.1, 0.8, 0.2],
    "queen":  [0.9, 0.9, 0.8, 0.2],
    "man":    [0.1, 0.1, 0.8, 0.2],
    "woman":  [0.1, 0.9, 0.8, 0.2],
    "puppy":  [0.2, 0.5, 0.1, 0.9],
}

# king - man + woman should be close to queen
result = [embeddings["king"][i] - embeddings["man"][i] + embeddings["woman"][i]
          for i in range(4)]

for word, vec in embeddings.items():
    sim = cosine_similarity(result, vec)
    print(f"{word:8s} similarity: {sim:.3f}")
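
The nearest-neighbour lookup from the city map analogy can be sketched the same way. The three-dimensional vectors below are invented for illustration (real embeddings are learned from data), and `nearest_neighbours` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings: one "feelings" neighbourhood, one "finance" district
toy_embeddings = {
    "happy":    [0.90, 0.80, 0.10],
    "joyful":   [0.85, 0.90, 0.15],
    "pleased":  [0.80, 0.70, 0.20],
    "mortgage": [0.10, 0.20, 0.90],
    "loan":     [0.15, 0.10, 0.95],
}

def nearest_neighbours(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to `word`'s vector."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

print(nearest_neighbours("happy", toy_embeddings))
print(nearest_neighbours("loan", toy_embeddings, k=1))
```

A real system searches millions of vectors, usually with an approximate index rather than this exhaustive scan, but the principle is the same: closeness in the space stands in for closeness in meaning.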

#Step 3 — Predicting the Next Token

Now comes the engine. A language model has one job: given all the tokens it has seen so far, predict what token comes next.

That's it. Deceptively simple. The model outputs a probability distribution over its entire vocabulary — every token gets a score indicating how likely it is to come next. The top candidates for "The cat sat on the" might be:

mat — 34%
floor — 21%
sofa — 15%
roof — 8%
… (50,000 more tokens, each with a tiny probability)

To generate text, the model picks one of those tokens (often by sampling rather than always taking the top one, which adds variety), appends it to the context, and predicts again. It repeats this one token at a time until it produces a stop token or hits a length limit.

The visualiser on this page shows live probabilities updating as each token is added — this is exactly what it displays.

import math, random

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample(vocab, probs):
    """Pick a token by sampling from the probability distribution."""
    r = random.random()
    cumulative = 0.0
    for token, prob in zip(vocab, probs):
        cumulative += prob
        if r < cumulative:
            return token
    return vocab[-1]

vocab  = ["mat", "floor", "sofa", "roof", "bed"]
scores = [2.1, 1.6, 1.3, 0.8, 0.5]   # raw model outputs
probs  = softmax(scores)

print("Next-token probabilities after 'The cat sat on the':")
for token, prob in zip(vocab, probs):
    bar = "#" * int(prob * 40)
    print(f"  {token:8s} {prob:.1%}  {bar}")
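
The predict, pick, append, repeat loop described above can be sketched end to end. Everything model-like here is a stand-in: `toy_model` is a hypothetical lookup table that only looks at the last token, whereas a real network conditions on the whole context. The four-token vocabulary and the `<end>` stop token are likewise assumptions for the example.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<end>"]

def softmax(scores):
    """Convert raw scores to probabilities (subtracting the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(context):
    """Hypothetical stand-in for a network: one raw score per vocabulary token."""
    table = {
        "the": [0.0, 3.0, 0.5, 0.0],
        "cat": [0.0, 0.0, 3.0, 0.5],
        "sat": [0.5, 0.0, 0.0, 3.0],
    }
    return table[context[-1]]

def generate(context, max_tokens=10, greedy=False, rng=random):
    """Extend `context` one token at a time until <end> or the length limit."""
    context = list(context)
    for _ in range(max_tokens):
        probs = softmax(toy_model(context))
        if greedy:
            next_token = VOCAB[probs.index(max(probs))]  # always take the top token
        else:
            next_token = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample
        if next_token == "<end>":
            break
        context.append(next_token)
    return context

print(generate(["the"], greedy=True))
print(generate(["the"]))  # sampled: may differ from run to run
```

Greedy decoding gives the same output on every run. Sampling occasionally takes a lower-probability token, which is where the variety in generated text comes from.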

#Step 4 — Attention: How Context Shapes Meaning

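A major weakness of earlier language models was limited context. Bag-of-words approaches ignored word order entirely, losing all sense of what modifies what, and recurrent networks read text in order but struggled to carry information across long distances.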

The Transformer architecture (introduced in 2017 by researchers at Google) tackled this by building the whole model around a mechanism called attention. The idea: when predicting the next word, don't treat all previous words equally. Let the model decide which earlier tokens matter most right now.

In the sentence "The animal didn't cross the street because it was too tired", what does it refer to? To answer, the model must attend to animal, not street. Attention scores let it do exactly that — compute a relevance weight between every pair of tokens.

High attention weight between it and animal → they're linked
Low weight between it and street → not related here

These weights are computed dynamically for every prediction, which is why Transformers handle long-range dependencies far better than their predecessors.

Tip

Attention in One Sentence

Attention lets each token ask every other token: "How relevant are you to what I'm trying to figure out right now?" The answers are weights. The weighted sum of all previous token representations becomes the model's enriched understanding of the current position — context baked right in.

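In the toy example below, 'it' attends most strongly to 'animal': the largest weight goes to the key that points in nearly the same direction as the query. A trained model resolves pronouns through this kind of geometry.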

import math

def dot(a, b): return sum(x*y for x,y in zip(a,b))

def attention(query, keys, values):
    """Scaled dot-product attention (simplified: one query, scalar values)."""
    d = len(query)
    # Score each key against the query
    raw_scores = [dot(query, k) / math.sqrt(d) for k in keys]
    # Softmax to get weights
    exps = [math.exp(s) for s in raw_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of values
    output = sum(w * v for w, v in zip(weights, values))
    return output, weights

# Toy: the query token 'it' scores three earlier tokens: 'animal', 'street', 'tired'
# Query is 'it' asking: which token should I focus on?
query  = [0.9, 0.1]          # 'it' — pronoun-like
keys   = [[0.8, 0.2],        # 'animal' key
          [0.1, 0.9],        # 'street' key
          [0.5, 0.5]]        # 'tired' key
values = [1.0, 0.2, 0.5]    # simplified scalar values

result, weights = attention(query, keys, values)
print(f"Attention weights: animal={weights[0]:.2f}, street={weights[1]:.2f}, tired={weights[2]:.2f}")
print(f"Context-enriched output: {result:.3f}")
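
The example above handles a single query with scalar values. A slightly fuller sketch computes attention for every position in a sequence, with vector values, and lets each position look only at itself and earlier tokens (the "all previous token representations" idea from the tip). Two simplifications are assumed here: the same toy vectors serve as queries, keys, and values, whereas real Transformers derive each from learned projections, and `causal_self_attention` is an illustrative name rather than a library function.

```python
import math

def softmax(scores):
    """Convert raw scores to weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """For each position i, attend over positions 0..i and mix their values."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the query against every key up to and including position i
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens  = ["animal", "street", "it"]
vectors = [[0.8, 0.2], [0.1, 0.9], [0.9, 0.1]]  # made-up 2-D representations

outputs, all_weights = causal_self_attention(vectors, vectors, vectors)
for token, weights in zip(tokens, all_weights):
    print(token, [round(w, 2) for w in weights])
```

The first token can only attend to itself, so its weight list has a single entry. By the third token, 'it' spreads its attention over all three positions and gives 'animal' more weight than 'street'.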

#The Full Picture: What Makes LLMs Seem Smart

Stack dozens of Transformer layers (each doing attention + feedforward computations) and train on hundreds of billions of tokens of text — books, code, Wikipedia, websites — and something remarkable happens. The model gets very, very good at completing patterns.

Because language encodes knowledge, learning to predict text forces the model to absorb facts, reasoning patterns, grammar, and style. It's not storing an index of facts; it's compressing statistical patterns across all that language into billions of learned weights.

This is why LLMs can:

- Write in the style of Shakespeare (learned from his texts)
- Debug Python code (learned from Stack Overflow, GitHub)
- Explain a concept step by step (learned from tutorials and textbooks)

None of this is "understanding" in the way a human understands. It's extraordinarily sophisticated pattern completion at scale.

Common mistake

LLMs Don't 'Know' Things — They Generate Plausible Continuations

The single biggest misconception about LLMs: that they have a knowledge base they query. They don't. A language model generates the most statistically plausible next token given its context and training. If a confident-sounding wrong answer is more statistically common in the training data than the truth, the model can produce it with complete fluency.

This is called hallucination — the model invents facts, citations, or code that sounds right but isn't. It's not lying; it has no concept of truth. It's completing the pattern. Always verify important claims from an LLM with a primary source.

Quick check

A language model is generating text one token at a time. After producing the word 'delicious', it assigns these probabilities: 'cake'=28%, 'soup'=19%, 'music'=5%, 'the'=12%, 'and'=10%. What does it do next?

#Limits and What Comes Next

LLMs are powerful but bounded:

No persistent memory — each conversation starts fresh unless the history is explicitly included in the context window.
Knowledge cutoff — training data has a date; the model doesn't know about events after it.
Context window limits — the model can only attend to a fixed amount of text at once (though this is growing fast: 128k, 1M tokens).
No grounded reasoning — the model doesn't 'think' step-by-step unless prompted to (hence the effectiveness of 'chain of thought' prompting).
Hallucination — already covered above, but worth restating: confident ≠ correct.
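
The first and third limits interact: a chat application has to resend the conversation history on every turn, and that history must fit the context window. One simple approach, sketched below, keeps the most recent messages that fit a token budget. The token count is only estimated from the 0.75 words-per-token rule of thumb, and `estimate_tokens` and `fit_history` are hypothetical helper names; a real application would count tokens with the model's own tokeniser.

```python
import math

def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from a word count (an approximation, not a tokeniser)."""
    return math.ceil(len(text.split()) / words_per_token)

def fit_history(messages, budget):
    """Keep the newest messages whose estimated tokens fit within `budget`."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["one two three", "four five six", "seven eight nine"]
print(fit_history(history, budget=8))
```

Anything dropped this way is gone as far as the model is concerned, which is why long conversations can appear to "forget" their beginning.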

Real-world LLM systems layer on top: retrieval (fetching documents to inject into context), tool use (letting the model call APIs or run code), and fine-tuning (training further on domain-specific data). But the core engine underneath is always the same: predict the next token, repeat.
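
Retrieval is the easiest of these layers to sketch. The version below ranks documents by how many words they share with the question and pastes the best match into the prompt. This is a deliberately crude stand-in: production systems typically rank by embedding similarity, and the function names, the prompt wording, and the two sample documents are all invented for this example.

```python
import re

def words(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & words(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, documents):
    """Inject the retrieved text into the context ahead of the question."""
    context = "\n".join(retrieve(question, documents))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

documents = [
    "The context window limits how much text a model can read at once.",
    "Embeddings map tokens to vectors.",
]
print(build_prompt("What is a context window?", documents))
```

The model itself is unchanged. It still just predicts the next token, but now the relevant facts are sitting in its context instead of having to be recalled from its weights.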

Tip

In Practice: Use the Libraries

Building a real language model from scratch requires massive data and compute. In practice you'd use:

- Hugging Face `transformers`: load and run pre-trained models in a few lines
- OpenAI / Anthropic APIs: call frontier LLMs over HTTP
- LangChain / LlamaIndex: orchestrate LLMs with retrieval and tool use

But understanding tokenisation, embeddings, and next-token prediction means you'll write better prompts, interpret outputs critically, and know exactly when to trust — and when to check — what the model says.

Key takeaways

Text is split into **tokens** (subword chunks), each converted to a numeric vector called an **embedding** that encodes meaning as geometry.
A language model's core task is predicting the **next token** given all previous tokens — generating text is just repeating this one step.
**Attention** lets the model decide which earlier tokens matter most for each prediction, enabling it to resolve context and long-range dependencies.
LLMs seem intelligent because predicting language at scale forces them to absorb facts and reasoning patterns — but they're doing pattern completion, not true understanding.
**Hallucination** is a fundamental property, not a bug to be patched: always verify important claims from an LLM with an authoritative source.

Try it yourself · Predict the next token

Pick tokens from the probabilities and watch a sentence form.

A language model just predicts the next token from probabilities, adds it, and repeats. Do that thousands of times and you get fluent text.

Practice challenges

Predict the output #1

This is the naive tokeniser from the lesson, run on a new sentence. What does it print?

def simple_tokenize(text):
    tokens = []
    current = ""
    for ch in text:
        if ch in " .,!?;:":
            if current:
                tokens.append(current)
                current = ""
            if ch != " ":
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(len(simple_tokenize("Attention is all you need!")))

Predict the output #2

This uses the cosine_similarity helper from the lesson. What does it print?

import math

def dot(a, b): return sum(x*y for x,y in zip(a,b))
def norm(a): return math.sqrt(dot(a, a))
def cosine_similarity(a, b): return dot(a,b) / (norm(a) * norm(b))

king = [0.9, 0.1, 0.8, 0.2]
print(round(cosine_similarity(king, king), 1))

Fill in the blank #3

Complete the softmax helper from the lesson so the raw scores become probabilities that sum to 1. Fill in the math function and the divisor.

import math

def softmax(scores):
    exps = [math.____(s) for s in scores]
    total = sum(exps)
    return [e / ____ for e in exps]

Reorder the lines #4

Put the steps of generating text one token at a time into the correct order, matching the loop taught in the lesson.

context = context + [next_token]        # 4. append it, then repeat for the next token

probs = softmax(scores)                 # 2. turn scores into a probability distribution

next_token = sample(vocab, probs)       # 3. sample one token from the distribution

scores = model(context)                 # 1. get a raw score for every vocab token

Fix the bug #5

This code has a bug — what's wrong?

def sample(vocab, probs):
    r = random.random()
    cumulative = 0.0
    for token, prob in zip(vocab, probs):
        cumulative += prob
        if r < cumulative:
            return token
    return vocab[-1]

# caller
scores = [2.1, 1.6, 1.3]
choice = sample(vocab, scores)

Your turn

Practice exercise

Implement a tiny bigram language model from scratch. A bigram model looks at the last ONE token and predicts what comes next based on counts from training data.

Write a train(text) function that counts how often each word follows each other word.
Write a predict(word, model) function that returns the most likely next word.
Write a generate(start_word, model, n) function that generates a sequence of n tokens.

Test it on the sample sentence provided.

def train(text):
    """Build a bigram model: dict mapping word -> {next_word: count}."""
    words = text.lower().split()
    model = {}
    # TODO: for each consecutive pair (words[i], words[i+1]),
    # increment model[words[i]][words[i+1]]
    return model

def predict(word, model):
    """Return the most common word that follows 'word'."""
    if word not in model:
        return None
    # TODO: return the key with the highest count in model[word]
    pass

def generate(start_word, model, n=6):
    """Generate n tokens starting from start_word."""
    result = [start_word]
    current = start_word
    # TODO: call predict() n-1 more times and append each result
    return result

# Training data
text = """the cat sat on the mat the cat ate the rat
the rat ran from the cat the mat was flat the cat was fat"""

model = train(text)
print("After 'the':", predict("the", model))
print("After 'cat':", predict("cat", model))
print("Generated:", generate("the", model, 8))
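
For checking your work, here is one possible way to fill in the TODOs. Treat it as a sketch rather than the reference answer: it breaks ties by taking whichever next word was seen first, and it stops early if a word has no recorded successor, both of which are choices the exercise leaves open.

```python
def train(text):
    """Build a bigram model: dict mapping word -> {next_word: count}."""
    words = text.lower().split()
    model = {}
    for current, following in zip(words, words[1:]):
        counts = model.setdefault(current, {})
        counts[following] = counts.get(following, 0) + 1
    return model

def predict(word, model):
    """Return the most common word that follows `word`, or None if unseen."""
    if word not in model:
        return None
    counts = model[word]
    return max(counts, key=counts.get)  # ties go to the first word seen

def generate(start_word, model, n=6):
    """Generate up to n tokens starting from start_word."""
    result = [start_word]
    current = start_word
    for _ in range(n - 1):
        current = predict(current, model)
        if current is None:  # dead end: nothing ever followed this word
            break
        result.append(current)
    return result

text = """the cat sat on the mat the cat ate the rat
the rat ran from the cat the mat was flat the cat was fat"""

model = train(text)
print("After 'the':", predict("the", model))
print("Generated:", generate("the", model, 8))
```

Because `predict` always takes the top count, the generated sequence falls into a loop ("the cat sat on the cat sat on"). Replacing it with sampling in proportion to the counts, as in the lesson's `sample` function, would give more varied output.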