Language Models & LLMs
Discover how machines read, represent, and generate language — from splitting text into tokens all the way to the attention trick that powers ChatGPT.
Every time you type a message to an AI assistant and it replies with something coherent — sometimes brilliant, sometimes confidently wrong — one question sits underneath: how does a program that only knows numbers learn to handle words?
Language is messy. Words have context, irony, ambiguity, grammar. Yet modern Large Language Models (LLMs) like GPT-4 or Claude write poetry, debug code, and summarise legal contracts. This lesson unpacks the machinery behind that, one layer at a time. No magic, just smart engineering.
#Step 1 — Turning Text into Tokens
Before a model can touch language, it needs to convert raw text into numbers. The first step is tokenisation — slicing the text into small chunks called tokens.
A token is roughly a word, but not quite. Common words like the get their own token. Rare words get split: unhappiness might become un, happiness. Spaces and punctuation are handled too. This lets the model work with a fixed vocabulary of 50,000–100,000 tokens instead of an infinite word list.
"Hello, world!"→["Hello", ",", " world", "!"]"tokenisation"→["token", "isation"]"GPT"→["G", "PT"](or a single token if it's common enough)
def simple_tokenize(text):
    """Naive whitespace + punctuation tokeniser (real ones are smarter)."""
    tokens = []
    current = ""
    for ch in text:
        if ch in " .,!?;:":
            if current:
                tokens.append(current)
                current = ""
            if ch != " ":
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

sentence = "Language models predict the next token."
tokens = simple_tokenize(sentence)
print(tokens)
print(f"Token count: {len(tokens)}")

Tokens Are Not Words
When you hear "this model has a 128k token context window", that means it can process about 96,000 words at once (a rough 0.75 words-per-token ratio). A long novel is ~100,000 words. Pricing for LLM APIs is measured in tokens, not characters or words — so knowing this saves money and confusion.
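The arithmetic above fits in a few lines. This is only a rough sketch: the 0.75 words-per-token ratio is the rule of thumb quoted above, the helper names are invented here, and the price in the example is a made-up placeholder, not any provider's actual rate.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb; varies by language and tokeniser

def words_that_fit(context_tokens):
    """Approximate number of words that fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)

def estimate_tokens(word_count):
    """Approximate number of tokens needed for a text of word_count words."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_cost(word_count, price_per_million_tokens):
    """Approximate cost of processing a text (the price is a hypothetical input)."""
    return estimate_tokens(word_count) * price_per_million_tokens / 1_000_000

print(words_that_fit(128_000))                 # 96000
print(estimate_tokens(100_000))                # 133333
print(round(estimate_cost(100_000, 2.00), 2))  # 0.27 at a placeholder price of 2.00 per million tokens
```

Note that a 100,000-word novel needs roughly 133,000 tokens, so it would not quite fit in a 128k window.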
#Step 2 — Words as Points in Space (Embeddings)
Once we have tokens, each one is converted to a list of numbers called an embedding — a vector. Think of it as GPS coordinates, but instead of 2D latitude/longitude, each token gets 768 to 12,288 coordinates in a high-dimensional space.
Here's the key insight: meaning lives in geometry. Tokens with similar meanings end up near each other in this space. The network learns these coordinates during training by reading enormous amounts of text and noticing which words appear in similar contexts.
A famous demo: king − man + woman ≈ queen. The arithmetic works because the embedding space captures gender and royalty as geometric directions.
The City Map Analogy
Imagine every word is a building placed on a city map. Similar words are placed in the same neighbourhood: dog, cat, puppy cluster together; mortgage, interest, loan form their own district across town. The model doesn't memorise sentences — it memorises where concepts live on the map and learns to navigate between them.
When you ask "What's a synonym for happy?", the model goes to the happy building and returns its nearest neighbours: joyful, elated, pleased.
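The "nearest neighbours on the map" idea can be sketched as a similarity search. The three-dimensional vectors below are hand-picked, hypothetical values, and a real LLM does not answer questions by running a literal lookup like this; the sketch only shows what "close together in embedding space" means.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 3-dimensional embeddings, hand-picked so related words point the same way.
toy_embeddings = {
    "happy":    [0.9, 0.8, 0.1],
    "joyful":   [0.8, 0.9, 0.1],
    "elated":   [0.9, 0.7, 0.2],
    "mortgage": [0.1, 0.2, 0.9],
    "loan":     [0.2, 0.1, 0.9],
}

def nearest_neighbours(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to the given word's vector."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

print(nearest_neighbours("happy", toy_embeddings))
print(nearest_neighbours("mortgage", toy_embeddings, k=1))
```

With these toy values, the neighbours of happy are joyful and elated, while mortgage lands next to loan in its own district.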
import math

def dot(a, b): return sum(x*y for x,y in zip(a,b))
def norm(a): return math.sqrt(dot(a, a))
def cosine_similarity(a, b): return dot(a,b) / (norm(a) * norm(b))

# Tiny toy embeddings (real ones have 768+ dims, learned from data)
embeddings = {
    "king":  [0.9, 0.1, 0.8, 0.2],
    "queen": [0.9, 0.9, 0.8, 0.2],
    "man":   [0.1, 0.1, 0.8, 0.2],
    "woman": [0.1, 0.9, 0.8, 0.2],
    "puppy": [0.2, 0.5, 0.1, 0.9],
}

# king - man + woman should be close to queen
result = [embeddings["king"][i] - embeddings["man"][i] + embeddings["woman"][i]
          for i in range(4)]

for word, vec in embeddings.items():
    sim = cosine_similarity(result, vec)
    print(f"{word:8s} similarity: {sim:.3f}")

#Step 3 — Predicting the Next Token
Now comes the engine. A language model has one job: given all the tokens it has seen so far, predict what token comes next.
That's it. Deceptively simple. The model outputs a probability distribution over its entire vocabulary — every token gets a score indicating how likely it is to come next. The top candidates for "The cat sat on the" might be:
- mat: 34%
- floor: 21%
- sofa: 15%
- roof: 8%
- … (50,000 more tokens, each with a tiny probability)
To generate text, the model picks one of those tokens (often by sampling, which adds variety, rather than always taking the top one), appends it to the context, and then predicts the next one again. It repeats this one token at a time until it produces a stop token or reaches a length limit.
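The predict, pick, append, repeat cycle can be sketched as a short loop. Everything here is a stand-in: `toy_model` is a hypothetical lookup table playing the role of a trained network, `<end>` is an assumed stop token, and the loop picks the top-scoring token so the result is deterministic.

```python
def toy_model(context):
    """Return raw next-token scores for the given context (hypothetical lookup rules)."""
    last = context[-1]
    if last == "the":
        # Context matters: once "on" has appeared, "mat" becomes the likelier continuation.
        return {"mat": 2.0, "cat": 1.0} if "on" in context else {"cat": 2.0, "mat": 1.0}
    table = {
        "cat": {"sat": 2.5, "ran": 0.5},
        "sat": {"on": 3.0, "down": 1.0},
        "on":  {"the": 2.0, "a": 1.5},
        "mat": {"<end>": 3.0, "and": 0.2},
    }
    return table.get(last, {"<end>": 1.0})

def generate(prompt_tokens, max_tokens=10):
    """Autoregressive loop: score, pick, append, repeat until <end> or the limit."""
    context = list(prompt_tokens)
    for _ in range(max_tokens):
        scores = toy_model(context)
        # Greedy choice keeps this sketch deterministic; sampling would add variety.
        next_token = max(scores, key=scores.get)
        if next_token == "<end>":
            break
        context.append(next_token)
    return context

print(" ".join(generate(["the"])))  # the cat sat on the mat
```

The `max_tokens` limit matters in practice: without it, a model that never emits its stop token would generate forever.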
import math, random

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample(vocab, probs):
    """Pick a token by sampling from the probability distribution."""
    r = random.random()
    cumulative = 0.0
    for token, prob in zip(vocab, probs):
        cumulative += prob
        if r < cumulative:
            return token
    return vocab[-1]

vocab = ["mat", "floor", "sofa", "roof", "bed"]
scores = [2.1, 1.6, 1.3, 0.8, 0.5]  # raw model outputs
probs = softmax(scores)

print("Next-token probabilities after 'The cat sat on the':")
for token, prob in zip(vocab, probs):
    bar = "#" * int(prob * 40)
    print(f" {token:8s} {prob:.1%} {bar}")

#Step 4 — Attention: How Context Shapes Meaning
A big weakness of earlier approaches was context. Bag-of-words methods ignored word order entirely, losing all sense of what modifies what, and recurrent networks read tokens strictly in sequence, so they tended to lose track of words that appeared far back in the text.
The Transformer architecture (2017, from researchers at Google) tackled this by building the whole model around a mechanism called attention. The idea: when predicting the next word, don't treat all previous words equally. Let the model decide which earlier tokens matter most right now.
In the sentence "The animal didn't cross the street because it was too tired", what does it refer to? To answer, the model must attend to animal, not street. Attention scores let it do exactly that — compute a relevance weight between every pair of tokens.
- High attention weight between it and animal → they're linked
- Low weight between it and street → not related here
These weights are computed dynamically for every prediction, which is why Transformers handle long-range dependencies far better than their predecessors.
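As a rough sketch of "a relevance weight between every pair of tokens", the code below builds one row of weights per token, where each token only looks at itself and the tokens before it. The two-dimensional vectors are hand-picked, hypothetical values, and using the same vector as both query and key is a simplification: real Transformers derive separate query and key vectors through learned projections.

```python
import math

def softmax(scores):
    """Convert raw scores to weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(vectors):
    """For each position, weight itself and all earlier positions by scaled dot product."""
    d = len(vectors[0])
    rows = []
    for i, query in enumerate(vectors):
        # Position i may only look at positions 0..i (no peeking at future tokens).
        scores = [
            sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        rows.append(softmax(scores))
    return rows

# Hypothetical 2-dimensional token vectors, hand-picked for illustration.
tokens = ["animal", "street", "it"]
vectors = [[2.0, 0.0], [0.0, 2.0], [1.8, 0.2]]

for token, row in zip(tokens, causal_attention_weights(vectors)):
    print(token, [round(w, 2) for w in row])
```

In the last row, the weight that it places on animal comes out much larger than the weight on street, which is the behaviour described above.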
Attention in One Sentence
Attention lets each token ask every other token: "How relevant are you to what I'm trying to figure out right now?" The answers are weights. The weighted sum of all previous token representations becomes the model's enriched understanding of the current position — context baked right in.
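Written as a formula, for a single query vector $q$ of dimension $d$, key vectors $k_i$, and values $v_i$ (symbols chosen here for illustration, not taken from the lesson), the weights and the output are:

$$
w_i = \frac{\exp\left(q \cdot k_i / \sqrt{d}\right)}{\sum_j \exp\left(q \cdot k_j / \sqrt{d}\right)}, \qquad \text{output} = \sum_i w_i \, v_i
$$

Dividing by $\sqrt{d}$ keeps the scores from growing with the size of the vectors, and the softmax guarantees the weights are positive and sum to 1.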
import math

def dot(a, b): return sum(x*y for x,y in zip(a,b))

def attention(query, keys, values):
    """Scaled dot-product attention (simplified: one query, scalar values)."""
    d = len(query)
    # Score each key against the query
    raw_scores = [dot(query, k) / math.sqrt(d) for k in keys]
    # Softmax to get weights
    exps = [math.exp(s) for s in raw_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of values
    output = sum(w * v for w, v in zip(weights, values))
    return output, weights

# Toy: the token 'it' attends over three other tokens: 'animal', 'street', 'tired'
# Query is 'it' asking: which token should I focus on?
query = [0.9, 0.1]    # 'it' (pronoun-like)
keys = [[0.8, 0.2],   # 'animal' key
        [0.1, 0.9],   # 'street' key
        [0.5, 0.5]]   # 'tired' key
values = [1.0, 0.2, 0.5]  # simplified scalar values

result, weights = attention(query, keys, values)
print(f"Attention weights: animal={weights[0]:.2f}, street={weights[1]:.2f}, tired={weights[2]:.2f}")
print(f"Context-enriched output: {result:.3f}")

#The Full Picture: What Makes LLMs Seem Smart
Stack dozens of Transformer layers (each doing attention + feedforward computations) and train on hundreds of billions of tokens of text — books, code, Wikipedia, websites — and something remarkable happens. The model gets very, very good at completing patterns.
Because language encodes knowledge, learning to predict text forces the model to absorb facts, reasoning patterns, grammar, and style. It's not storing an index of facts; it's compressing statistical patterns across all that language into billions of learned weights.
This is why LLMs can:
- Write in the style of Shakespeare (learned from his texts)
- Debug Python code (learned from Stack Overflow, GitHub)
- Explain a concept step by step (learned from tutorials and textbooks)
None of this is "understanding" in the way a human understands. It's extraordinarily sophisticated pattern completion at scale.
LLMs Don't 'Know' Things — They Generate Plausible Continuations
The single biggest misconception about LLMs: that they have a knowledge base they query. They don't. A language model generates the most statistically plausible next token given its context and training. If a confident-sounding wrong answer is more statistically common in the training data than the truth, the model can produce it with complete fluency.
This is called hallucination — the model invents facts, citations, or code that sounds right but isn't. It's not lying; it has no concept of truth. It's completing the pattern. Always verify important claims from an LLM with a primary source.
A language model is generating text one token at a time. After producing the word 'delicious', it assigns these probabilities: 'cake'=28%, 'soup'=19%, 'music'=5%, 'the'=12%, 'and'=10%. What does it do next?
#Limits and What Comes Next
LLMs are powerful but bounded:
- No persistent memory — each conversation starts fresh unless the history is explicitly included in the context window.
- Knowledge cutoff — training data has a date; the model doesn't know about events after it.
- Context window limits — the model can only attend to a fixed amount of text at once (though this is growing fast: 128k, 1M tokens).
- No grounded reasoning — the model doesn't 'think' step-by-step unless prompted to (hence the effectiveness of 'chain of thought' prompting).
- Hallucination — already covered above, but worth restating: confident ≠ correct.
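The first and third limits meet in practice: since the model only "remembers" what is in its context window, applications have to decide which parts of a conversation to resend. Below is a minimal sketch of one simple policy, keeping the most recent messages that fit a token budget. The function names and the word-based token estimate are assumptions for illustration; a real system would count tokens with the model's own tokeniser.

```python
def estimate_tokens(text):
    """Very rough token estimate from a word count (about 0.75 words per token)."""
    return round(len(text.split()) / 0.75)

def fit_history(messages, max_tokens):
    """Keep the most recent messages whose estimated tokens fit in the budget.

    Older messages are dropped first, which is one simple policy among many.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

history = [
    "Hi, I need help planning a trip to Lisbon in May.",
    "Sure! How many days will you stay?",
    "Five days, and I love seafood and museums.",
    "Great, here is a rough plan for five days.",
]
# With a budget of 25 estimated tokens, only the last two messages survive.
print(fit_history(history, max_tokens=25))
```

Anything dropped this way is simply gone from the model's point of view, which is why long conversations can "forget" their beginning.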
Real-world LLM systems layer on top: retrieval (fetching documents to inject into context), tool use (letting the model call APIs or run code), and fine-tuning (training further on domain-specific data). But the core engine underneath is always the same: predict the next token, repeat.
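Retrieval is the easiest of these layers to sketch. The toy below scores documents by simple word overlap and places the best match in the prompt ahead of the question. The document store, function names, and prompt layout are all hypothetical; real systems typically rank documents by embedding similarity instead of raw word overlap.

```python
def overlap_score(question, document):
    """Count how many distinct question words also appear in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words)

def build_prompt(question, documents):
    """Pick the best-matching document and place it in the prompt ahead of the question."""
    best = max(documents, key=lambda doc: overlap_score(question, doc))
    return f"Context: {best}\n\nQuestion: {question}\nAnswer:"

# Hypothetical mini document store.
documents = [
    "The library opens at 9am and closes at 6pm on weekdays.",
    "Parking permits are issued by the campus security office.",
    "The cafeteria serves lunch from noon until 2pm.",
]

prompt = build_prompt("When does the library open on weekdays?", documents)
print(prompt)
```

The model itself is unchanged. It still just predicts the next token, but now the relevant facts are sitting in its context.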
In Practice: Use the Libraries
Building a real language model from scratch requires massive data and compute. In practice you'd use:
- Hugging Face `transformers` — load and run pre-trained models in a few lines
- OpenAI / Anthropic APIs — call frontier LLMs over HTTP
- LangChain / LlamaIndex — orchestrate LLMs with retrieval and tool use
But understanding tokenisation, embeddings, and next-token prediction means you'll write better prompts, interpret outputs critically, and know exactly when to trust — and when to check — what the model says.
Key takeaways
- Text is split into **tokens** (subword chunks), each converted to a numeric vector called an **embedding** that encodes meaning as geometry.
- A language model's core task is predicting the **next token** given all previous tokens — generating text is just repeating this one step.
- **Attention** lets the model decide which earlier tokens matter most for each prediction, enabling it to resolve context and long-range dependencies.
- LLMs seem intelligent because predicting language at scale forces them to absorb facts and reasoning patterns — but they're doing pattern completion, not true understanding.
- **Hallucination** is a fundamental property, not a bug to be patched: always verify important claims from an LLM with an authoritative source.
Predicted next token
A language model just predicts the next token from probabilities, adds it, and repeats. Do that thousands of times and you get fluent text.
This is the naive tokeniser from the lesson, run on a new sentence. What does it print?
def simple_tokenize(text):
    tokens = []
    current = ""
    for ch in text:
        if ch in " .,!?;:":
            if current:
                tokens.append(current)
                current = ""
            if ch != " ":
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(len(simple_tokenize("Attention is all you need!")))

This uses the cosine_similarity helper from the lesson. What does it print?
import math

def dot(a, b): return sum(x*y for x,y in zip(a,b))
def norm(a): return math.sqrt(dot(a, a))
def cosine_similarity(a, b): return dot(a,b) / (norm(a) * norm(b))

king = [0.9, 0.1, 0.8, 0.2]
print(round(cosine_similarity(king, king), 1))

Complete the softmax helper from the lesson so the raw scores become probabilities that sum to 1. Fill in the math function and the divisor.
import math

def softmax(scores):
    exps = [math.____(s) for s in scores]
    total = sum(exps)
    return [e / ____ for e in exps]

Put the steps of generating text one token at a time into the correct order, matching the loop taught in the lesson.
context = context + [next_token] # 4. append it, then repeat for the next token
probs = softmax(scores) # 2. turn scores into a probability distribution
next_token = sample(vocab, probs) # 3. sample one token from the distribution
scores = model(context) # 1. get a raw score for every vocab token
This code has a bug — what's wrong?
def sample(vocab, probs):
    r = random.random()
    cumulative = 0.0
    for token, prob in zip(vocab, probs):
        cumulative += prob
        if r < cumulative:
            return token
    return vocab[-1]

# caller
scores = [2.1, 1.6, 1.3]
choice = sample(vocab, scores)

Implement a tiny bigram language model from scratch. A bigram model looks at the last ONE token and predicts what comes next based on counts from training data.
- Write a `train(text)` function that counts how often each word follows each other word.
- Write a `predict(word, model)` function that returns the most likely next word.
- Write a `generate(start_word, model, n)` function that generates a sequence of n tokens.
Test it on the sample sentence provided.