What is a Large Language Model?
You've probably used ChatGPT, Claude, or Gemini. You type something in, and they respond with surprisingly human-like text. But what are they, really?
At its core, a Large Language Model (LLM) is just a very sophisticated next-word predictor. That's it. When you type "The capital of France is", the model predicts that the next word is probably "Paris".
The "Large" in Large Language Model refers to two things:
- Large training data: These models are trained on billions of web pages, books, and articles
- Large number of parameters: GPT-3 has 175 billion internal "knobs" it can adjust, and newer models are believed to have even more
But before we can understand how they predict the next word, we need to understand how they read text in the first place. Computers don't understand words like we do — they need everything converted to numbers.
Definition: Large Language Models
A Large Language Model (LLM) is a neural network trained to model the probability
distribution of sequences of tokens in natural language. Formally, given a sequence of tokens
x₁, x₂, ..., xₜ, the model estimates:
$$P(x_{t+1} \mid x_1, x_2, \ldots, x_t)$$
This is the probability distribution over the next token given all previous tokens. The model generates text by sampling from this distribution repeatedly: predicting one token, adding it to the sequence, then predicting the next.
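Repeated next-token prediction corresponds to the standard chain-rule factorization of a whole sequence's probability. Written out, with T as the sequence length (a symbol assumed here for illustration):

```latex
P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})
```

Each factor is exactly one next-token distribution, which is why a model that only predicts the next token can still assign a probability to an entire text.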
Scale Parameters
Modern LLMs are characterized by:
- Parameter count: 7B to 500B+ trainable parameters
- Training tokens: 1T to 15T+ tokens seen during training
- Vocabulary size: 32K to 200K+ unique tokens
- Context window: 2K to 2M+ tokens in the input sequence
Tokens: Breaking Text into Pieces
Here's a crucial fact: LLMs don't read words. They read tokens.
A token is a piece of text — it could be a whole word, part of a word, or even just a single character. The process of splitting text into tokens is called tokenization.
Running a sentence through a typical tokenizer reveals a few patterns:
- "Hello" is one token, but "tokenization" splits into "token" + "ization"
- Spaces are attached to the following word (" world" not "world")
- Punctuation gets its own tokens
- Common words are usually single tokens; rare words get split
Byte Pair Encoding (BPE)
The most common tokenization method is called Byte Pair Encoding. Here's the intuition:
- Start with every individual character as a token
- Find the most common pair of adjacent tokens
- Merge them into a new token
- Repeat thousands of times
This is why common words like "the" become single tokens, while rare words get broken down. The tokenizer learns which combinations are most useful!
Tokenization Algorithms
Tokenization is the process of mapping text to sequences of integers from a finite vocabulary
V. Formally, a tokenizer is a function:
where Σ* is the set of all strings over the input alphabet, and V* is the set of all sequences of vocabulary indices.
Byte Pair Encoding (BPE)
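BPE is a subword tokenization algorithm that iteratively merges frequent adjacent symbol pairs. Given a training corpus, training proceeds as follows:
- Initialize the vocabulary with all base symbols (the individual characters in the corpus, or the 256 possible byte values for byte-level BPE)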
- Count frequency of all adjacent token pairs
- Merge the most frequent pair into a new token
- Repeat until vocabulary size reaches target (typically 32K-200K)
The merge operations are greedy and deterministic, ensuring encoding is unambiguous. Decoding is simply concatenation of token strings.
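The training steps above can be sketched in a few lines of Python. This is a toy, character-level version under stated assumptions: the names `train_bpe` and `merge_pair` are hypothetical, the corpus is a plain list of words, and ties between equally frequent pairs are broken arbitrarily but deterministically. Production tokenizers add byte-level handling and pre-splitting that are omitted here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` in `symbols` with one merged token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single characters, weighted by its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        # Greedy choice: the most frequent pair (ties broken by the pair itself).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the merge everywhere before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


if __name__ == "__main__":
    toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    print(train_bpe(toy_words, num_merges=5))
```

The returned list is ordered: earlier merges were more frequent in the corpus, and that order is what makes encoding deterministic later.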
Modern Variants
| Tokenizer | Used In | Key Feature |
|---|---|---|
| BPE | GPT-2, GPT-3, RoBERTa | Byte-level, merges frequent pairs |
| WordPiece | BERT, DistilBERT | Chooses merges that most increase training-data likelihood |
| SentencePiece | T5, LLaMA | Library (BPE or unigram) that works on raw text, language-agnostic |
| tiktoken | GPT-3.5, GPT-4 | Fast BPE library with regex pre-splitting |
Embeddings: Turning Tokens into Numbers
Now we have tokens as integers. But neural networks work with continuous numbers, not discrete integers. We need to convert each token into a vector — a list of numbers.
This is called an embedding. Think of it as giving each token its own "address" in a high-dimensional space.
Why 768 Dimensions?
Different models use different embedding sizes:
- Small models: 256-512 dimensions
- Medium models (GPT-2): 768 dimensions
- Large models (GPT-3): 1,536-12,288 dimensions
More dimensions = more capacity to capture nuance, but also more computation.
Analogies in Vector Space
Here's where it gets wild: embeddings capture semantic relationships. The famous example:
$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
If you take the vector for "king", subtract "man", and add "woman", you get approximately the vector for "queen"!
This works because the embedding space learned that "gender" is a direction you can move in. "Man → woman" is roughly the same vector as "king → queen" or "actor → actress".
Embedding Layers
An embedding layer is a learnable lookup table E ∈ ℝ^(V×d) where:
- V = vocabulary size (number of unique tokens)
- d = embedding dimension (model width)
- E[i] = the d-dimensional vector for token i
For input token index i, the embedding is retrieved as:
$$\text{embed}(i) = E[i]$$
This is equivalent to a one-hot encoding followed by matrix multiplication:
$$E[i] = e_i^\top E$$
where e_i ∈ ℝ^V is the one-hot vector with 1 at position i.
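A small sketch can confirm that a table lookup and a one-hot matrix product give the same vector. The matrix values and the helper names `one_hot` and `vec_mat` below are made up for illustration; real embedding tables are learned and far larger.

```python
def one_hot(i, size):
    """Vector of length `size` with 1.0 at index i and 0.0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(size)]


def vec_mat(v, matrix):
    """Multiply row vector v (length V) by a V x d matrix given as a list of rows."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(v[r] * matrix[r][c] for r in range(rows)) for c in range(cols)]


# Toy embedding table with V = 4 tokens and d = 3 dimensions (arbitrary values).
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

token_index = 2
lookup = E[token_index]                                   # direct row lookup
via_matmul = vec_mat(one_hot(token_index, len(E)), E)     # one-hot times E

print(lookup, via_matmul)
```

In practice frameworks use the lookup, since multiplying by a mostly-zero vector wastes computation.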
Semantic Structure
The embedding space exhibits linear structure that captures semantic relationships. For word analogies of the form "a is to b as c is to d" (e.g., man:woman::king:queen):
$$E[b] - E[a] \approx E[d] - E[c]$$
This suggests the embedding space encodes semantic attributes as directions:
- Gender: vector from "man" to "woman"
- Plural: vector from "cat" to "cats"
- Capital: vector from "france" to "paris"
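The analogy arithmetic can be tried on a toy example. The vectors below are hand-made assumptions (three dimensions chosen so that one axis behaves like "royalty" and another like "gender"), not embeddings learned by any model, and `analogy` is a hypothetical helper. The sketch only shows the mechanics of solving a : b :: c : ? with a nearest-neighbor search.

```python
import math

# Hand-made illustrative vectors; NOT real learned embeddings.
TOY_EMBEDDINGS = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.2],
    "woman": [0.1, 0.1, 0.2],
    "apple": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? via the word nearest to vectors[b] - vectors[a] + vectors[c]."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is conventional for this test.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "woman", "king", TOY_EMBEDDINGS))
```

With learned embeddings the match is approximate rather than exact, which is why the nearest neighbor is used instead of testing for equality.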
Embedding Dimensions in Practice
| Model | Parameters | Embedding Dim (d_model) |
|---|---|---|
| GPT-2 Small | 124M | 768 |
| GPT-2 XL | 1.5B | 1,600 |
| GPT-3 | 175B | 12,288 |
| LLaMA 2 | 70B | 8,192 |
The Core Task: Predict the Next Token
Now we understand the setup:
- Text gets split into tokens
- Tokens get converted to embedding vectors
- The model processes these vectors...
- And outputs a probability distribution over the next token
Given a context such as "The capital of France is", the model outputs a probability for every token in its vocabulary (often 50,000+ tokens). For example, "Paris" might get 87% probability, "the" 5%, and so on.
How Does It Actually Generate Text?
To generate text, the model repeats this process:
- Given the current text, predict next token probabilities
- Sample a token from that distribution (or simply pick the highest-probability token)
- Add that token to the text
- Repeat!
This is called autoregressive generation — each new token becomes part of the context for predicting the next one.
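The loop can be sketched with a stand-in for the model. Everything here is an assumption for illustration: `TOY_MODEL` is a hand-written table that looks only at the last token, whereas a real LLM conditions on the whole context, and `<s>` / `</s>` are hypothetical start and end markers.

```python
import random

# Hypothetical toy "model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"The": 1.0},
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}


def next_token_probs(tokens):
    """Stand-in for a forward pass; a real model would use every token in `tokens`."""
    return TOY_MODEL[tokens[-1]]


def generate(max_new_tokens=10, greedy=False, seed=0):
    """Autoregressive generation: predict, pick a token, append, repeat."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if greedy:
            token = max(probs, key=probs.get)  # highest-probability token
        else:
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "</s>":
            break  # the model signalled the end of the text
        tokens.append(token)  # the new token becomes part of the context
    return tokens[1:]


print(generate(greedy=True))
print(generate(greedy=False, seed=1))
```

The only difference between greedy and sampled generation is the single line that picks the token; the surrounding loop is identical.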
Autoregressive Language Modeling
The training objective for autoregressive language models is to maximize the likelihood of the training data under the model's distribution:
$$\theta^* = \arg\max_\theta \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
where θ represents the model parameters. This is equivalent to minimizing the cross-entropy loss:
$$H(p, q) = -\sum_{x \in V} p(x) \log q(x)$$
For language modeling, p is the empirical distribution (one-hot at the true next token) and q is the model's predicted distribution.
Sampling Strategies
At inference time, we sample from the predicted distribution P(xₜ₊₁ | x≤ₜ). Common strategies:
| Method | Description | Effect |
|---|---|---|
| Greedy | argmax P(x) | Deterministic, often repetitive |
| Temperature | Sample from softmax(z / T), which is proportional to P(x)^(1/T) | T < 1: more focused; T > 1: more random |
| Top-k | Sample from top k tokens only | Prevents very unlikely tokens |
| Top-p (nucleus) | Sample from the smallest set of tokens whose cumulative probability ≥ p | Dynamic vocabulary restriction |
The Complete Pipeline
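Let's put it all together. Here's what happens when you send a message such as "Hello, how are" to ChatGPT:
Step 1: Input text: "Hello, how are"
Step 2: Tokenize: the text is split into tokens
Step 3: Embed: each token → 768-dimensional vector
Step 4: Neural network: a Transformer processes the vectors (coming in Level 3!)
Step 5: Predict: " you" (45%), " you?" (30%), " you!" (15%)...
Step 6: Sample: " you"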
This process repeats for every token generated. If the model generates 100 tokens, it runs through this pipeline 100 times!
What About the Neural Network?
We've glossed over step 4 — the actual neural network. That's the heart of the model, and it's what we'll explore in the next levels:
- Level 2: Neural Networks — how individual neurons work and learn
- Level 3: Transformers — the architecture powering modern LLMs
- Level 4: Training — how these models actually learn from data
- Level 5: The Math — the underlying mathematics
End-to-End Architecture
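A language model is a composition of functions:
$$\text{LM}(x_{1:T}) = \text{softmax}\big(\text{Transformer}(\text{Embedding}(x_{1:T}))\, W_{\text{out}}\big) \in \mathbb{R}^{T \times |V|}$$
Where:
- Embedding: V^T → ℝ^(T×d) maps token indices to vectors
- Transformer: ℝ^(T×d) → ℝ^(T×d) processes the sequence
- W_out ∈ ℝ^(d×|V|) projects to vocabulary logits
- softmax: converts logits to probabilities
Row t of the output is the predicted distribution P(x_{t+1} | x_{≤t}).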
For autoregressive generation, we apply causal masking so position i only attends to positions ≤ i.
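Causal masking can be illustrated with a small sketch. The function names are hypothetical and the attention scores are arbitrary numbers; the point is only that masked positions receive zero weight after the softmax, so position i never sees positions after it.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax_rows(scores, mask):
    """Row-wise softmax in which disallowed positions get exactly zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Arbitrary raw attention scores for a 3-token sequence.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
attention = masked_softmax_rows(scores, causal_mask(3))
for row in attention:
    print([round(w, 3) for w in row])
```

The first row always puts all of its weight on position 0, because the first token has nothing earlier to attend to.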
Computational Complexity
The cost of a forward pass for a transformer with L layers, sequence length T, and dimension d is:
- Attention: O(T² · d) per layer, quadratic in sequence length
- Feedforward: O(T · d²) per layer, linear in sequence length
- Total: O(L · T · d · (T + d))
Vocabulary: The Model's Dictionary
Every LLM has a fixed vocabulary — a list of every token it knows. This vocabulary is a crucial design decision that affects everything else.
Vocabulary Sizes Across Models
| Model | Vocab Size | Tokenizer | Avg Tokens per Word |
|---|---|---|---|
| GPT-2 | 50,257 | BPE | ~1.3 |
| GPT-3.5/4 | 100,277 | tiktoken (BPE) | ~1.1 |
| LLaMA 2 | 32,000 | SentencePiece | ~1.4 |
| LLaMA 3 | 128,256 | tiktoken (BPE) | ~1.0 |
| Claude | ~100,000 | BPE (proprietary) | ~1.1 |
The Vocabulary Size Trade-off
Choosing the vocabulary size is a balancing act:
- Small vocabulary (10K): Model is compact, but every word gets split into many sub-pieces. "Unbelievable" might become 4+ tokens, making sequences long.
- Medium vocabulary (32K-50K): Good balance. Common words are single tokens, rare words split into 2-3 sub-pieces.
- Large vocabulary (100K-256K): More words are single tokens (shorter sequences), but the embedding matrix is larger, and the output softmax has more options.
Subword Tokenization Saves the Day
The genius of subword tokenization (BPE, WordPiece) is that it handles any input while keeping the vocabulary manageable:
- Common words → single token ("the", "and", "is")
- Common subwords → single token ("un", "ing", "tion")
- Rare words → split into subword tokens ("antidisestablishmentarianism" → "anti" + "dis" + "establish" + "ment" + "arian" + "ism")
- Unknown words → always decomposable into smaller pieces, down to individual bytes (guaranteed by byte-level BPE)
Vocabulary Construction and Properties
The vocabulary V defines the set of all tokens the model can produce. Its size |V| affects both the embedding matrix E ∈ ℝ^(|V|×d) and the output projection W_out ∈ ℝ^(d×|V|).
Vocabulary Size and Model Parameters
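The total parameter count includes vocabulary-dependent terms:
$$N_{\text{vocab}} = \underbrace{|V| \cdot d}_{\text{embedding}} + \underbrace{d \cdot |V|}_{\text{output projection}} = 2\,|V|\,d$$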
For LLaMA 2 70B with |V| = 32,000 and d = 8,192:
- Embedding: 32,000 × 8,192 = 262M parameters
- Output projection: 8,192 × 32,000 = 262M parameters
- Vocabulary overhead: ~524M parameters (~0.7% of total)
For larger vocabularies (128K), this overhead grows to ~2B parameters — still modest relative to the transformer layers, but the computational cost of softmax over 128K classes is significant.
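The arithmetic is easy to reproduce. The helper name `vocab_params` is hypothetical, and the sketch assumes the embedding and output projection are stored as separate matrices, as in the breakdown above.

```python
def vocab_params(vocab_size, d_model):
    """Parameters in the embedding matrix plus a separate output projection."""
    embedding = vocab_size * d_model
    output_projection = d_model * vocab_size
    return embedding + output_projection


total_params = 70e9  # LLaMA 2 70B, as quoted in the text

for vocab_size in (32_000, 128_000):
    overhead = vocab_params(vocab_size, d_model=8_192)
    share = 100 * overhead / total_params
    print(f"|V| = {vocab_size:,}: {overhead / 1e6:,.0f}M parameters ({share:.2f}% of 70B)")
```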
BPE Merge Algorithm — Formal Specification
Given a corpus C and target vocabulary size V_target:
- Initialize vocabulary V₀ from all character-level tokens
- For each step t = 1, 2, ..., until |V_t| reaches the target:
  - Count all adjacent pairs (a, b) across the corpus
  - Find the pair (a*, b*) with maximum count: (a*, b*) = argmax count(a, b)
  - Merge all occurrences: replace "a* b*" with "a*b*" throughout the corpus
  - Add the merged token to the vocabulary: V_t = V_{t-1} ∪ {"a*b*"}
- Record the sequence of merges M = [(a₁,b₁), (a₂,b₂), ...] for encoding
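Once the merge list M is recorded, encoding a new word means replaying the merges in the order they were learned. The sketch below assumes a small, hypothetical merge list and a hypothetical function name `bpe_encode`; real tokenizers use faster data structures but follow the same idea.

```python
def bpe_encode(word, merges):
    """Split `word` into characters, then apply each recorded merge in order."""
    symbols = list(word)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # replace the adjacent pair with one token
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, in the order it would have been learned.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode("lower", MERGES))
print(bpe_encode("slow", MERGES))
print(bpe_encode("xyz", MERGES))
```

Words made of unseen combinations simply stay split into their base symbols, which is how subword tokenizers avoid out-of-vocabulary failures.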
How LLMs Choose the Next Word
Once the model produces probabilities for every token, how does it pick one? This is where sampling strategies come in — and they dramatically affect the quality and personality of the output.
Greedy Decoding
The simplest approach: always pick the most probable token.
Greedy Decoding Example
Context: "The weather today is"
Step 1: Pick " beautiful" (45%)
Step 2: Pick " and" (52%)
Step 3: Pick " sunny" (61%)
Step 4: Pick "." (88%)
Result: "The weather today is beautiful and sunny." — Correct but boring!
Greedy decoding often produces repetitive, generic text. The model gets stuck in loops because it always makes the "safest" choice.
Temperature Sampling
Temperature controls how "creative" vs "safe" the model is:
Temperature Effect on Probabilities
| Temperature setting | "London" | "Berlin" |
|---|---|---|
| Nearly greedy | 1% | 0.3% |
| Default/normal | 5% | 3% |
| Very random | 20% | 15% |
- Low temperature (0.1-0.5): Model is focused and deterministic. Good for code, factual answers.
- Medium temperature (0.7-1.0): Balanced. Good for conversation, general writing.
- High temperature (1.5-2.0): Creative and unpredictable. Good for brainstorming, poetry. Can produce nonsense.
Top-K and Top-P (Nucleus) Sampling
Temperature alone isn't enough — we also need to prevent the model from picking truly bizarre tokens:
- Top-K sampling: Only consider the K most likely tokens. K=50 means the model can only pick from the top 50 options.
- Top-P (nucleus) sampling: Consider the smallest set of tokens whose cumulative probability ≥ P. P=0.9 means pick from tokens that together cover 90% of the probability mass.
Sampling Strategies — Formal Treatment
Given the model's output distribution P(x_{t+1} | x_{<=t}), sampling strategies define how we select the next token.
Temperature Scaling
Temperature τ modifies the distribution by scaling logits before softmax:
$$P_\tau(x_i \mid \text{context}) = \frac{\exp(z_i / \tau)}{\sum_{j} \exp(z_j / \tau)}$$
where z_i are the raw logits (unnormalized scores). As τ → 0, the distribution approaches a delta function on the argmax (greedy). As τ → ∞, it approaches a uniform distribution.
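Temperature scaling is short enough to write directly. The logits below are arbitrary illustrative numbers and `softmax_with_temperature` is a hypothetical name; the sketch just divides the logits by τ before a numerically stable softmax.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau; tau must be positive."""
    if tau <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / tau for z in logits]
    top = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]  # arbitrary example scores
for tau in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution toward uniform.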
Top-K Sampling
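Restrict sampling to the top K tokens, where V_K is the set of indices of the K most probable tokens:
$$P_K(x_i \mid \text{context}) = \begin{cases} \dfrac{P(x_i \mid \text{context})}{\sum_{j \in V_K} P(x_j \mid \text{context})} & \text{if } i \in V_K \\ 0 & \text{if } i \notin V_K \end{cases}$$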
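A minimal sketch of top-K filtering follows: keep the K largest probabilities, zero out the rest, and renormalize. The name `top_k_filter` and the example distribution are assumptions for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, renormalize, and zero out everything else."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]  # arbitrary example distribution
print(top_k_filter(probs, k=2))
```

After filtering, the surviving tokens keep their relative proportions, so sampling from the result only changes which tokens are eligible.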
Nucleus (Top-P) Sampling
Select the minimal set of tokens whose cumulative probability reaches at least p:
$$V_p = \{x_{(1)}, \ldots, x_{(k)}\}, \quad k = \min\Big\{k' : \sum_{i=1}^{k'} P(x_{(i)} \mid \text{context}) \ge p\Big\}$$
where x_{(i)} are tokens sorted in decreasing order of probability. This adapts dynamically: for concentrated distributions, few tokens are considered; for flat distributions, many are included.
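Nucleus filtering can be sketched the same way: walk down the sorted probabilities until the running total reaches p, then renormalize over that set. The function name `top_p_filter` and the two example distributions are assumptions for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability is >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # nucleus is complete
    total = sum(probs[i] for i in keep)
    kept = set(keep)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


peaked = [0.92, 0.05, 0.02, 0.01]  # concentrated distribution
flat = [0.3, 0.3, 0.2, 0.2]        # flat distribution

print(top_p_filter(peaked, p=0.9))
print(top_p_filter(flat, p=0.9))
```

With the same p, the peaked distribution keeps a single token while the flat one keeps all four, which is the dynamic behavior described above.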
Repetition Penalty
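To further reduce repetition, a penalty is applied to already-generated tokens. One common formulation rescales the logit z_i of every token i that already appears in the generated set G, using a penalty r > 1:
$$z_i' = \begin{cases} z_i / r & \text{if } i \in G \text{ and } z_i > 0 \\ z_i \cdot r & \text{if } i \in G \text{ and } z_i \le 0 \\ z_i & \text{if } i \notin G \end{cases}$$
This lowers the logit of previously seen tokens, making the model prefer fresh tokens over repetitions.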
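The exact form of the penalty varies between implementations. The sketch below assumes one widely used variant: for tokens that were already generated, positive logits are divided by a penalty greater than 1 and non-positive logits are multiplied by it, so the logit moves down in both cases. The function name and the example values are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the logits of already-generated tokens (assumed sign-aware variant)."""
    adjusted = list(logits)
    for i in set(generated_ids):
        if adjusted[i] > 0:
            adjusted[i] = adjusted[i] / penalty   # shrink positive logits
        else:
            adjusted[i] = adjusted[i] * penalty   # push non-positive logits further down
    return adjusted


logits = [2.0, -1.0, 0.5]   # arbitrary example scores
already_generated = [0, 1]  # token ids seen so far
print(apply_repetition_penalty(logits, already_generated, penalty=2.0))
```

Handling the sign separately matters: dividing a negative logit by the penalty would raise it and make the repeated token more likely.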
The Context Window: How Much Can It Remember?
Every LLM has a fixed context window — the maximum number of tokens it can process at once. Think of it as the model's "working memory."
Context Window Sizes
| Model | Context Window | Approximate Pages |
|---|---|---|
| Original GPT-2 | 1,024 tokens | ~2 pages |
| GPT-3 | 2,048 tokens | ~4 pages |
| GPT-4 Turbo | 128,000 tokens | ~256 pages |
| Gemini 1.5 Pro | 1,000,000+ tokens | ~2,000 pages |
Why Can't We Just Make It Infinite?
The context window is limited by a fundamental mathematical property of the transformer: self-attention scales quadratically with sequence length.
- 2K context → ~4M attention computations per layer
- 8K context → ~64M attention computations per layer (16× more)
- 128K context → ~16B attention computations per layer (4000× more)
- 1M context → ~1T attention computations per layer (250,000× more)
Every time you double the context window, attention computation increases by 4×. This is why extending context is one of the hardest engineering challenges in AI.
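The growth figures above follow from squaring the context length. A quick sketch, using round context sizes and counting one score per query-key pair (a simplification that ignores heads and the hidden dimension):

```python
def attention_scores_per_layer(context_length):
    """Query-key pairs scored by full self-attention in one layer (one head)."""
    return context_length * context_length


baseline = attention_scores_per_layer(2_000)
for tokens in (2_000, 8_000, 128_000, 1_000_000):
    count = attention_scores_per_layer(tokens)
    ratio = count // baseline
    print(f"{tokens:>9,} tokens: {count:.1e} scores, {ratio:,}x the 2K case")
```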
Context Window — Computational Analysis
The context window length T directly constrains the maximum input + output sequence. For a transformer with L layers, hidden dimension d, and sequence length T:
- Attention: O(L · T² · d), quadratic in T
- Feed-forward: O(L · T · d²), linear in T
- Total: O(L · T · d · (T + d))
Memory Requirements
During inference, we must cache the Key and Value matrices for all previous positions:
KV cache size = 2 (K and V) · L · n_heads · d_head · T · bytes per value
= 2 · 96 · 96 · 128 · T · 2 bytes (GPT-3 scale, 16-bit values)
≈ 4.7MB per token of context
For a 128K token context window, this requires approximately 600GB of KV cache memory alone.
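The estimate can be reproduced with a one-line formula. The function name is hypothetical; the configuration (96 layers, 96 heads of dimension 128, 2 bytes per value) is the GPT-3-scale setting quoted above, and batch size is assumed to be 1.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for `seq_len` tokens (batch size 1)."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return keys_and_values * num_layers * num_heads * head_dim * seq_len * bytes_per_value


per_token = kv_cache_bytes(96, 96, 128, seq_len=1)
full_window = kv_cache_bytes(96, 96, 128, seq_len=128_000)

print(f"per token:   {per_token / 1e6:.1f} MB")
print(f"128K tokens: {full_window / 1e9:.0f} GB")
```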
Extending Context
Several approaches exist to handle longer contexts:
| Method | Approach | Trade-off |
|---|---|---|
| RoPE scaling (position interpolation) | Rescale rotary position frequencies to cover longer sequences | May lose fine positional info |
| ALiBi | Add linear bias to attention scores | No positional embeddings needed |
| Flash Attention | Optimized memory-efficient attention | Same computation, less memory |
| Sliding Window | Local attention with fixed window | Loses global context |
| Sparse Attention | Attend to subset of positions | Approximate, not exact |
How Do We Know If the Model Is Good?
To train a model, we need a way to measure how wrong its predictions are. This measurement is called the loss function — and it's the compass that guides training.
Cross-Entropy Loss (Intuitive)
The most common loss function for language models is cross-entropy. Here's the intuition:
Understanding Cross-Entropy
Imagine the model sees "The cat sat on the" and must predict "mat".
Okay model: Assigns "mat" probability 0.3 → Loss = -log(0.3) = 1.20 (medium loss)
Bad model: Assigns "mat" probability 0.01 → Loss = -log(0.01) = 4.61 (high loss)
The loss is -log(probability). When the model is confident and right, loss is low. When the model is confident and wrong, loss is very high.
Training minimizes this loss across all training examples. When the loss is minimized, the model is assigning high probability to the correct next token.
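The loss values in the example use the natural logarithm and are easy to check. The probabilities 0.3 and 0.01 come from the example above; 0.9 is an added illustrative value for a model that is confident and right.

```python
import math


def token_loss(prob_of_correct_token):
    """Cross-entropy loss for one prediction: -log of the probability given to the truth."""
    return -math.log(prob_of_correct_token)


for p in (0.9, 0.3, 0.01):
    print(f"P(correct) = {p}: loss = {token_loss(p):.2f}")
```

Because of the logarithm, the loss grows slowly while the probability is reasonable and climbs steeply as it approaches zero.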
Cross-Entropy Loss — Formal Definition
For a sequence of tokens x₁, x₂, ..., x_T, the cross-entropy loss is:
$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{i=1}^{T} \log P_\theta(x_i \mid x_{<i})$$
This averages the negative log-probability the model assigns to each correct token, conditioned on all previous tokens.
Equivalence to Cross-Entropy
For each position i, let p_i be the one-hot vector with a 1 at the true token xᵢ and q_i be the model's predicted distribution. Then:
$$H(p_i, q_i) = -\sum_{v \in V} p_i(v) \log q_i(v) = -\log q_i(x_i) = -\log P_\theta(x_i \mid x_{<i})$$
Since p_i is one-hot (only 1 at the true token position), all terms vanish except the one corresponding to the actual next token. This is why language model training uses the term "cross-entropy loss" interchangeably with "negative log-likelihood."
Perplexity
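Perplexity is the exponential of average cross-entropy:
$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{i=1}^{T} \log P_\theta(x_i \mid x_{<i})\right)$$
Perplexity has an intuitive interpretation: it's the model's effective branching factor. If perplexity is 15, the model is as uncertain as if choosing uniformly among 15 tokens at each step.
Rough, illustrative values (lower is better; exact numbers depend on the dataset and tokenizer):
- GPT-2: ~30-40 on web text
- GPT-3: ~15-20 on web text
- GPT-4: ~10-15 on web text
- Random baseline (uniform over the vocabulary): |V| (50,000+)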
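The branching-factor reading can be checked numerically. In this sketch `perplexity` is a hypothetical helper that takes the probabilities a model assigned to each correct token; the input lists are made up for illustration.

```python
import math


def perplexity(probs_of_correct_tokens):
    """exp of the average negative log-probability assigned to the correct tokens."""
    n = len(probs_of_correct_tokens)
    avg_nll = -sum(math.log(p) for p in probs_of_correct_tokens) / n
    return math.exp(avg_nll)


# A model that always spreads its bets evenly over 15 options.
print(perplexity([1 / 15] * 10))

# A uniform guess over a 50,000-token vocabulary (the random baseline).
print(perplexity([1 / 50_000] * 10))

# A mixed case: confident on some tokens, unsure on others.
print(perplexity([0.9, 0.5, 0.1, 0.7]))
```

The first two cases recover the number of equally likely options, up to floating-point rounding.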
How Does Training Actually Work?
Training an LLM means adjusting billions of parameters so the model assigns high probability to good text and low probability to bad text. Here's the process at a high level:
The Training Loop
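- Take a batch of text from the training data
- Have the model predict the next token at every position
- Measure the loss (how wrong the predictions were)
- Compute gradients and nudge every parameter slightly to reduce the loss
- Repeat with the next batch
This loop runs millions of times during training, on trillions of tokens, across thousands of GPUs. Each step nudges the parameters slightly, and over time, the model learns to predict text better and better.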
The Scale of Training
Training Scale Comparison
| Model | Parameters | Training Data | Compute | Estimated Cost |
|---|---|---|---|---|
| GPT-2 (2019) | 1.5B | 40GB text | not officially reported | ~$50K |
| GPT-3 (2020) | 175B | 570GB text | ~355 V100-years (~3.1M V100-hours) | ~$4.6M |
| LLaMA 2 70B (2023) | 70B | 2T tokens | ~1,720,000 A100-hours | ~$2-5M |
| GPT-4 (2023) | ~1.8T (est.) | ~13T tokens (est.) | ~tens of millions GPU-hours | ~$100M+ (est.) |
Training — Formal Framework
The training objective for autoregressive language models is maximum likelihood estimation (MLE):
$$\theta^* = \arg\max_\theta \sum_{x \in D} \sum_{t=1}^{|x|} \log P_\theta(x_t \mid x_{<t})$$
This is equivalent to minimizing the average cross-entropy loss over the training corpus D.
Stochastic Gradient Descent
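Computing the gradient over the entire dataset is infeasible. Instead, we estimate it from small batches:
$$\theta \leftarrow \theta - \eta \cdot \frac{1}{|B|} \sum_{x \in B} \nabla_\theta \mathcal{L}(x; \theta)$$
where η is the learning rate and B is a minibatch of training sequences. The batch size |B| is a critical hyperparameter: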
- Small batch (32-128): Noisy but fast updates, better generalization
- Medium batch (512-2048): Good balance of stability and speed
- Large batch (4,096+ sequences, up to millions of tokens per step): Parallelizable across many GPUs, but may generalize worse
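Minibatch SGD can be demonstrated on a problem small enough to run instantly. This is a toy under stated assumptions: the "model" is just one vector of logits for a single fixed context, the data is a list of next-token ids, and the function names are hypothetical. It relies on the standard result that the gradient of cross-entropy with respect to the logits is the predicted distribution minus the one-hot target.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_fit(targets, vocab_size, lr=0.2, batch_size=4, steps=500, seed=0):
    """Fit logits so softmax(logits) approaches the next-token frequencies in `targets`."""
    rng = random.Random(seed)
    logits = [0.0] * vocab_size
    for _ in range(steps):
        # Sample a small minibatch instead of using the whole dataset.
        batch = [rng.choice(targets) for _ in range(batch_size)]
        probs = softmax(logits)
        # Average gradient over the batch: probs minus the mean one-hot target.
        grad = list(probs)
        for y in batch:
            grad[y] -= 1.0 / batch_size
        # Gradient step: theta <- theta - lr * gradient estimate.
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return softmax(logits)


# Toy data: token 0 follows the context 80% of the time, token 1 20%, token 2 never.
toy_targets = [0] * 8 + [1] * 2
learned = sgd_fit(toy_targets, vocab_size=3)
print([round(p, 2) for p in learned])
```

Each minibatch gives a noisy estimate of the true gradient, so the learned probabilities hover near the data frequencies rather than matching them exactly.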
Learning Rate Schedule
The learning rate η is not constant during training. Common schedules (t is the current step, T the total number of steps):
- Warmup: Start with η ≈ 0 and linearly increase over first N steps
- Cosine decay: η_t = η_min + 0.5(η_max - η_min)(1 + cos(πt/T))
- Linear decay: η_t = η_max · (1 - t/T)
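Warmup and cosine decay are often combined into one schedule. The sketch below is one such combination under stated assumptions: the function name, the argument names, and the example values (1,000 warmup steps, a peak rate of 3e-4) are hypothetical, and the cosine phase measures progress from the end of warmup.

```python
import math


def learning_rate(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup from 0 to lr_max, then cosine decay from lr_max to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 50_000, 100_000):
    lr = learning_rate(step, total_steps=100_000, warmup_steps=1_000, lr_max=3e-4, lr_min=3e-5)
    print(f"step {step:>7,}: lr = {lr:.2e}")
```

Warmup avoids large, unstable updates while the parameters are still random, and the slow decay lets the model settle as training ends.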