Level 01: Foundations

What is a Large Language Model?

You've probably used ChatGPT, Claude, or Gemini. You type something in, and they respond with surprisingly human-like text. But what are they, really?

At its core, a Large Language Model (LLM) is just a very sophisticated next-word predictor. That's it. When you type "The capital of France is", the model predicts that the next word is probably "Paris".

            The Core Idea: An LLM reads text and predicts what comes next, one word (or piece of word) at a time.
          

The "Large" in Large Language Model refers to two things:

Large training data: These models are trained on billions of web pages, books, and articles
Large number of parameters: GPT-3, for example, has 175 billion internal "knobs" it can adjust

But before we can understand how they predict the next word, we need to understand how they read text in the first place. Computers don't understand words like we do — they need everything converted to numbers.

Definition: Large Language Models

A Large Language Model (LLM) is a neural network trained to model the probability distribution of sequences of tokens in natural language. Formally, given a sequence of tokens x₁, x₂, ..., xₜ, the model estimates:

P(xₜ₊₁ | x₁, x₂, ..., xₜ)

This is the probability distribution over the next token given all previous tokens. The model generates text by sampling from this distribution repeatedly — predicting one token, adding it to the sequence, then predicting the next.

            Key Insight: Despite their impressive capabilities, LLMs are fundamentally 
            autoregressive models performing next-token prediction. All other behaviors (reasoning, 
            instruction following, etc.) emerge from this simple objective trained at scale.
          

Scale Parameters

Modern LLMs are characterized by:

Parameter count: 7B to 500B+ trainable parameters
Training tokens: 1T to 15T+ tokens seen during training
Vocabulary size: 32K to 200K+ unique tokens
Context window: 2K to 2M+ tokens in the input sequence

Tokens: Breaking Text into Pieces

Here's a crucial fact: LLMs don't read words. They read tokens.

A token is a piece of text — it could be a whole word, part of a word, or even just a single character. The process of splitting text into tokens is called tokenization.

🎮 Tokenizer Demo

Here is how a sample sentence is split into tokens:

Tokens:

Hello world ! This is token ization .

Token count: 8

Character count: 35

Notice a few things:

"Hello" is one token, but "tokenization" splits into "token" + "ization"
Spaces are attached to the following word (" world" not "world")
Punctuation gets its own tokens
Common words are usually single tokens; rare words get split

            Why tokens? The model has a fixed vocabulary (usually 32,000 to 200,000 tokens). 
            It can only "understand" these specific pieces. By splitting words into common subpieces, 
            the model can handle ANY word, even ones it never saw during training.
          

Byte Pair Encoding (BPE)

The most common tokenization method is called Byte Pair Encoding. Here's the intuition:

Start with every individual character as a token
Find the most common pair of adjacent tokens
Merge them into a new token
Repeat thousands of times

This is why common words like "the" become single tokens, while rare words get broken down. The tokenizer learns which combinations are most useful!
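The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a production tokenizer: the function name `learn_bpe_merges`, the tiny word list, and the tie-breaking rule are all assumptions made for the example, and real tokenizers work on bytes and much larger corpora.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Step 1: every word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken alphabetically (an arbitrary choice).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Step 3: fuse the chosen pair into one symbol everywhere it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus  # Step 4: repeat with the updated corpus.
    return merges, corpus

merges, corpus = learn_bpe_merges(["the", "the", "the", "then", "they", "cat"], 2)
print(merges)  # [('h', 'e'), ('t', 'he')]
```

After only two merges, "the" is already a single symbol, while the rarer "cat" is still three separate characters.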

Tokenization Algorithms

Tokenization is the process of mapping text to sequences of integers from a finite vocabulary V. Formally, a tokenizer is a function:

T: Σ* → V*

where Σ* is the set of all strings over the input alphabet, and V* is the set of all sequences of vocabulary indices.

Byte Pair Encoding (BPE)

BPE is a subword tokenization algorithm that iteratively merges frequent character pairs. Given a training corpus, BPE:

Initialize vocabulary with all individual characters in the corpus (or all 256 byte values, for byte-level BPE)
Count frequency of all adjacent token pairs
Merge the most frequent pair into a new token
Repeat until vocabulary size reaches target (typically 32K-200K)

The merge operations are greedy and deterministic, ensuring encoding is unambiguous. Decoding is simply concatenation of token strings.

            Properties: BPE balances vocabulary size and sequence length. 
            Rare words decompose into character sequences; common words remain atomic. 
            This provides robustness to out-of-vocabulary words while maintaining efficiency.
          

Modern Variants

Algorithm / Library	Used In	Key Feature
BPE	GPT-2, GPT-3, RoBERTa	Byte-level, merges frequent pairs
WordPiece	BERT, DistilBERT	Maximizes likelihood of training data
SentencePiece	T5, LLaMA	Language-agnostic, handles any text
tiktoken	GPT-3.5, GPT-4	Fast BPE with regex preprocessing

Embeddings: Turning Tokens into Numbers

Now we have tokens as integers. But neural networks work with continuous numbers, not discrete integers. We need to convert each token into a vector — a list of numbers.

This is called an embedding. Think of it as giving each token its own "address" in a high-dimensional space.

📍 Embedding Visualization

Each word becomes a vector of numbers:

"king" → (768 numbers)

"queen" → (768 numbers)

💡 The Magic: Similar words end up with similar vectors. "King" and "queen" are close in this space. "Apple" and "orange" are close to each other, but far from "king".

Why 768 Dimensions?

Different models use different embedding sizes:

Small models: 256-512 dimensions
Medium models (GPT-2 Small): 768 dimensions
Large models (GPT-3): 1,536-12,288 dimensions

More dimensions = more capacity to capture nuance, but also more computation.

Analogies in Vector Space

Here's where it gets wild: embeddings capture semantic relationships. The famous example:

            king - man + woman ≈ queen
            
            If you take the vector for "king", subtract "man", and add "woman", 
            you get approximately the vector for "queen"!

This works because the embedding space learned that "gender" is a direction you can move in. "Man → woman" is roughly the same vector as "king → queen" or "actor → actress".

Embedding Layers

An embedding layer is a learnable lookup table E ∈ ℝ^(V×d) where:

V = vocabulary size (number of unique tokens)
d = embedding dimension (model width)
E[i] = the d-dimensional vector for token i

For input token index i, the embedding is retrieved as:

x_embed = E[i] ∈ ℝ^d

This is equivalent to a one-hot encoding followed by matrix multiplication:

x_embed = e_i^T E

where e_i is the one-hot vector with 1 at position i.
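A small sketch of this equivalence, using plain Python lists in place of a tensor library. The table `E` and its values are arbitrary toy numbers, and the function names are made up for the example.

```python
# Toy embedding table: V = 4 tokens, d = 3 dimensions (values are arbitrary).
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def embed_lookup(E, i):
    """Row lookup: x_embed = E[i]."""
    return E[i]

def embed_one_hot(E, i):
    """Same result via e_i^T E: a one-hot row vector times the table."""
    V, d = len(E), len(E[0])
    one_hot = [1.0 if k == i else 0.0 for k in range(V)]
    return [sum(one_hot[k] * E[k][j] for k in range(V)) for j in range(d)]

print(embed_lookup(E, 2))   # [0.7, 0.8, 0.9]
print(embed_one_hot(E, 2))  # [0.7, 0.8, 0.9]
```

In practice the lookup form is used because it avoids multiplying by a vector that is almost entirely zeros.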

Semantic Structure

The embedding space exhibits linear structure that captures semantic relationships. For word analogies of the form "a is to b as c is to d" (e.g., man:woman::king:queen):

E["king"] - E["man"] + E["woman"] ≈ E["queen"]

This suggests the embedding space encodes semantic attributes as directions:

Gender: vector from "man" to "woman"
Plural: vector from "cat" to "cats"
Capital: vector from "france" to "paris"
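To make the "attributes as directions" idea concrete, here is a sketch with hand-made 2-dimensional vectors. These are not learned embeddings: the numbers are invented so that one axis loosely stands for "royalty" and the other for "maleness". Real embedding spaces have hundreds of dimensions and the analogy only holds approximately.

```python
import math

# Hand-made toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "maleness".
toy = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
    "prince": [0.8, 0.9],
    "apple":  [-0.5, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" via the nearest vector to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded, as is conventional for this test.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "woman", "king", toy))  # queen
```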

            Theoretical Interpretation: Skip-gram and related embedding methods 
            implicitly factorize a word-context co-occurrence matrix. The learned embeddings capture 
            PMI (Pointwise Mutual Information) relationships in the training data.
          

Embedding Dimensions in Practice

Model	Parameters	Embedding Dim (d_model)
GPT-2 Small	124M	768
GPT-2 XL	1.5B	1,600
GPT-3	175B	12,288
LLaMA 2	70B	8,192

The Core Task: Predict the Next Token

Now we understand the setup:

Text gets split into tokens
Tokens get converted to embedding vectors
The model processes these vectors...
And outputs a probability distribution over the next token

🎯 Next Token Prediction Demo

Given the context, what comes next?

                "The capital of France is"
              

Predicted next tokens (top 5):

Paris (87%)

the (5%)

located (3%)

called (2%)

Parisian (1%)

The model outputs probabilities for every token in its vocabulary (often 50,000+ tokens). "Paris" gets 87% probability, "the" gets 5%, and so on.

How Does It Actually Generate Text?

To generate text, the model repeats this process:

Given the current text, predict next token probabilities
Sample (or pick the highest probability token)
Add that token to the text
Repeat!

This is called autoregressive generation — each new token becomes part of the context for predicting the next one.
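The loop can be sketched with a stand-in "model". Everything here is hypothetical: `TOY_MODEL` is a hand-written table that looks only at the last token, whereas a real LLM conditions on the whole context and covers the full vocabulary. The `[BOS]` and `[EOS]` markers are made-up start and end tokens.

```python
import random

# Hypothetical stand-in for a model: next-token probabilities given the last token.
TOY_MODEL = {
    "[BOS]": {"the": 1.0},
    "the":   {"cat": 0.6, "dog": 0.4},
    "cat":   {"sat": 0.7, "ran": 0.3},
    "dog":   {"ran": 0.8, "sat": 0.2},
    "sat":   {"[EOS]": 1.0},
    "ran":   {"[EOS]": 1.0},
}

def generate(model, max_tokens=10, greedy=True, seed=0):
    """Autoregressive generation: predict, pick, append, repeat."""
    rng = random.Random(seed)
    tokens = ["[BOS]"]
    for _ in range(max_tokens):
        probs = model[tokens[-1]]                # 1. next-token probabilities
        if greedy:
            nxt = max(probs, key=probs.get)      # 2a. pick the most likely token
        else:                                    # 2b. or sample from the distribution
            nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(nxt)                       # 3. the new token joins the context
        if nxt == "[EOS]":                       # stop at the end-of-sequence marker
            break
    return tokens[1:]

print(generate(TOY_MODEL))  # ['the', 'cat', 'sat', '[EOS]']
```

Calling `generate(TOY_MODEL, greedy=False)` samples instead, so different seeds can give different sentences.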

            Important Limitation: LLMs don't "know" facts in the way humans do. 
            They predict what text would likely appear next based on patterns in their training data. 
            Sometimes this produces correct answers; sometimes it produces confident-sounding nonsense.
          

Autoregressive Language Modeling

The training objective for autoregressive language models is to maximize the likelihood of the training data under the model's distribution:

L(θ) = Σᵢ log P(xᵢ | x₁, ..., xᵢ₋₁; θ)

where θ represents the model parameters. This is equivalent to minimizing the cross-entropy loss:

H(p, q) = -Σᵢ p(xᵢ) log q(xᵢ)

For language modeling, p is the empirical distribution (one-hot at the true next token) and q is the model's predicted distribution.

Sampling Strategies

At inference time, we sample from the predicted distribution P(xₜ₊₁ | x≤ₜ). Common strategies:

Method	Description	Effect
Greedy	argmax P(x)	Deterministic, often repetitive
Temperature	Sample from softmax(z / τ), i.e. proportional to P(x)^(1/τ)	τ < 1: more focused; τ > 1: more random
Top-k	Sample from top k tokens only	Prevents very unlikely tokens
Top-p (nucleus)	Sample from the smallest set of tokens whose cumulative probability ≥ p	Dynamic vocabulary restriction

            Perplexity: The standard evaluation metric is perplexity, defined as 
            exp(-average log likelihood). Lower is better. It can be interpreted as the 
            effective vocabulary size — the model is as uncertain as if choosing uniformly 
            among that many tokens.
          

The Complete Pipeline

Let's put it all together. Here's what happens when you type "Hello, how are" into ChatGPT:

🔄 Full Pipeline Visualization

1. Input Text
"Hello, how are"

↓

2. Tokenize

Hello , how are

↓

3. Embed
Each token → 768-dimensional vector

↓

4. Neural Network Processing
(Transformers — coming in Level 3!)

↓

5. Output Probabilities
" you" (45%), " you?" (30%), " you!" (15%)...

↓

6. Sample & Generate
" you"

This process repeats for every token generated. If the model generates 100 tokens, it runs through this pipeline 100 times!

What About the Neural Network?

We've glossed over step 4 — the actual neural network. That's the heart of the model, and it's what we'll explore in the next levels:

Level 2: Neural Networks — how individual neurons work and learn
Level 3: Transformers — the architecture powering modern LLMs
Level 4: Training — how these models actually learn from data
Level 5: The Math — the underlying mathematics

End-to-End Architecture

A language model is a composition of functions:

P(xₜ₊₁ | x≤ₜ) = softmax(Transformer(Embedding(x≤ₜ)) · W_out)

Where:

Embedding: {1, ..., |V|}^T → ℝ^(T×d) maps a sequence of T token indices to vectors
Transformer: ℝ^(T×d) → ℝ^(T×d) processes the sequence
W_out ∈ ℝ^(d×|V|) projects to vocabulary logits
softmax: converts logits to probabilities

For autoregressive generation, we apply causal masking so position i only attends to positions ≤ i.
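As a minimal sketch, the causal mask is just a lower-triangular table of booleans. The function name `causal_mask` is chosen for this example; deep-learning libraries provide their own equivalents.

```python
def causal_mask(T):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

# Visualize a 4-token mask: "x" = allowed, "." = blocked (a future position).
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

Each row is one position: the first token sees only itself, the last token sees everything before it.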

Computational Complexity

The forward pass for a transformer with L layers, sequence length T, and dimension d:

Attention: O(T² · d) per layer, quadratic in sequence length
Feedforward: O(T · d²) per layer, linear in sequence length
Total: O(L · T · d · (T + d))

            Memory Bottleneck: The T² attention term dominates for long sequences. 
            This motivates research into efficient attention variants (sparse attention, linear attention, 
            Flash Attention optimizations).
          

Vocabulary: The Model's Dictionary

Every LLM has a fixed vocabulary — a list of every token it knows. This vocabulary is a crucial design decision that affects everything else.

Vocabulary Sizes Across Models

Model	Vocab Size	Tokenizer	Avg Tokens per Word (approx.)
GPT-2	50,257	BPE	~1.3
GPT-3.5/GPT-4	100,277	tiktoken (BPE)	~1.1
LLaMA 2	32,000	SentencePiece	~1.4
LLaMA 3	128,256	tiktoken (BPE)	~1.0
Claude	~100,000	proprietary	~1.1

The Vocabulary Size Trade-off

Choosing the vocabulary size is a balancing act:

Small vocabulary (10K): Model is compact, but every word gets split into many sub-pieces. "Unbelievable" might become 4+ tokens, making sequences long.
Medium vocabulary (32K-50K): Good balance. Common words are single tokens, rare words split into 2-3 sub-pieces.
Large vocabulary (100K-256K): More words are single tokens (shorter sequences), but the embedding matrix is larger, and the output softmax has more options.

            Why It Matters: The output layer of an LLM must produce a probability for every token in the vocabulary. 
            With 100,000 tokens, that's 100,000 probabilities computed at every single generation step! This is why the vocabulary size 
            directly affects both memory usage and computational cost.
          

Subword Tokenization Saves the Day

The genius of subword tokenization (BPE, WordPiece) is that it handles any input while keeping the vocabulary manageable:

Common words → single token ("the", "and", "is")
Common subwords → single token ("un", "ing", "tion")
Rare words → split into subword tokens ("antidisestablishmentarianism" → "anti" + "dis" + "establish" + "ment" + "arian" + "ism")
Unknown words → always decomposable into individual bytes (guaranteed by byte-level BPE)

Vocabulary Construction and Properties

The vocabulary V defines the set of all tokens the model can produce. Its size |V| affects both the embedding matrix E ∈ ℝ^(|V|×d) and the output projection W_out ∈ ℝ^(d×|V|).

Vocabulary Size and Model Parameters

The total parameter count includes vocabulary-dependent terms:

|params| = |E| + |transformer| + |W_out| ≈ 2|V|d + |transformer|

For LLaMA 2 70B with |V| = 32,000 and d = 8,192:

Embedding: 32,000 × 8,192 = 262M parameters
Output projection: 8,192 × 32,000 = 262M parameters
Vocabulary overhead: ~524M parameters (~0.7% of total)

For larger vocabularies (128K), this overhead grows to ~2B parameters — still modest relative to the transformer layers, but the computational cost of softmax over 128K classes is significant.
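The arithmetic is easy to check. This sketch assumes, as the formula above does, that the embedding table and the output projection are separate (untied) matrices. The function name is made up for the example.

```python
def vocab_params(vocab_size, d_model):
    """Embedding table plus untied output projection: 2 * |V| * d parameters."""
    return 2 * vocab_size * d_model

for v in (32_000, 128_000):
    n = vocab_params(v, 8_192)
    print(f"|V| = {v:>7,}: {n / 1e6:,.0f}M parameters")
# |V| =  32,000: 524M parameters
# |V| = 128,000: 2,097M parameters
```

Models that tie the two matrices (sharing one table for input and output) pay only half of this.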

BPE Merge Algorithm — Formal Specification

Given a corpus C and target vocabulary size V_target:

Initialize vocabulary V₀ from all character-level tokens
For each step t = 1, 2, ..., until |V_t| reaches target:
- Count all adjacent pairs (a, b) across the corpus
- Find the pair (a*, b*) with maximum count: (a*, b*) = argmax count(a, b)
- Merge all occurrences: replace "a b" with "ab" throughout corpus
- Add "ab" to vocabulary: V_{t+1} = V_t ∪ {"ab"}
Record the sequence of merges M = [(a₁,b₁), (a₂,b₂), ...] for encoding

            Time Complexity: Training BPE on a corpus of N tokens with V merges requires O(NV) 
            time with a naive implementation, but can be optimized to O(N log V) using priority queues. 
            Encoding new text is deterministic: apply the recorded merges in order.
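Encoding can be sketched as replaying the recorded merge list over a new word. The merge list `toy_merges` below is hypothetical, written by hand as if it had been learned from a small corpus. Production tokenizers use faster data structures but produce the same kind of result.

```python
def bpe_encode(word, merges):
    """Split `word` into characters, then apply each merge rule in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            # Fuse the pair (a, b) wherever it appears side by side.
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, in the order the merges were learned.
toy_merges = [("h", "e"), ("t", "he"), ("i", "n"), ("in", "g")]

print(bpe_encode("the", toy_merges))    # ['the']
print(bpe_encode("thing", toy_merges))  # ['t', 'h', 'ing']
```

Decoding is the reverse and needs no rules at all: `"".join(tokens)` gives back the original word.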
          

How LLMs Choose the Next Word

Once the model produces probabilities for every token, how does it pick one? This is where sampling strategies come in — and they dramatically affect the quality and personality of the output.

Greedy Decoding

The simplest approach: always pick the most probable token.

Greedy Decoding Example

              Input: "The weather today is"

              Step 1: Pick " beautiful" (45%)

              Step 2: Pick " and" (52%)

              Step 3: Pick " sunny" (61%)

              Step 4: Pick "." (88%)

Result: "The weather today is beautiful and sunny." — Correct but boring!

Greedy decoding often produces repetitive, generic text. The model gets stuck in loops because it always makes the "safest" choice.

Temperature Sampling

Temperature controls how "creative" vs "safe" the model is:

Temperature Effect on Probabilities

T = 0.1
Nearly greedy

                  "Paris" → ~100%

                  "London" → ~0%

                  "Berlin" → ~0%

T = 1.0
Default/normal

                  "Paris" → 87%

                  "London" → 5%

                  "Berlin" → 3%

T = 2.0
Very random

                  "Paris" → 35%

                  "London" → 8%

                  "Berlin" → 6%

Low temperature (0.1-0.5): Model is focused and deterministic. Good for code, factual answers.
Medium temperature (0.7-1.0): Balanced. Good for conversation, general writing.
High temperature (1.5-2.0): Creative and unpredictable. Good for brainstorming, poetry. Can produce nonsense.

Top-K and Top-P (Nucleus) Sampling

Temperature alone isn't enough — we also need to prevent the model from picking truly bizarre tokens:

Top-K sampling: Only consider the K most likely tokens. K=50 means the model can only pick from the top 50 options.
Top-P (nucleus) sampling: Consider the smallest set of tokens whose cumulative probability ≥ P. P=0.9 means pick from tokens that together cover 90% of the probability mass.

            In Practice: Most modern LLM APIs use temperature + top-p together. 
            A typical configuration: temperature=0.7, top_p=0.9. This gives creative but coherent output 
            while filtering out extremely unlikely tokens.
          

Sampling Strategies — Formal Treatment

Given the model's output distribution P(xₜ₊₁ | x≤ₜ), sampling strategies define how we select the next token.

Temperature Scaling

Temperature τ modifies the distribution by scaling logits before softmax:

P_τ(x_i | context) = exp(z_i / τ) / Σ_j exp(z_j / τ)

Where z_i are the raw logits (unnormalized scores). As τ → 0, the distribution approaches a delta function on the argmax (greedy). As τ → ∞, it approaches a uniform distribution.
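A direct sketch of the formula, with the usual max-subtraction trick for numerical stability. The example logits are arbitrary and the function name is made up for the illustration.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Divide logits by tau, then apply a numerically stable softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)                       # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                  # arbitrary example scores
for tau in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

Running it shows the top token taking nearly all the probability at τ = 0.1 and the three tokens moving closer together at τ = 2.0.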

Top-K Sampling

Restrict sampling to the top K tokens:

V_K = {top K tokens by probability}

P_K(x_i | context) = P(x_i | context) / Σ_{j ∈ V_K} P(x_j | context)    if i ∈ V_K
P_K(x_i | context) = 0    if i ∉ V_K
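A minimal sketch of top-K filtering on a list of probabilities. The function name and the example distribution are assumptions for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; the rest get probability 0."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

filtered = top_k_filter([0.5, 0.3, 0.15, 0.05], k=2)
print([round(p, 3) for p in filtered])  # [0.625, 0.375, 0.0, 0.0]
```

The next token is then sampled from the filtered distribution rather than the original one.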

Nucleus (Top-P) Sampling

Select the minimal set of tokens whose cumulative probability is at least p:

V_p = minimal {x_{(1)}, x_{(2)}, ..., x_{(k)}} s.t. Σᵢ₌₁ᵏ P(x_{(i)} | context) ≥ p

where x_{(i)} are tokens sorted in decreasing order of probability. This adapts dynamically: for concentrated distributions, few tokens are considered; for flat distributions, many are included.
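The adaptive behaviour shows up clearly in a small sketch. The function name and both example distributions are made up for the illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens with cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:       # stop as soon as the nucleus covers p
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

peaked = [0.92, 0.05, 0.02, 0.01]   # concentrated distribution
flat = [0.25, 0.25, 0.25, 0.25]     # flat distribution

print(sum(1 for q in top_p_filter(peaked, 0.9) if q > 0))  # 1 token kept
print(sum(1 for q in top_p_filter(flat, 0.9) if q > 0))    # 4 tokens kept
```

With the same p = 0.9, the peaked distribution keeps a single token while the flat one keeps all four.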

Repetition Penalty

To further reduce repetition, a penalty is applied to already-generated tokens:

z_i' = z_i - penalty if token i has been generated before

Lowering the logit of previously seen tokens makes the model prefer fresh tokens over repetitions. This additive form is one of several in use; some implementations instead divide positive logits by a penalty factor.
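A sketch of the additive form given above. The function name and arguments are hypothetical, and the penalty is applied once per distinct token regardless of how many times that token has appeared.

```python
def apply_penalty(logits, generated_ids, penalty):
    """Subtract `penalty` from the logit of every token id already generated."""
    seen = set(generated_ids)
    return [z - penalty if i in seen else z for i, z in enumerate(logits)]

# Tokens 0 and 2 were generated earlier; token 1 was not.
print(apply_penalty([2.0, 1.0, 0.5], generated_ids=[0, 0, 2], penalty=1.0))
# [1.0, 1.0, -0.5]
```

After the penalty, token 0 no longer dominates token 1, so a repeat becomes less likely.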

The Context Window: How Much Can It Remember?

Every LLM has a fixed context window — the maximum number of tokens it can process at once. Think of it as the model's "working memory."

Context Window Sizes

Model	Context Window	Approximate Pages
Original GPT-2	1,024 tokens	~2 pages
GPT-3	2,048 tokens	~4 pages
GPT-4 Turbo	128,000 tokens	~256 pages
Gemini 1.5 Pro	1,000,000+ tokens	~2,000 pages

Why Can't We Just Make It Infinite?

The context window is limited by a fundamental mathematical property of the transformer: self-attention scales quadratically with sequence length.

2K context → ~4M attention computations per layer
8K context → ~64M attention computations per layer (16× more)
128K context → ~16B attention computations per layer (4000× more)
1M context → ~1T attention computations per layer (250,000× more)

Every time you double the context window, attention computation increases by 4×. This is why extending context is one of the hardest engineering challenges in AI.
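The quadratic growth can be reproduced with a one-line function. This counts query-key pairs only (per head, per layer) and ignores everything else in the model, so treat it as a rough illustration. The context lengths are exact powers of two, which is why the ratios differ slightly from the rounded figures above.

```python
def attention_scores_per_layer(context_len):
    """Number of query-key pairs in full self-attention: T * T."""
    return context_len ** 2

base = attention_scores_per_layer(2_048)
for T in (2_048, 8_192, 131_072, 1_048_576):
    n = attention_scores_per_layer(T)
    print(f"T = {T:>9,}: {n:.2e} scores, {n // base:,}x the 2K cost")
```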

            Context ≠ Memory: Having a large context window doesn't mean the model 
            "remembers" everything equally. Models tend to pay more attention to the beginning and end 
            of long contexts — a phenomenon called the "lost in the middle" effect.
          

Context Window — Computational Analysis

The context window length T directly constrains the maximum input + output sequence. For a transformer with L layers, hidden dimension d, and sequence length T:

Attention: O(L · T² · d), quadratic in T
Feed-forward: O(L · T · d²), linear in T
Total: O(L · T · d · (T + d))

Memory Requirements

During inference, we must cache the Key and Value matrices for all previous positions:

KV Cache Size = 2 · L · n_heads · d_head · T · sizeof(float16)
              = 2 · 96 · 96 · 128 · T · 2 bytes    (GPT-3 scale)
              ≈ 4.7 MB per token of context

For a 128K token context window, this requires approximately 600GB of KV cache memory alone.
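The same calculation as a small function, using the GPT-3-scale numbers from the formula (96 layers, 96 heads, head dimension 128, 2 bytes per value). The function name is made up, and the estimate assumes standard multi-head attention with no cache compression or key/value sharing.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, n_tokens, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, head, and cached position."""
    return 2 * n_layers * n_heads * d_head * n_tokens * bytes_per_value

per_token = kv_cache_bytes(96, 96, 128, n_tokens=1)
total = kv_cache_bytes(96, 96, 128, n_tokens=128_000)
print(f"{per_token / 1e6:.1f} MB per token, {total / 1e9:.0f} GB for 128,000 tokens")
# 4.7 MB per token, 604 GB for 128,000 tokens
```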

Extending Context

Several approaches exist to handle longer contexts:

Method	Approach	Trade-off
RoPE scaling (position interpolation)	Rescale rotary position frequencies to cover longer sequences	May lose fine positional info
ALiBi	Add linear bias to attention scores	No positional embeddings needed
Flash Attention	Optimized memory-efficient attention	Same computation, less memory
Sliding Window	Local attention with fixed window	Loses global context
Sparse Attention	Attend to subset of positions	Approximate, not exact

How Do We Know If the Model Is Good?

To train a model, we need a way to measure how wrong its predictions are. This measurement is called the loss function — and it's the compass that guides training.

Cross-Entropy Loss (Intuitive)

The most common loss function for language models is cross-entropy. Here's the intuition:

Understanding Cross-Entropy

Imagine the model sees "The cat sat on the" and must predict "mat".

Good model: Assigns "mat" probability 0.8 → Loss = -log(0.8) = 0.22 (low loss)

Okay model: Assigns "mat" probability 0.3 → Loss = -log(0.3) = 1.20 (medium loss)

Bad model: Assigns "mat" probability 0.01 → Loss = -log(0.01) = 4.61 (high loss)

The loss is -log(probability). When the model is confident and right, loss is low. When the model is confident and wrong, loss is very high.
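These values are quick to verify. The sketch assumes the natural logarithm, which is the usual convention for this loss. The function name is made up for the example.

```python
import math

def token_loss(p_correct):
    """Cross-entropy for one prediction: -ln(probability of the true token)."""
    return -math.log(p_correct)

for p in (0.8, 0.3, 0.01):
    print(f"p = {p:<5} loss = {token_loss(p):.2f}")
```

Halving the probability of the correct token always adds the same amount to the loss (ln 2, about 0.69), which is why very small probabilities are punished so heavily.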

Training minimizes this loss across all training examples. When the loss is minimized, the model is assigning high probability to the correct next token.

            Why -log? The logarithm turns products (of many probabilities) into sums 
            (easier to optimize), and the negative sign makes "better predictions" produce smaller numbers. 
            -log(p) goes to 0 as p goes to 1, and goes to infinity as p goes to 0.
          

Cross-Entropy Loss — Formal Definition

For a sequence of tokens x₁, x₂, ..., x_T, the cross-entropy loss is:

L(θ) = -(1/T) Σᵢ₌₁ᵀ log P(xᵢ | x₁, ..., xᵢ₋₁; θ)

This averages the negative log-probability the model assigns to each correct token, conditioned on all previous tokens.

Equivalence to Cross-Entropy

For each position i, let p_i be the one-hot vector at position xᵢ (the true token) and q_i be the model's predicted distribution. Then:

L_i = -Σⱼ p_ij · log q_ij = -log q_{i, xᵢ}

Since p_i is one-hot (only 1 at the true token position), all terms vanish except the one corresponding to the actual next token. This is why language model training uses the term "cross-entropy loss" interchangeably with "negative log-likelihood."

Perplexity

Perplexity is the exponential of average cross-entropy:

Perplexity = exp(L) = exp(-(1/T) Σᵢ₌₁ᵀ log P(xᵢ | x₁, ..., xᵢ₋₁; θ))

Perplexity has an intuitive interpretation: it's the model's effective branching factor. If perplexity is 15, the model is as uncertain as if choosing uniformly among 15 tokens at each step.

Typical Perplexity Values:

GPT-2: ~30-40 on web text
GPT-3: ~15-20 on web text
GPT-4: ~10-15 on web text
Random baseline: ~|V| (50,000+)
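The branching-factor reading can be checked numerically. In this sketch, `token_probs` is a hypothetical list holding the probability the model gave to each correct token in a sequence.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always gives the true token probability 1/15 has perplexity 15.
print(round(perplexity([1 / 15] * 4), 6))   # 15.0

# Mixed confidence: perplexity is the inverse geometric mean of the probabilities.
print(round(perplexity([0.5, 0.125]), 6))   # 4.0
```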

How Does Training Actually Work?

Training an LLM means adjusting billions of parameters so the model assigns high probability to good text and low probability to bad text. Here's the process at a high level:

The Training Loop

1. Get a batch of text: e.g., 512 random documents from the training data
2. Forward pass: run the text through the model, get predictions for each position
3. Compute loss: compare predictions to actual next tokens, compute cross-entropy
4. Backward pass: compute gradients (how much each parameter contributed to the loss)
5. Update parameters: nudge each parameter slightly in the direction that reduces loss

This loop runs millions of times during training, on trillions of tokens, across thousands of GPUs. Each step nudges the parameters slightly, and over time, the model learns to predict text better and better.
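The five steps can be seen in miniature with a deliberately tiny "model": one adjustable logit per token and no context at all. This is an illustrative toy, not how an LLM is built. It relies on the fact that the gradient of cross-entropy with respect to the logits is the predicted distribution minus the one-hot target.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train(targets, vocab_size, steps=200, lr=0.5):
    """Fit one logit per token so softmax(logits) matches the token frequencies."""
    logits = [0.0] * vocab_size                      # the parameters
    losses = []
    for _ in range(steps):
        batch = targets                              # 1. get a batch (here: all data)
        probs = softmax(logits)                      # 2. forward pass
        loss = -sum(math.log(probs[t]) for t in batch) / len(batch)  # 3. loss
        grad = [0.0] * vocab_size                    # 4. backward pass
        for t in batch:
            for j in range(vocab_size):
                one_hot = 1.0 if j == t else 0.0
                grad[j] += (probs[j] - one_hot) / len(batch)
        logits = [z - lr * g for z, g in zip(logits, grad)]  # 5. update
        losses.append(loss)
    return softmax(logits), losses

# Token 0 appears three times, token 1 once, token 2 never.
probs, losses = train(targets=[0, 0, 0, 1], vocab_size=3)
print([round(p, 2) for p in probs], round(losses[0], 3), round(losses[-1], 3))
```

The loss starts at ln 3 (a uniform guess over three tokens) and falls as the predicted probabilities approach the observed frequencies of roughly 0.75, 0.25, and 0.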

The Scale of Training

Training Scale Comparison

Model	Parameters	Training Data	Compute (est.)	Estimated Cost
GPT-2 (2019)	1.5B	40GB text	~256 V100-years	~$50K
GPT-3 (2020)	175B	570GB text	~355 V100-years	~$4.6M
LLaMA 2 70B (2023)	70B	2T tokens	~1,720,000 A100-hours	~$2-5M
GPT-4 (2023)	~1.8T (est.)	~13T tokens (est.)	~tens of millions GPU-hours	~$100M+ (est.)

            The Bitter Lesson: The single most important factor in AI progress has been 
            scale — more data, more compute, more parameters. As Richard Sutton wrote in 2019: 
            "The biggest lesson that can be read from 70 years of AI research is that general methods 
            that leverage computation are ultimately the most effective."
          

Training — Formal Framework

The training objective for autoregressive language models is maximum likelihood estimation (MLE):

θ* = argmax_θ Σ_{(x₁,...,x_T) ∈ D} Σᵢ₌₁ᵀ log P(xᵢ | x₁, ..., xᵢ₋₁; θ)

This is equivalent to minimizing the average cross-entropy loss over the training corpus D.

Stochastic Gradient Descent

Computing the gradient over the entire dataset is infeasible. Instead, we estimate it from small batches:

θ_{t+1} = θ_t - η · (1/|B|) Σ_{i ∈ B} ∇_θ L(x_i; θ_t)

Where B is a minibatch of training sequences. The batch size |B| is a critical hyperparameter:

Small batch (32-128): Noisy but fast updates, better generalization
Medium batch (512-2048): Good balance of stability and speed
Large batch (4096-4M): Parallelizable across many GPUs, but may generalize worse

Learning Rate Schedule

The learning rate η is not constant during training. Common schedules:

Warmup: Start with η ≈ 0 and linearly increase over first N steps
Cosine decay: η_t = η_min + 0.5(η_max - η_min)(1 + cos(πt/T))
Linear decay: η_t = η_max · (1 - t/T)

            Why warmup? Early in training, the model's parameters are random. 
            Large gradients can destabilize training. Warmup allows the model to settle into 
            a reasonable region of parameter space before taking bigger steps.
          
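Warmup and cosine decay are often combined into one schedule. The sketch below is one plausible way to do that, not a reference implementation: the function name, the argument names, and the choice to measure the cosine phase from the end of warmup are all assumptions for the example.

```python
import math

def learning_rate(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        # Warmup: grow linearly from lr_max / warmup_steps up to lr_max.
        return lr_max * (step + 1) / warmup_steps
    # Cosine phase: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, f"{learning_rate(step, 1000, 100, lr_max=1e-3):.6f}")
```

The printed values rise during the first 100 steps, peak at `lr_max`, pass through half of it midway through the cosine phase, and reach `lr_min` at the final step.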