What is a Context Window?
A language model doesn't have infinite memory. It can only "see" a fixed number of tokens at once; this limit is called the context window (or context length).
Visualizing the Context Window
[Interactive figure: Context Window Breakdown, showing the tokens used and the tokens remaining for the model's response]
When the context window fills up, the model can't see anything beyond it. This is why very long conversations can "forget" things from the beginning.
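As a rough illustration of why early messages drop out of view, here is a minimal sketch of trimming a conversation to fit a window. The `count_tokens` helper is a hypothetical stand-in that counts whitespace-separated words (a real tokenizer would give different counts), and `fit_to_window` is an illustrative name, not part of any library.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_window(messages: list[str], context_window: int,
                  reserved_for_response: int) -> list[str]:
    """Keep the most recent messages that fit in the input budget."""
    budget = context_window - reserved_for_response
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "My name is Ada and I like chess.",
    "What is a good opening for beginners?",
    "Try the Italian Game.",
    "Thanks! What was my name again?",
]
visible = fit_to_window(history, context_window=30, reserved_for_response=12)
print(visible)
```

With these toy numbers the oldest message (the one containing the name) no longer fits, so the model would have no way to answer the final question.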
Context Window Sizes Across Models
Different models have different context window sizes. This has been one of the key areas of improvement:
[Chart: Context Window Evolution]
What Can You Fit?
| Context Size | Roughly Equivalent To |
|---|---|
| 4,096 tokens | ~3 pages of text, or a long essay |
| 32,768 tokens | ~24 pages, or a short paper |
| 100,000 tokens | ~75 pages, or a novella |
| 1,000,000 tokens | ~750 pages, or a long novel |
Why Context Windows Are Limited
You might wonder: why not just make context windows infinite? There are several challenges:
1. Computational Cost (The Attention Problem)
Remember the attention mechanism from Level 3? It computes relationships between every pair of tokens. For a sequence of n tokens, that means roughly n × n = n² pairwise scores:
- For n = 4,000 tokens: 16 million operations
- For n = 1,000,000 tokens: 1 trillion operations!
The attention computation grows quadratically with sequence length. This is why very long contexts are computationally expensive.
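The quadratic growth can be checked with a few lines of arithmetic. This sketch counts only the pairwise scores for a single attention head in a single layer and ignores constant factors such as the hidden dimension; `attention_pairs` is an illustrative name.

```python
def attention_pairs(n: int) -> int:
    """Number of token pairs scored by full self-attention over n tokens."""
    return n * n


for n in (1_000, 4_000, 100_000, 1_000_000):
    print(f"n = {n:>9,} tokens -> {attention_pairs(n):,} pairwise scores")
```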
2. Memory Requirements
Storing full attention weights for long sequences requires a lot of memory. For a single n × n attention matrix (one head, one layer) stored as 4-byte values:
- 4K tokens: ~64 MB of attention weights
- 100K tokens: ~40 GB of attention weights
- 1M tokens: ~4 TB of attention weights!
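These figures can be reproduced under a simple assumption: one full n × n matrix of 32-bit (4-byte) values, for one head in one layer. The sketch below is only that back-of-the-envelope calculation; real implementations often avoid storing the full matrix, and `attention_matrix_bytes` is an illustrative name.

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one full n x n attention matrix."""
    return n_tokens * n_tokens * bytes_per_value


for n in (4_096, 100_000, 1_000_000):
    size = attention_matrix_bytes(n)
    print(f"{n:>9,} tokens: {size / 1e9:,.2f} GB")
```

Note that the 4K figure works out to 64 MB only when "4K" means 4,096 tokens and a megabyte is counted as 1,024 × 1,024 bytes.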
3. The "Lost in the Middle" Problem
Research on long-context models has found that, even with large context windows, models tend to use information less reliably when it sits in the middle of a long input than when it appears near the beginning or the end.
Strategies for Long Documents
What do you do when your document is longer than the context window? Here are common strategies:
1. Chunking
Split the document into smaller chunks and process each separately:
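A minimal sketch of chunking with overlap, assuming the document has already been turned into a list of tokens. Here whitespace-separated words stand in for real tokens, and `chunk_tokens` is an illustrative helper, not a library function.

```python
def chunk_tokens(tokens: list, chunk_size: int, overlap: int = 0) -> list[list]:
    """Split tokens into chunks of up to chunk_size, each sharing `overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap repeats a few tokens at each boundary so that a sentence cut by one chunk still appears intact in the next, at the cost of processing some tokens twice.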
2. Retrieval-Augmented Generation (RAG)
Instead of feeding the entire document, retrieve only the relevant parts:
- Split documents into chunks and store them in a vector database
- When the user asks a question, find the most relevant chunks
- Only include those chunks in the context
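The steps above can be sketched with a toy retriever. This is an illustration under strong simplifying assumptions: word counts stand in for learned embeddings, a plain Python list stands in for the vector database, and the names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "The context window limits how many tokens the model can see.",
    "Attention cost grows quadratically with sequence length.",
    "Paris is the capital of France.",
]
question = "Why does attention cost grow with sequence length?"
top = retrieve(question, chunks, top_k=1)

# Only the retrieved chunk goes into the prompt, not the whole collection.
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"
print(prompt)
```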
3. Hierarchical Summarization
For very long documents, use multiple passes:
- Summarize each section independently
- Combine section summaries
- Summarize the combined summary if needed
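A sketch of these passes, with one loud assumption: `summarize` below is a hypothetical placeholder that simply keeps the first few words. In practice it would be a call to a language model, and the word limits would be token limits.

```python
def summarize(text: str, max_words: int = 8) -> str:
    """Hypothetical placeholder for a model call: keeps the first max_words words."""
    return " ".join(text.split()[:max_words])


def hierarchical_summary(sections: list[str], max_words: int = 8, limit: int = 20) -> str:
    # Pass 1: summarize each section independently.
    section_summaries = [summarize(section, max_words) for section in sections]
    # Pass 2: combine the section summaries.
    combined = " ".join(section_summaries)
    # Pass 3: summarize the combined summary if it is still over the limit.
    if len(combined.split()) > limit:
        combined = summarize(combined, limit)
    return combined


sections = [
    "Context windows bound how many tokens a model can see at once.",
    "Attention cost grows quadratically, which makes long inputs expensive.",
    "Chunking and retrieval help when a document does not fit.",
]
print(hierarchical_summary(sections, max_words=6, limit=12))
```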
4. Sliding Window Attention
Some models use modified attention mechanisms that don't attend to all tokens:
- Sliding window: Only attend to nearby tokens
- Sparse attention: Attend to specific patterns (e.g., every 100th token)
- Linear attention: Approximate full attention more efficiently
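To make the sliding-window idea concrete, the sketch below builds a boolean mask in which each token may attend only to tokens within `window` positions of itself. It shows a symmetric (non-causal) window for simplicity; actual models differ in the details, and the function names are illustrative.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j, i.e. |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def count_allowed(mask: list[list[bool]]) -> int:
    """Number of (i, j) pairs that are actually scored."""
    return sum(sum(row) for row in mask)


n, window = 6, 1
mask = sliding_window_mask(n, window)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(count_allowed(mask), "of", n * n, "pairs are scored")
```

Because each row allows at most 2 × window + 1 positions, the number of scored pairs grows linearly with n rather than quadratically.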
Context Window Best Practices
Exercises
Exercise 1: Context Budgeting
You have a 4K context window. Your system message is 100 tokens. You want the model to respond with at least 500 tokens. How many tokens can your user input be?
Exercise 2: Chunking Strategy
You have a 50,000 token document and a 4K context window. If you use chunks of 1,000 tokens with 100 token overlap, how many chunks will you have? How many total tokens will be in all chunks combined?
Exercise 3: Cost Analysis
If attention scales as O(n²), how much more expensive is processing 8K tokens compared to 4K tokens? What about 32K vs 4K?