Lesson 4 of 10 in Level 01

Context Windows

Understanding how much text LLMs can process, context window sizes, and strategies for long documents.

What is a Context Window?

A language model doesn't have infinite memory. It can only "see" a fixed number of tokens at once — this is called the context window (or context length).

Context Window: The maximum number of tokens the model can take into account at once. Both your input (prompt) and the model's output (completion) count toward this limit.

Visualizing the Context Window

Context Window Breakdown (example with a 4,096-token context)

System: You are a helpful assistant...
User: Can you summarize this article? [Article text... ~2000 tokens]
Assistant: Here's a summary... [Generating...]

Used: 3,481 of 4,096 tokens, leaving 615 tokens for the model's response.

When the context window fills up, the model can't see anything beyond it. In a chat, this typically means the oldest messages are dropped or truncated to make room, which is why very long conversations can "forget" things from the beginning.
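
The arithmetic behind the breakdown above can be sketched in a few lines. The function name `remaining_tokens` is a hypothetical helper chosen for illustration, not part of any library.

```python
def remaining_tokens(context_window: int, used_tokens: int) -> int:
    """Return how many tokens are left for the model's response."""
    if used_tokens > context_window:
        raise ValueError("prompt already exceeds the context window")
    return context_window - used_tokens


# Numbers from the breakdown above: 4,096 total, 3,481 already used.
print(remaining_tokens(4096, 3481))  # 615
```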

Context Window Sizes Across Models

Different models have different context window sizes. This has been one of the key areas of improvement:

Context Window Evolution

Model Context Window
GPT-3 2K tokens
GPT-4 8K or 32K tokens
Claude 2 100K tokens
Claude 3 200K tokens
Gemini 1.5 1M tokens
Llama 2 4K tokens

What Can You Fit?

Context Size Roughly Equivalent To
4,096 tokens ~3 pages of text, or a long essay
32,768 tokens ~24 pages, or a short paper
100,000 tokens ~75 pages, or a novella
1,000,000 tokens ~750 pages, or a long novel

Why Context Windows Are Limited

You might wonder: why not just make context windows infinite? There are several challenges:

1. Computational Cost (The Attention Problem)

Remember the attention mechanism from Level 3? It computes relationships between every pair of tokens. This means:

Attention computation: O(n²)
For n = 4,000 tokens: 16 million operations
For n = 1,000,000 tokens: 1 trillion operations!

The attention computation grows quadratically with sequence length. This is why very long contexts are computationally expensive.
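
To make the quadratic growth concrete, here is a minimal sketch that counts the token pairs full self-attention has to score. It assumes one "operation" per pair, which is a simplification; `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of token pairs scored by full self-attention: n * n."""
    return n_tokens * n_tokens


# The two sequence lengths used in the figures above.
for n in (4_000, 1_000_000):
    print(f"n = {n:>9,}: {attention_pairs(n):,} pairwise scores")
```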

2. Memory Requirements

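Storing attention weights for long sequences requires a lot of memory. A full attention matrix has one entry for every pair of tokens, so a naive implementation needs memory that grows quadratically with sequence length, just like the computation.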
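
As a rough illustration, the sketch below estimates the size of a single full attention matrix. It assumes a naive implementation that stores every weight as a 16-bit (2-byte) value, for one head in one layer; optimized implementations may avoid materializing the whole matrix, so treat these numbers as an upper-bound intuition, not a measurement. `attention_matrix_bytes` is a hypothetical helper name.

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one n x n attention matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_value


for n in (4_096, 32_768, 100_000):
    gib = attention_matrix_bytes(n) / 2**30  # convert bytes to GiB
    print(f"{n:>7,} tokens: {gib:,.2f} GiB per head per layer")
```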

3. The "Lost in the Middle" Problem

Research shows that even with large context windows, models perform worse on information in the middle of long documents. They tend to focus on the beginning and end.

Key Finding: As a rough illustration, in a 100K-token context a model can behave as if the context were only about 50K tokens: the middle portion gets "lost" or is attended to less effectively.

Strategies for Long Documents

What do you do when your document is longer than the context window? Here are common strategies:

1. Chunking

Split the document into smaller chunks and process each separately:

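# Chunking strategy
def chunk_document(text, chunk_size=1000, overlap=100):
    """Split text into overlapping chunks.

    Sizes are measured in characters here; pass a list of tokens
    instead of a string to chunk by tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid a redundant overlap-only chunk
        start = end - overlap  # Overlap to maintain context
    return chunks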

2. Retrieval-Augmented Generation (RAG)

Instead of feeding the entire document, retrieve only the relevant parts:

  1. Split documents into chunks and store in a vector database
  2. When user asks a question, find the most relevant chunks
  3. Only include those chunks in the context

RAG is a widely used approach for working with large document collections. It combines the broad knowledge of LLMs with targeted retrieval of relevant information.
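
The three steps above can be sketched with the standard library alone. This is a toy under stated assumptions: a plain Python list stands in for the vector database, and word-count vectors with cosine similarity stand in for learned embeddings. All function names (`bag_of_words`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query, bag_of_words(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str], top_k: int = 2) -> str:
    """Put only the retrieved chunks into the context, then the question."""
    context = "\n\n".join(retrieve(question, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"


chunks = [
    "The context window is the maximum number of tokens a model can see at once.",
    "Bananas are rich in potassium and easy to digest.",
    "Attention cost grows quadratically with the number of tokens.",
]
print(build_prompt("How many tokens can a model see at once?", chunks, top_k=1))
```

A real system would replace `bag_of_words` with an embedding model and the list with a vector store, but the flow (index, retrieve, then prompt with only the retrieved chunks) stays the same.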

3. Hierarchical Summarization

For very long documents, use multiple passes:

  1. Summarize each section independently
  2. Combine section summaries
  3. Summarize the combined summary if needed
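
A minimal sketch of these passes follows. The `summarize` argument is a placeholder for whatever model call you use; here a trivial "keep the first sentence" function stands in for it so the example runs on its own. The `max_chars` threshold is an assumed stand-in for a real token budget.

```python
from typing import Callable


def hierarchical_summary(
    sections: list[str],
    summarize: Callable[[str], str],
    max_chars: int = 2000,
) -> str:
    """Summarize sections, merge the results, and re-summarize if too long."""
    # Step 1: summarize each section independently.
    summaries = [summarize(section) for section in sections]
    # Step 2: combine the section summaries.
    combined = "\n".join(summaries)
    # Step 3: summarize the combined summary if it is still too long.
    if len(combined) > max_chars:
        combined = summarize(combined)
    return combined


def first_sentence(text: str) -> str:
    """Placeholder summarizer: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


sections = [
    "Context windows are finite. They limit how much text a model sees.",
    "Attention is quadratic. Long inputs are expensive.",
]
print(hierarchical_summary(sections, first_sentence))
```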

4. Sliding Window Attention

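Some models use modified attention mechanisms that don't attend to all tokens. In sliding window attention, each token attends only to a fixed number of nearby tokens, so for a fixed window size the cost grows roughly linearly with sequence length instead of quadratically.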
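
One way to picture this is as a mask over the attention matrix. The sketch below builds a causal sliding-window mask in plain Python, assuming each token may attend to itself and the `window - 1` tokens before it; actual models differ in the details, and `sliding_window_mask` is a hypothetical helper name.

```python
def sliding_window_mask(n_tokens: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Causal variant: each token sees itself and the previous window - 1 tokens.
    """
    return [
        [i - window < j <= i for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


mask = sliding_window_mask(n_tokens=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Full attention scores n * n pairs; the window scores at most n * window.
print(sum(map(sum, mask)), "of", 6 * 6, "pairs")
```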

Context Window Best Practices

Practical Guidelines

✓ DO: Put the most important instructions at the end of the prompt (just before the response)
✓ DO: Reserve ~20% of context for the model's response
✓ DO: Use RAG for documents longer than the context window
✗ DON'T: Assume the model can perfectly recall information from the middle of long contexts
✗ DON'T: Stuff the entire context window without leaving room for the response
✗ DON'T: Forget that system messages and conversation history count toward the limit
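
Two of these guidelines (reserving room for the response and counting the system message and history toward the limit) can be combined in a small sketch. It assumes a crude token count of one token per whitespace-separated word, which a real tokenizer would replace; `count_tokens` and `fit_history` are hypothetical names.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in: one token per whitespace-separated word."""
    return len(text.split())


def fit_history(system: str, history: list[str], context_window: int,
                response_percent: int = 20) -> list[str]:
    """Drop the oldest messages until the prompt fits the input budget."""
    reserved = context_window * response_percent // 100  # room for the reply
    budget = context_window - reserved - count_tokens(system)
    kept: list[str] = []
    used = 0
    # Walk backwards so the most recent messages are kept first.
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = ["first message here", "second message here", "third and latest message"]
print(fit_history("You are a helpful assistant", history, context_window=15))
```

With a 15-token window, 3 tokens are reserved for the response and 5 go to the system message, so only the two most recent messages fit.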

Exercises

Exercise 1: Context Budgeting

You have a 4K context window. Your system message is 100 tokens. You want the model to respond with at least 500 tokens. How many tokens can your user input be?

Exercise 2: Chunking Strategy

You have a 50,000 token document and a 4K context window. If you use chunks of 1,000 tokens with 100 token overlap, how many chunks will you have? How many total tokens will be in all chunks combined?

Exercise 3: Cost Analysis

If attention scales as O(n²), how much more expensive is processing 8K tokens compared to 4K tokens? What about 32K vs 4K?