Lesson 4: Context Windows

What is a Context Window?

A language model doesn't have infinite memory. It can only "see" a fixed number of tokens at once — this is called the context window (or context length).

        Context Window: The maximum number of tokens the model can process in a single forward pass. 
        This includes both your input (prompt) and the model's output (completion).
      

Visualizing the Context Window

Context Window Breakdown

Total Context: 4,096 tokens

System: You are a helpful assistant...
User: Can you summarize this article? [Article text... ~2000 tokens]
Assistant: Here's a summary... [Generating...]

Used: 3,481 of 4,096 tokens
Remaining: 615 tokens for the model's response
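
The arithmetic behind this breakdown can be sketched in a few lines. This is a minimal illustration only: the function name `remaining_tokens` is hypothetical, and in practice the token counts would come from the model's own tokenizer.

```python
def remaining_tokens(context_limit: int, used_tokens: int) -> int:
    """Return how many tokens are left for the model's response."""
    if used_tokens > context_limit:
        raise ValueError("prompt already exceeds the context window")
    return context_limit - used_tokens


# Numbers from the breakdown above: 3,481 tokens used out of 4,096.
print(remaining_tokens(4096, 3481))  # 615
```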

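When the context window fills up, the model can't see anything beyond it. To make room for new text, applications typically drop or truncate the oldest part of the conversation, which is why very long conversations can "forget" things from the beginning.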
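
One way an application might handle a conversation that outgrows the window is to keep only the most recent messages that fit. The sketch below assumes that approach; `trim_history` and `count_tokens` are hypothetical helpers, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits in max_tokens."""
    kept: list[str] = []
    total = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break  # Everything older than this point is dropped ("forgotten").
        kept.append(message)
        total += cost
    kept.reverse()  # Restore chronological order.
    return kept


history = ["my name is Ada", "I like chess", "what is my name?"]
print(trim_history(history, max_tokens=8))
# ['I like chess', 'what is my name?']
```

With a budget of 8 "tokens", the earliest message no longer fits, so the model would never see the name it is being asked about.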

Context Window Sizes Across Models

Different models have different context window sizes. This has been one of the key areas of improvement:

Context Window Evolution

Model	Context Window
GPT-3	2K tokens
GPT-4	8K or 32K tokens
Claude 2	100K tokens
Claude 3	200K tokens
Gemini 1.5	1M tokens
Llama 2	4K tokens

What Can You Fit?

Context Size	Roughly Equivalent To
4,096 tokens	~3 pages of text, or a long essay
32,768 tokens	~24 pages, or a short paper
100,000 tokens	~75 pages, or a novella
1,000,000 tokens	~750 pages, or a long novel
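
The page counts in this table are consistent with two rules of thumb: roughly 0.75 English words per token and roughly 1,000 words per page. Both numbers are assumptions for illustration (real ratios vary by tokenizer, language, and page layout), and `estimate_pages` is a hypothetical helper.

```python
WORDS_PER_TOKEN = 0.75   # assumed rule of thumb for English text
WORDS_PER_PAGE = 1000    # assumed page density; adjust for your layout


def estimate_pages(num_tokens: int) -> float:
    """Convert a token count to an approximate page count."""
    return num_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE


for tokens in (4_096, 32_768, 100_000, 1_000_000):
    print(f"{tokens:>9,} tokens ~ {estimate_pages(tokens):.1f} pages")
```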

Why Context Windows Are Limited

You might wonder: why not just make context windows infinite? There are several challenges:

1. Computational Cost (The Attention Problem)

Remember the attention mechanism from Level 3? It computes relationships between every pair of tokens. This means:

Attention computation: O(n²)
For n = 4,000 tokens: 16 million operations
For n = 1,000,000 tokens: 1 trillion operations!

The attention computation grows quadratically with sequence length. This is why very long contexts are computationally expensive.
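
The quadratic growth is easy to check numerically. The sketch below counts one "operation" per token pair, which is a simplification of the real cost; `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(n: int) -> int:
    """Number of query-key pairs scored by full self-attention over n tokens."""
    return n * n


base = attention_pairs(1_000)
for n in (1_000, 2_000, 10_000):
    ratio = attention_pairs(n) / base
    print(f"n = {n:>6,}: {attention_pairs(n):>12,} pairs ({ratio:.0f}x the cost of n = 1,000)")
```

Doubling the sequence length quadruples the pair count, and a 10x longer sequence costs 100x as much.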

2. Memory Requirements

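Storing attention weights for long sequences requires a lot of memory. As a rough estimate, a single n × n attention matrix stored as 32-bit floats (4 bytes per value), for one attention head in one layer, takes: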

4K tokens: ~64 MB of attention weights
100K tokens: ~40 GB of attention weights
1M tokens: ~4 TB of attention weights!
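
These figures can be reproduced under a simple set of assumptions: one dense n × n matrix, 4 bytes per value (32-bit floats), and a single attention head in a single layer. The helper name `attention_matrix_bytes` is hypothetical, and a real model's footprint depends on its number of heads and layers, its precision, and its implementation.

```python
def attention_matrix_bytes(n: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one dense n x n attention matrix (one head, one layer)."""
    return n * n * bytes_per_value


for n in (4_000, 100_000, 1_000_000):
    gigabytes = attention_matrix_bytes(n) / 1e9
    print(f"{n:>9,} tokens: {gigabytes:,.3f} GB")
```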

3. The "Lost in the Middle" Problem

Research shows that even with large context windows, models perform worse on information in the middle of long documents. They tend to focus on the beginning and end.

        Key Finding: A large context window does not guarantee that every token is used equally well. 
        In a 100K token context, information placed in the middle is often recalled less reliably than information near the beginning or end.
      

Strategies for Long Documents

What do you do when your document is longer than the context window? Here are common strategies:

1. Chunking

Split the document into smaller chunks and process each separately:

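# Chunking strategy
def chunk_document(text, chunk_size=1000, overlap=100):
    """Split text into overlapping chunks.

    Sizes are measured in characters when text is a string;
    pass a list of tokens instead to chunk by tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid a trailing chunk that is pure overlap
        start = end - overlap  # Step back so neighboring chunks share `overlap` items

    return chunks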
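
When planning a chunking job it helps to know in advance how many chunks to expect. The sketch below is a self-contained variant that works on a list of tokens and checks the result against a closed-form count. The names `chunk_tokens` and `expected_chunk_count` are hypothetical, and the formula assumes the final chunk may be shorter than `chunk_size`.

```python
import math


def chunk_tokens(tokens: list, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split a token list into chunks where neighbors share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The final chunk already reaches the end of the document.
    return chunks


def expected_chunk_count(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Closed-form chunk count for the same scheme."""
    if n_tokens <= chunk_size:
        return 1 if n_tokens > 0 else 0
    return math.ceil((n_tokens - overlap) / (chunk_size - overlap))


tokens = list(range(25))  # stand-in for 25 token IDs
chunks = chunk_tokens(tokens, chunk_size=10, overlap=2)
print(len(chunks), expected_chunk_count(25, 10, 2))  # 3 3
print([(c[0], c[-1]) for c in chunks])  # [(0, 9), (8, 17), (16, 24)]
```

Each chunk after the first contributes only `chunk_size - overlap` new tokens, which is why the overlap increases both the number of chunks and the total tokens processed.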
      

2. Retrieval-Augmented Generation (RAG)

Instead of feeding the entire document, retrieve only the relevant parts:

Split documents into chunks and store in a vector database
When user asks a question, find the most relevant chunks
Only include those chunks in the context

        RAG is a widely used approach for working with large document collections. 
        It combines the broad knowledge of LLMs with targeted retrieval of relevant information.
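
The retrieval steps above can be sketched without any external services. This toy version replaces the vector database with an in-memory list and replaces learned embeddings with bag-of-words counts; both substitutions are assumptions made to keep the example self-contained, and the names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use learned embeddings)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "the context window limits how many tokens a model can see",
    "bananas are rich in potassium",
    "attention cost grows quadratically with sequence length",
]
question = "how many tokens can a model see"
# Only the retrieved chunk goes into the prompt, not the whole collection.
context = "\n".join(retrieve(question, chunks, top_k=1))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)
```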
      

3. Hierarchical Summarization

For very long documents, use multiple passes:

Summarize each section independently
Combine section summaries
Summarize the combined summary if needed
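
The three passes above can be sketched as a small function that accepts any summarizer. Here `truncate_summarizer` is a placeholder that simply keeps the first few words; in a real pipeline it would be replaced by a call to a language model. All names and the word-count threshold are hypothetical.

```python
from typing import Callable


def truncate_summarizer(text: str, max_words: int = 8) -> str:
    """Placeholder 'summarizer' that keeps the first few words.

    In practice this would be a call to a language model.
    """
    return " ".join(text.split()[:max_words])


def hierarchical_summary(
    sections: list[str],
    summarize: Callable[[str], str],
    max_words: int = 12,
) -> str:
    """Summarize each section, combine, and re-summarize if still too long."""
    section_summaries = [summarize(section) for section in sections]  # Pass 1
    combined = " ".join(section_summaries)                            # Pass 2
    if len(combined.split()) > max_words:                             # Pass 3
        combined = summarize(combined)
    return combined


sections = [
    "one two three four five six seven eight nine ten",
    "alpha beta gamma",
    "red green blue",
]
print(hierarchical_summary(sections, truncate_summarizer))
```

Because each pass only ever sees one section or one set of summaries, no single call needs the whole document in its context window.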

4. Sliding Window Attention

Some models use modified attention mechanisms that don't attend to all tokens:

Sliding window: Only attend to nearby tokens
Sparse attention: Attend to specific patterns (e.g., every 100th token)
Linear attention: Approximate full attention more efficiently
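
To make the sliding-window idea concrete, the sketch below builds the attention mask for a tiny sequence and counts how many token pairs remain. It assumes a symmetric window (each token sees `window` neighbors on each side); causal models would only look backward. The function names are hypothetical.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def count_attended_pairs(mask: list[list[bool]]) -> int:
    """Total number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


n, window = 8, 1
mask = sliding_window_mask(n, window)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(count_attended_pairs(mask), "of", n * n, "pairs")  # 22 of 64 pairs
```

The allowed pairs form a band along the diagonal, so the cost grows roughly in proportion to n × window rather than n².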

Context Window Best Practices

Practical Guidelines

✓ DO: Put the most important instructions at the end of the prompt (just before the response)

✓ DO: Reserve ~20% of context for the model's response

✓ DO: Use RAG for documents longer than the context window

✗ DON'T: Assume the model can perfectly recall information from the middle of long contexts

✗ DON'T: Stuff the entire context window without leaving room for the response

✗ DON'T: Forget that system messages and conversation history count toward the limit
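
Several of these guidelines can be combined into one prompt-assembly routine. The sketch below reserves a share of the window for the response, counts the system message and instructions against the budget, and places the instructions last. It is an illustration under stated assumptions: `build_prompt` is a hypothetical helper, and word counts stand in for real token counts.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def build_prompt(
    system: str,
    documents: list[str],
    instructions: str,
    context_limit: int,
    response_percent: int = 20,
) -> str:
    """Assemble a prompt that leaves room for the response.

    Documents are added in order until the input budget is used up;
    the instructions always go last, just before the response.
    """
    input_budget = context_limit * (100 - response_percent) // 100
    # System message and instructions count toward the limit too.
    used = count_tokens(system) + count_tokens(instructions)
    if used > input_budget:
        raise ValueError("system message and instructions exceed the input budget")

    included = []
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > input_budget:
            break  # Leave out what does not fit rather than overflow the window.
        included.append(doc)
        used += cost

    return "\n\n".join([system, *included, instructions])


prompt = build_prompt(
    system="You are a helpful assistant.",
    documents=["short note " * 5, "long report " * 50],
    instructions="Summarize the documents above.",
    context_limit=50,
)
print(prompt)
```

In this example the second document is left out because including it would eat into the space reserved for the response; a RAG step would normally decide which documents are worth including.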

Exercises

Exercise 1: Context Budgeting

You have a 4K (4,096-token) context window. Your system message is 100 tokens. You want the model to respond with at least 500 tokens. How many tokens can your user input be?

Exercise 2: Chunking Strategy

You have a 50,000 token document and a 4K context window. If you use chunks of 1,000 tokens with 100 token overlap, how many chunks will you have? How many total tokens will be in all chunks combined?

Exercise 3: Cost Analysis

If attention scales as O(n²), how much more expensive is processing 8K tokens compared to 4K tokens? What about 32K vs 4K?