Lesson 3: Fine-tuning

What is Fine-tuning?

Take a pre-trained model and continue training on a specific task:

# Load pre-trained model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Continue training on your data
trainer = Trainer(model=model, train_dataset=my_data)
trainer.train()
      

Types of Fine-tuning

Full fine-tuning: Update all parameters
LoRA: Train low-rank adapters
Prompt tuning: Learn soft prompts
Layer freezing: Only train top layers

Why Fine-tuning Works

        Pre-trained models already know language structure, grammar, and facts. 
        Fine-tuning just adapts this knowledge to your specific task or format.
      

📝 Practice Exercises

Exercise 1: Implement LoRA Fine-tuning

Use the PEFT library to fine-tune a model with LoRA. Your task:

Load GPT-2 from Hugging Face
Configure LoRA with r=8, alpha=32, targeting query/value layers
Set up training for a text classification task

from peft import LoraConfig, get_peft_model
from transformers import GPT2ForSequenceClassification

# Your solution here:
model = GPT2ForSequenceClassification.from_pretrained("gpt2")

# 1. Create LoRA config
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's attention layers
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)

# 2. Apply LoRA to model
model = get_peft_model(model, lora_config)

# 3. Verify trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 294,912 || all params: 124,709,376 || trainable%: 0.2365
        

Key insight: LoRA reduces trainable parameters to ~0.2% while maintaining performance!

Exercise 2: Layer Freezing Strategy

Implement layer-wise freezing for efficient fine-tuning:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Task: Freeze all layers except the last 2 transformer blocks
# GPT-2 has 12 layers (0-11). Unfreeze only layers 10 and 11.

# Freeze all parameters first
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last 2 transformer layers
for i in [10, 11]:
    for param in model.transformer.h[i].parameters():
        param.requires_grad = True

# Unfreeze the language model head
for param in model.lm_head.parameters():
    param.requires_grad = True

# Verify: Count trainable vs frozen
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
        

Expected output: Trainable: ~20M / 124M parameters (16%)

🧠 Knowledge Check

Test your understanding of fine-tuning concepts:

Question 1

What is the main advantage of using LoRA (Low-Rank Adaptation) for fine-tuning?

A) It trains all model parameters for maximum accuracy
B) It reduces trainable parameters to ~0.2% while maintaining performance
C) It requires no pre-trained model
D) It only works with GPT-2

Answer: B — LoRA adds small trainable adapter matrices instead of updating all weights, making fine-tuning efficient and fast.

Question 2

In layer freezing, why would you freeze early transformer layers and only train the top layers?

A) Early layers contain low-level features (syntax, grammar) that transfer well
B) Early layers are always corrupted
C) It's required by the PEFT library
D) Top layers learn nothing during pre-training

Answer: A — Early layers capture general language patterns that are useful across tasks, while top layers capture task-specific patterns that need adaptation.

Question 3

What does 'transfer learning' mean in the context of fine-tuning LLMs?

A) Transferring model weights between different cloud providers
B) Transferring knowledge learned from one task to a different but related task
C) Copying the training dataset to another location
D) Moving the model from GPU to CPU

Answer: B — Transfer learning leverages knowledge gained during pre-training (general language understanding) and applies it to a specific downstream task through fine-tuning.

Question 4

Which fine-tuning method would be most appropriate if you have limited GPU memory (8GB) but need to adapt a 7B parameter model?

A) Full fine-tuning of all parameters
B) LoRA or QLoRA
C) Training from scratch
D) No fine-tuning, use prompt engineering only

Answer: B — LoRA/QLoRA dramatically reduces memory requirements by training only small adapter layers, making it feasible to fine-tune large models on consumer GPUs.