🚧 Lesson 3 of 35 in Level 04
Level 04 • Lesson 3

Fine-tuning

How to adapt pre-trained models to specific tasks. Transfer learning.

What is Fine-tuning?

Take a pre-trained model and continue training on a specific task:

# Load pre-trained model model = GPT2LMHeadModel.from_pretrained("gpt2") # Continue training on your data trainer = Trainer(model=model, train_dataset=my_data) trainer.train()

Types of Fine-tuning

Why Fine-tuning Works

Pre-trained models already know language structure, grammar, and facts. Fine-tuning just adapts this knowledge to your specific task or format.

📝 Practice Exercises

Exercise 1: Implement LoRA Fine-tuning

Use the PEFT library to fine-tune a model with LoRA. Your task:

  • Load GPT-2 from Hugging Face
  • Configure LoRA with r=8, alpha=32, targeting query/value layers
  • Set up training for a text classification task
from peft import LoraConfig, get_peft_model from transformers import GPT2ForSequenceClassification # Your solution here: model = GPT2ForSequenceClassification.from_pretrained("gpt2") # 1. Create LoRA config lora_config = LoraConfig( r=8, lora_alpha=32, target_modules=["c_attn"], # GPT-2's attention layers lora_dropout=0.1, bias="none", task_type="SEQ_CLS" ) # 2. Apply LoRA to model model = get_peft_model(model, lora_config) # 3. Verify trainable parameters model.print_trainable_parameters() # Output: trainable params: 294,912 || all params: 124,709,376 || trainable%: 0.2365

Key insight: LoRA reduces trainable parameters to ~0.2% while maintaining performance!

Exercise 2: Layer Freezing Strategy

Implement layer-wise freezing for efficient fine-tuning:

from transformers import GPT2LMHeadModel model = GPT2LMHeadModel.from_pretrained("gpt2") # Task: Freeze all layers except the last 2 transformer blocks # GPT-2 has 12 layers (0-11). Unfreeze only layers 10 and 11. # Freeze all parameters first for param in model.parameters(): param.requires_grad = False # Unfreeze the last 2 transformer layers for i in [10, 11]: for param in model.transformer.h[i].parameters(): param.requires_grad = True # Unfreeze the language model head for param in model.lm_head.parameters(): param.requires_grad = True # Verify: Count trainable vs frozen trainable = sum(p.numel() for p in model.parameters() if p.requires_grad) total = sum(p.numel() for p in model.parameters()) print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")

Expected output: Trainable: ~20M / 124M parameters (16%)

🧠 Knowledge Check

Test your understanding of fine-tuning concepts:

Question 1

What is the main advantage of using LoRA (Low-Rank Adaptation) for fine-tuning?

  • A) It trains all model parameters for maximum accuracy
  • B) It reduces trainable parameters to ~0.2% while maintaining performance
  • C) It requires no pre-trained model
  • D) It only works with GPT-2

Answer: B — LoRA adds small trainable adapter matrices instead of updating all weights, making fine-tuning efficient and fast.

Question 2

In layer freezing, why would you freeze early transformer layers and only train the top layers?

  • A) Early layers contain low-level features (syntax, grammar) that transfer well
  • B) Early layers are always corrupted
  • C) It's required by the PEFT library
  • D) Top layers learn nothing during pre-training

Answer: A — Early layers capture general language patterns that are useful across tasks, while top layers capture task-specific patterns that need adaptation.

Question 3

What does 'transfer learning' mean in the context of fine-tuning LLMs?

  • A) Transferring model weights between different cloud providers
  • B) Transferring knowledge learned from one task to a different but related task
  • C) Copying the training dataset to another location
  • D) Moving the model from GPU to CPU

Answer: B — Transfer learning leverages knowledge gained during pre-training (general language understanding) and applies it to a specific downstream task through fine-tuning.

Question 4

Which fine-tuning method would be most appropriate if you have limited GPU memory (8GB) but need to adapt a 7B parameter model?

  • A) Full fine-tuning of all parameters
  • B) LoRA or QLoRA
  • C) Training from scratch
  • D) No fine-tuning, use prompt engineering only

Answer: B — LoRA/QLoRA dramatically reduces memory requirements by training only small adapter layers, making it feasible to fine-tune large models on consumer GPUs.