Summary of “Post-training of LLMs” (Banghua Zhu, DeepLearning.AI)
The training of large language models (LLMs) happens in two phases:
1. Pre-training
- The model learns to predict the next token using massive datasets (trillions of tokens); a short sketch of this objective appears after this overview.
- This is the most expensive and time-consuming phase; it can take months for large models.
2. Post-training
- The model is further trained to perform specific tasks (e.g., answering questions).
- Uses much smaller datasets, making it faster and cheaper.
- This course focuses on three widely used post-training techniques, and you'll apply them to small Qwen models.
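A short sketch of the pre-training objective mentioned above, assuming the Hugging Face Transformers library; the Qwen checkpoint and the example sentence are illustrative assumptions, and a real pre-training run optimizes this loss over trillions of tokens rather than one sentence.

```python
# Sketch of the pre-training objective: next-token prediction scored with cross-entropy.
# The checkpoint name and example text are assumptions, not the course's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")   # small base Qwen model (assumed)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

batch = tokenizer(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")

# Passing the input ids as labels makes the model shift them by one position internally,
# so the loss measures how well each token predicts the token that follows it.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)   # average next-token cross-entropy; pre-training minimizes this at scale
```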
Three Post-training Techniques Covered
1. Supervised Fine-Tuning (SFT)
- The model learns from labeled prompt → response examples.
- Effective for teaching new behaviors or making significant changes.
- Course exercise: fine-tune a small Qwen model for instruction following (a minimal SFT sketch follows this list).
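A minimal SFT sketch, assuming the Hugging Face TRL library rather than the course's own training code; the checkpoint name, the toy prompt → response pairs, and the SFTConfig arguments are assumptions and may differ across TRL versions.

```python
# Supervised fine-tuning sketch with TRL: train on labeled prompt -> response pairs.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy dataset of prompt -> response pairs (assumed format; real SFT data is much larger).
train_dataset = Dataset.from_list([
    {"prompt": "Question: What is the capital of France?\nAnswer:",
     "completion": " Paris is the capital of France."},
    {"prompt": "Question: What is 2 + 2?\nAnswer:",
     "completion": " 2 + 2 = 4."},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",                       # small base Qwen model (assumed checkpoint)
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="qwen-sft-demo", num_train_epochs=1),
)
trainer.train()   # standard token-level cross-entropy on the prompt -> response data
```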
2. Direct Preference Optimization (DPO)
- The model is shown two responses to the same prompt: one good, one bad.
- The loss function moves the model toward the preferred response and away from the bad one (a sketch of the loss appears after this list).
- Example: changing the model’s identity string from “I’m your assistant” to “I’m your AI assistant.”
- Course exercise: apply DPO to modify a Qwen instruct model’s identity.
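The preference loss itself can be written in a few lines. Below is a plain-PyTorch sketch of the DPO objective from Rafailov et al. (2023), which libraries such as TRL implement; it assumes the summed log-probabilities of each response have already been computed, and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss sketch. Each input is the summed log-probability of a full response
    under either the model being trained (policy) or a frozen reference model."""
    # How strongly each model prefers the chosen response over the rejected one.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Minimizing this pushes the policy toward the chosen response and away from the
    # rejected one, relative to the reference model; beta controls how strongly.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```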
3. Online Reinforcement Learning (RL)
- The model generates responses to prompts.
- A reward function scores them.
- The model updates itself based on the reward.
- Rewards can come from:
  - Human judgments → train a reward model using human-labeled quality scores.
  - Verifiable rewards for tasks with objective correctness, like math or coding: use math checkers or unit tests (see the reward-function sketch after this section).
- Algorithms:
  - PPO (Proximal Policy Optimization) – widely used.
  - GRPO (Group Relative Policy Optimization) – from DeepSeek; powerful for verifiable tasks.
- Course exercise: use GRPO to train a Qwen model on math problems (a minimal sketch follows).
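A minimal sketch of this exercise, assuming TRL's GRPOTrainer: a verifiable reward checks whether the model's final number matches the ground-truth answer, and GRPO updates the model toward higher-reward responses. The checkpoint, the toy dataset, the reward heuristic, and the config values are assumptions, and the GRPOTrainer API may differ across TRL versions.

```python
# Online RL sketch: GRPO with a verifiable math reward (all names and values are illustrative).
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy math prompts; the ground-truth answer is stored as an extra dataset column.
train_dataset = Dataset.from_list([
    {"prompt": "What is 12 * 7? Answer with a number.", "answer": "84"},
    {"prompt": "What is 15 + 27? Answer with a number.", "answer": "42"},
])

def math_reward(completions, answer, **kwargs):
    """Verifiable reward: 1.0 if the last number in the completion matches the
    ground-truth answer, else 0.0. A unit-test reward for code works the same way."""
    rewards = []
    for completion, target in zip(completions, answer):
        numbers = re.findall(r"-?\d+\.?\d*", completion)
        rewards.append(1.0 if numbers and numbers[-1] == target else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",       # small Qwen instruct model (assumed checkpoint)
    reward_funcs=math_reward,
    train_dataset=train_dataset,
    args=GRPOConfig(output_dir="qwen-grpo-demo", num_generations=4),
)
trainer.train()   # sample a group of responses per prompt, score them, update toward higher reward
```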