Summary of “Post-training of LLMs” (Banghua Zhu, DeepLearning.AI)
The training of large language models (LLMs) happens in two phases:
1. Pre-training
- The model learns to predict the next token using massive datasets (trillions of tokens); a short sketch of this objective appears after this overview.
- This is the most expensive and time-consuming phase; it can take months for large models.
2. Post-training
- The model is further trained to perform specific tasks (e.g., answering questions).
- Uses much smaller datasets, making it faster and cheaper.
- This course focuses on three widely used post-training techniques, and you'll apply them to small Qwen models.
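A short sketch of the pre-training objective mentioned above, assuming the Hugging Face Transformers library; the Qwen checkpoint and the example sentence are illustrative assumptions, and a real pre-training run optimizes this loss over trillions of tokens rather than one sentence.

```python
# Sketch of the pre-training objective: next-token prediction scored with cross-entropy.
# The checkpoint name and example text are assumptions, not the course's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")   # small base Qwen model (assumed)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

batch = tokenizer(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")

# Passing the input ids as labels makes the model shift them by one position internally,
# so the loss measures how well each token predicts the token that follows it.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)   # average next-token cross-entropy; pre-training minimizes this at scale
```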
Three Post-training Techniques Covered
1. Supervised Fine-Tuning (SFT)
- The model learns from labeled prompt → response examples.
- Effective for teaching new behaviors or making significant changes.
- Course exercise: fine-tune a small Qwen model for instruction following (a minimal SFT sketch follows this list).
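A minimal SFT sketch, assuming the Hugging Face TRL library rather than the course's own training code; the checkpoint name, the toy prompt → response pairs, and the SFTConfig arguments are assumptions and may differ across TRL versions.

```python
# Supervised fine-tuning sketch with TRL: train on labeled prompt -> response pairs.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy dataset of prompt -> response pairs (assumed format; real SFT data is much larger).
train_dataset = Dataset.from_list([
    {"prompt": "Question: What is the capital of France?\nAnswer:",
     "completion": " Paris is the capital of France."},
    {"prompt": "Question: What is 2 + 2?\nAnswer:",
     "completion": " 2 + 2 = 4."},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",                       # small base Qwen model (assumed checkpoint)
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="qwen-sft-demo", num_train_epochs=1),
)
trainer.train()   # standard token-level cross-entropy on the prompt -> response data
```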
2. Direct Preference Optimization (DPO)
- The model is shown two responses to the same prompt: one good, one bad.
- The loss function moves the model toward the preferred response and away from the bad one (a sketch of the loss appears after this list).
- Example: changing the model’s identity string from “I’m your assistant” to “I’m your AI assistant.”
- Course exercise: apply DPO to modify a Qwen instruct model’s identity.
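The preference loss itself can be written in a few lines. Below is a plain-PyTorch sketch of the DPO objective from Rafailov et al. (2023), which libraries such as TRL implement; it assumes the summed log-probabilities of each response have already been computed, and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss sketch. Each input is the summed log-probability of a full response
    under either the model being trained (policy) or a frozen reference model."""
    # How strongly each model prefers the chosen response over the rejected one.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Minimizing this pushes the policy toward the chosen response and away from the
    # rejected one, relative to the reference model; beta controls how strongly.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```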
3. Online Reinforcement Learning (RL)
- The model generates responses to prompts.
- A reward function scores them.
- The model updates itself based on the reward.
- Rewards can come from:
  - Human judgments → train a reward model using human-labeled quality scores.
  - Verifiable rewards for tasks with objective correctness, like math or coding: use math checkers or unit tests (see the reward-function sketch after this section).
- Algorithms:
  - PPO (Proximal Policy Optimization) – widely used.
  - GRPO (Group Relative Policy Optimization) – from DeepSeek; powerful for verifiable tasks.
- Course exercise: use GRPO to train a Qwen model on math problems (a minimal sketch follows).
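A minimal sketch of this exercise, assuming TRL's GRPOTrainer: a verifiable reward checks whether the model's final number matches the ground-truth answer, and GRPO updates the model toward higher-reward responses. The checkpoint, the toy dataset, the reward heuristic, and the config values are assumptions, and the GRPOTrainer API may differ across TRL versions.

```python
# Online RL sketch: GRPO with a verifiable math reward (all names and values are illustrative).
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy math prompts; the ground-truth answer is stored as an extra dataset column.
train_dataset = Dataset.from_list([
    {"prompt": "What is 12 * 7? Answer with a number.", "answer": "84"},
    {"prompt": "What is 15 + 27? Answer with a number.", "answer": "42"},
])

def math_reward(completions, answer, **kwargs):
    """Verifiable reward: 1.0 if the last number in the completion matches the
    ground-truth answer, else 0.0. A unit-test reward for code works the same way."""
    rewards = []
    for completion, target in zip(completions, answer):
        numbers = re.findall(r"-?\d+\.?\d*", completion)
        rewards.append(1.0 if numbers and numbers[-1] == target else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",       # small Qwen instruct model (assumed checkpoint)
    reward_funcs=math_reward,
    train_dataset=train_dataset,
    args=GRPOConfig(output_dir="qwen-grpo-demo", num_generations=4),
)
trainer.train()   # sample a group of responses per prompt, score them, update toward higher reward
```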