Post-training of LLMs - Introduction

Summary of “Post-training of LLMs” (Banghua Zhu, DeepLearning.AI)

The training of large language models (LLMs) happens in two phases:

1. Pre-training

  • The model learns to predict the next token using massive datasets (trillions of tokens); the objective is sketched below.

  • This is the most expensive and time-consuming phase; it can take months for large models.
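
A minimal sketch of that next-token objective, assuming PyTorch; the function name and tensor shapes are illustrative, not from the course:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of predicting each token from the tokens before it.

    logits:    (batch, seq_len, vocab_size) model outputs
    token_ids: (batch, seq_len) input token ids
    """
    # Position t predicts token t + 1: shift logits left, labels right.
    shift_logits = logits[:, :-1, :]
    shift_labels = token_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```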

2. Post-training

  • The model is further trained to perform specific tasks (e.g., answering questions).

  • Uses much smaller datasets, making it faster and cheaper.

  • This course focuses on three widely used post-training techniques, and you'll apply them to small Qwen models.


Three Post-training Techniques Covered

1. Supervised Fine-Tuning (SFT)

  • The model learns from labeled prompt → response examples.

  • Effective for teaching the model new behaviors or making large changes to how it responds; a minimal loss sketch follows below.

  • Course exercise: fine-tune a small Qwen model for instruction following.
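
A minimal SFT loss sketch in PyTorch, assuming a Hugging Face-style causal LM (the function and masking scheme are illustrative, not the course's exact code). It is the same next-token objective as in pre-training, except the prompt tokens are masked out so the model only learns the labeled response:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
    """prompt_ids: (1, P) tokenized prompt; response_ids: (1, R) tokenized response."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)  # (1, P + R)
    logits = model(input_ids).logits                          # (1, P + R, vocab)
    # Position t predicts token t + 1: shift logits left, labels right.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask the prompt positions so only response tokens contribute to the loss.
    prompt_len = prompt_ids.size(1)
    shift_labels[:, : prompt_len - 1] = -100  # -100 is ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```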

2. Direct Preference Optimization (DPO)

  • The model is shown two responses to the same prompt: one good, one bad.

  • The loss function moves the model toward the preferred response and away from the bad one (sketched below).

  • Example: Changing the model’s identity string from “I’m your assistant” to “I’m your AI assistant.”

  • Course exercise: apply DPO to modify a Qwen instruct model’s identity.
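
A minimal sketch of that loss in PyTorch, assuming you already have the summed log-probability of each full response under the model being trained and under a frozen reference copy of it (variable names and the beta default are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_good: torch.Tensor,  # log prob of good response under the trained model
    policy_logp_bad: torch.Tensor,   # log prob of bad response under the trained model
    ref_logp_good: torch.Tensor,     # log prob of good response under the frozen reference
    ref_logp_bad: torch.Tensor,      # log prob of bad response under the frozen reference
    beta: float = 0.1,               # strength of the preference signal
) -> torch.Tensor:
    # How much more (or less) the model prefers each response than the reference does.
    good_margin = policy_logp_good - ref_logp_good
    bad_margin = policy_logp_bad - ref_logp_bad
    # Loss shrinks as the good margin grows relative to the bad margin,
    # i.e. the model moves toward the preferred response and away from the other.
    return -F.logsigmoid(beta * (good_margin - bad_margin)).mean()
```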

3. Online Reinforcement Learning (RL)

  • The model generates responses to prompts.

  • A reward function scores them.

  • The model updates its weights based on the reward.

  • Rewards can come from:

    • Human judgments → Train a reward model using human-labeled quality scores.

    • Verifiable rewards for tasks with objective correctness, like math or coding.

      • Use math checkers or unit tests.

  • Algorithms:

    • PPO (Proximal Policy Optimization) – widely used.

    • GRPO (Group Relative Policy Optimization) – from DeepSeek; well suited to tasks with verifiable rewards (a minimal sketch follows this list).

  • Course exercise: use GRPO to train a Qwen model on math problems.
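
To make the GRPO idea concrete, here is a minimal sketch with a verifiable math reward, assuming PyTorch. The helpers sample_responses and extract_answer are hypothetical, and the full algorithm's PPO-style clipping and KL penalty are omitted:

```python
import torch

def math_reward(response: str, correct_answer: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches, else 0.0.

    extract_answer is a hypothetical helper that parses the model's
    final answer out of its response text.
    """
    return 1.0 if extract_answer(response) == correct_answer else 0.0

def grpo_step(model, prompt: str, correct_answer: str, group_size: int = 8):
    # 1. The model generates a group of responses to the same prompt.
    #    sample_responses is a hypothetical helper returning the response texts
    #    and the differentiable summed log-probability of each, shape (group_size,).
    responses, logprobs = sample_responses(model, prompt, n=group_size)

    # 2. The reward function scores each response.
    rewards = torch.tensor([math_reward(r, correct_answer) for r in responses])

    # 3. Group-relative advantage: compare each response to the group average,
    #    so no separately trained value model is needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

    # 4. Policy gradient: raise the probability of above-average responses,
    #    lower it for below-average ones.
    loss = -(advantages * logprobs).mean()
    loss.backward()
    return loss
```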