PPO vs GRPO: Reinforcement Learning Training Objectives
1. Overview
Two popular reinforcement learning training algorithms are:
- PPO (Proximal Policy Optimization)
- GRPO (Group Relative Policy Optimization)
Both optimize a policy by maximizing expected reward, but they differ primarily in how the advantage (baseline) is computed.
2. Revisiting the RL Training Objective
The general RL objective aims to maximize the product of two terms:
- A probability ratio term
  - The ratio between the current model and a frozen reference model
- A reward-related term
  - The advantage, which is the reward centered by a baseline
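A rough sketch of this objective in symbols (the notation q for the input, o for the output, and b for the baseline is assumed here, not taken from the source notes):

```latex
% pi_theta: current model being trained, pi_ref: frozen reference model
% A-hat: advantage, r: reward, b: baseline
\[
J(\theta) \;=\; \mathbb{E}\!\left[\, \frac{\pi_\theta(o \mid q)}{\pi_{\mathrm{ref}}(o \mid q)} \;\hat{A} \,\right],
\qquad \hat{A} \;=\; r - b
\]
```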
Key question:
How do we compute the baseline used to calculate the advantage?
3. PPO (Proximal Policy Optimization)
3.1 Baseline Estimation in PPO
- PPO uses a separate baseline (value) model
- This model predicts the expected reward per token
- It is trained via supervised fine-tuning, using actual rewards as targets
- Often implemented as:
  - The same LLM with an additional value head (see the sketch after this list)
- This allows:
  - Advantage computation per token
  - Different tokens to receive different credit or blame
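A minimal PyTorch-style sketch of the value-head idea (all class and variable names are assumptions for illustration, not from the source):

```python
import torch
import torch.nn as nn

class PolicyWithValueHead(nn.Module):
    """Shared LLM trunk with a language-model head and an extra scalar value head."""

    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                            # shared transformer trunk
        self.lm_head = nn.Linear(hidden_size, vocab_size)   # next-token logits
        self.value_head = nn.Linear(hidden_size, 1)         # expected reward per token

    def forward(self, input_ids: torch.Tensor):
        # Assumes the backbone maps token ids to hidden states of shape (batch, seq_len, hidden_size).
        hidden = self.backbone(input_ids)
        logits = self.lm_head(hidden)                        # policy output
        values = self.value_head(hidden).squeeze(-1)         # one baseline value per token
        return logits, values
```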
3.2 Generalized Advantage Estimation (GAE)
- PPO commonly uses GAE (Generalized Advantage Estimation)
- GAE:
  - Propagates final rewards backward to earlier tokens
  - Uses a discount factor (hyperparameter)
  - Smooths reward signals across the sequence
- The exact math is involved but not essential for intuition; a rough sketch is given below.
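A rough sketch of the standard GAE recursion (the hyperparameters `gamma` and `lam` and the per-token tensors are assumptions, not from the source):

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Propagate rewards backward through the sequence, smoothed by gamma and lam."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        last_adv = delta + gamma * lam * last_adv              # exponentially weighted backup
        advantages[t] = last_adv
    return advantages
```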
3.3 PPO Clipping Mechanism
Problem:
- Probability ratios can become very large → unstable updates

Solution:
- PPO clips the probability ratio
- Prevents excessively large parameter updates
- Ensures small, stable training steps
This clipping mechanism is the core idea of PPO.
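A minimal sketch of the clipped term (the value `epsilon = 0.2` and the log-probability inputs are assumptions):

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, epsilon: float = 0.2) -> torch.Tensor:
    # Probability ratio between the model being trained and the frozen/old policy
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Pessimistic minimum of the two terms; negated so gradient descent maximizes reward
    return -torch.min(unclipped, clipped).mean()
```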
3.4 PPO Training Setup
PPO requires:
- Main LLM being trained
- Frozen reference model
- Baseline (value) estimation model

Downsides:
- Computationally expensive
- Complex to manage multiple models
4. Motivation for GRPO
To reduce complexity and cost, newer algorithms were developed.
GRPO (Group Relative Policy Optimization) was introduced by DeepSeek with a key goal:
Eliminate the separate baseline estimation model.
5. GRPO (Group Relative Policy Optimization)
5.1 Core Idea
Instead of:
- Training a model to predict expected rewards

GRPO:
- Generates a group of multiple outputs for the same input
- Computes the average reward of the group
- Uses this average as the baseline
This baseline is computed on the fly.
5.2 Advantage Calculation in GRPO
Steps:
- Sample multiple outputs for the same input
- Compute rewards for each output
- Compute:
  - Mean reward
  - Standard deviation
- Normalize rewards (a minimal sketch follows this list):
  - Subtract the mean
  - Divide by the standard deviation

Result:
- Advantages are centered around zero
- High-performing outputs get positive advantage
- Low-performing outputs get negative advantage
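A minimal sketch of the normalization step (the small `eps` term for numerical safety is an assumption):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: one scalar reward per sampled output in the group."""
    mean = rewards.mean()                    # group-average reward, used as the baseline
    std = rewards.std()                      # spread of rewards within the group
    return (rewards - mean) / (std + eps)    # centered at zero, scaled by the group spread
```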
5.3 Sequence-Level Advantages
Key difference from PPO:
- PPO computes token-level advantages
- GRPO computes sequence-level advantages
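A tiny illustration of the difference (all numbers here are made up):

```python
seq_len = 5
# PPO: a different advantage for every generated token (from the value model + GAE)
ppo_advantages = [0.2, -0.1, 0.5, 0.0, 0.3]
# GRPO: one advantage per output, applied to every token of that output
grpo_advantage = 0.4
grpo_per_token = [grpo_advantage] * seq_len   # [0.4, 0.4, 0.4, 0.4, 0.4]
```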
6. Concrete Example (GRPO)
- Sample 4 outputs for one input
- Each output receives a reward (e.g., via unit tests or graders)
- Baseline = average reward
- Advantage = reward − baseline

Effects:
- Outputs above average → probability increased
- Outputs below average → probability decreased
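A worked version of this example with assumed reward values:

```python
rewards = [1.0, 0.0, 1.0, 0.5]                 # e.g. unit-test scores for 4 sampled outputs
baseline = sum(rewards) / len(rewards)         # average reward = 0.625
advantages = [r - baseline for r in rewards]   # [0.375, -0.625, 0.375, -0.125]
# Outputs with positive advantage are made more likely, the others less likely.
```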
7. GRPO Training Objective
GRPO:
- Retains PPO’s clipping mechanism
- Changes only the advantage calculation

Training loop:
- Generate a group of outputs per input
- Compute rewards
- Compute the average reward (baseline)
- Calculate advantages
- Update the model using the clipped objective
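A high-level sketch of this loop, reusing the `grpo_advantages` and `ppo_clipped_loss` sketches above; `sample_group`, `reward_fn`, and `logprobs` are hypothetical placeholders, not real APIs:

```python
import torch

def grpo_step(model, ref_model, prompt, group_size: int = 4):
    # Placeholder helpers (sample_group, reward_fn, logprobs) are assumed for illustration.
    outputs = sample_group(model, prompt, group_size)           # 1. generate a group of outputs
    rewards = torch.tensor([reward_fn(o) for o in outputs])     # 2. score each output
    advantages = grpo_advantages(rewards)                        # 3-4. group baseline + advantages (see 5.2)
    logp_new = logprobs(model, prompt, outputs)                  # per-output log-probabilities
    logp_old = logprobs(ref_model, prompt, outputs).detach()     # frozen reference model
    loss = ppo_clipped_loss(logp_new, logp_old, advantages)      # 5. clipped update (see 3.3)
    loss.backward()
    return loss
```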
8. Comparison: PPO vs GRPO
Similarities
- Both maximize expected reward
- Both use probability ratio clipping
- Both operate in the same RL training loop
Key Differences
| Aspect | PPO | GRPO |
|---|---|---|
| Baseline model | Separate value model | None (group-average reward used instead) |
| Advantage | Token-level | Sequence-level |
| Resource cost | High | Lower |
| Complexity | High | Simpler |
9. Historical Context
Evolution of methods:
- RLHF with PPO (used before ChatGPT launch)
- RL-AIF (AI feedback instead of humans)
- GRPO with verifiers and reward models
10. Final Notes
- PPO and GRPO differ only in how the advantage is computed
- GRPO significantly reduces model count and instability
- Both plug into the same RL training loop