PPO vs GRPO: Reinforcement Learning Training Objectives
1. Overview
Two popular reinforcement learning training algorithms are:
- PPO (Proximal Policy Optimization)
- GRPO (Group Relative Policy Optimization)
Both optimize a policy by maximizing expected reward, but they differ primarily in how the advantage (baseline) is computed.
2. Revisiting the RL Training Objective
The general RL objective aims to maximize the product of two terms:
- A probability ratio term
  - The ratio between the current model and a frozen reference model
- A reward-related term
  - The advantage, which is the reward centered by a baseline
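A rough sketch of this objective in symbols (the notation q for the input, o for the output, and b for the baseline is assumed here, not taken from the source notes):

```latex
% pi_theta: current model being trained, pi_ref: frozen reference model
% A-hat: advantage, r: reward, b: baseline
\[
J(\theta) \;=\; \mathbb{E}\!\left[\, \frac{\pi_\theta(o \mid q)}{\pi_{\mathrm{ref}}(o \mid q)} \;\hat{A} \,\right],
\qquad \hat{A} \;=\; r - b
\]
```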
Key question:
How do we compute the baseline used to calculate the advantage?
3. PPO (Proximal Policy Optimization)
3.1 Baseline Estimation in PPO
- PPO uses a separate baseline (value) model
- This model predicts the expected reward per token
- It is trained via supervised fine-tuning, using actual rewards as targets
- Often implemented as:
  - The same LLM with an additional value head (see the sketch after this list)
- This allows:
  - Advantage computation per token
  - Different tokens to receive different credit or blame
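A minimal PyTorch-style sketch of the value-head idea (all class and variable names are assumptions for illustration, not from the source):

```python
import torch
import torch.nn as nn

class PolicyWithValueHead(nn.Module):
    """Shared LLM trunk with a language-model head and an extra scalar value head."""

    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                            # shared transformer trunk
        self.lm_head = nn.Linear(hidden_size, vocab_size)   # next-token logits
        self.value_head = nn.Linear(hidden_size, 1)         # expected reward per token

    def forward(self, input_ids: torch.Tensor):
        # Assumes the backbone maps token ids to hidden states of shape (batch, seq_len, hidden_size).
        hidden = self.backbone(input_ids)
        logits = self.lm_head(hidden)                        # policy output
        values = self.value_head(hidden).squeeze(-1)         # one baseline value per token
        return logits, values
```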
3.2 Generalized Advantage Estimation (GAE)
- PPO commonly uses GAE (Generalized Advantage Estimation)
- GAE:
  - Propagates final rewards backward to earlier tokens
  - Uses a discount factor (hyperparameter)
  - Smooths reward signals across the sequence
- The exact math is involved but not essential for intuition; a rough sketch is given below.
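A rough sketch of the standard GAE recursion (the hyperparameters `gamma` and `lam` and the per-token tensors are assumptions, not from the source):

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Propagate rewards backward through the sequence, smoothed by gamma and lam."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        last_adv = delta + gamma * lam * last_adv              # exponentially weighted backup
        advantages[t] = last_adv
    return advantages
```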
3.3 PPO Clipping Mechanism
Problem:
- Probability ratios can become very large → unstable updates

Solution:
- PPO clips the probability ratio
- Prevents excessively large parameter updates
- Ensures small, stable training steps
This clipping mechanism is the core idea of PPO.
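A minimal sketch of the clipped term (the value `epsilon = 0.2` and the log-probability inputs are assumptions):

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, epsilon: float = 0.2) -> torch.Tensor:
    # Probability ratio between the model being trained and the frozen/old policy
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Pessimistic minimum of the two terms; negated so gradient descent maximizes reward
    return -torch.min(unclipped, clipped).mean()
```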
3.4 PPO Training Setup
PPO requires:
- Main LLM being trained
- Frozen reference model
- Baseline (value) estimation model

Downsides:
- Computationally expensive
- Complex to manage multiple models
4. Motivation for GRPO
To reduce complexity and cost, newer algorithms were developed.
GRPO (Group Relative Policy Optimization) was introduced by DeepSeek with a key goal:
Eliminate the separate baseline estimation model.
5. GRPO (Group Relative Policy Optimization)
5.1 Core Idea
Instead of:
- Training a model to predict expected rewards

GRPO:
- Generates a group of multiple outputs for the same input
- Computes the average reward of the group
- Uses this average as the baseline
This baseline is computed on the fly.
5.2 Advantage Calculation in GRPO
Steps:
- Sample multiple outputs for the same input
- Compute rewards for each output
- Compute:
  - Mean reward
  - Standard deviation
- Normalize rewards (a minimal sketch follows this list):
  - Subtract the mean
  - Divide by the standard deviation

Result:
- Advantages are centered around zero
- High-performing outputs get positive advantage
- Low-performing outputs get negative advantage
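A minimal sketch of the normalization step (the small `eps` term for numerical safety is an assumption):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: one scalar reward per sampled output in the group."""
    mean = rewards.mean()                    # group-average reward, used as the baseline
    std = rewards.std()                      # spread of rewards within the group
    return (rewards - mean) / (std + eps)    # centered at zero, scaled by the group spread
```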
5.3 Sequence-Level Advantages
Key difference from PPO:
- PPO computes token-level advantages
- GRPO computes sequence-level advantages
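A tiny illustration of the difference (all numbers here are made up):

```python
seq_len = 5
# PPO: a different advantage for every generated token (from the value model + GAE)
ppo_advantages = [0.2, -0.1, 0.5, 0.0, 0.3]
# GRPO: one advantage per output, applied to every token of that output
grpo_advantage = 0.4
grpo_per_token = [grpo_advantage] * seq_len   # [0.4, 0.4, 0.4, 0.4, 0.4]
```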
6. Concrete Example (GRPO)
- Sample 4 outputs for one input
- Each output receives a reward (e.g., via unit tests or graders)
- Baseline = average reward
- Advantage = reward − baseline

Effects:
- Outputs above average → probability increased
- Outputs below average → probability decreased
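A worked version of this example with assumed reward values:

```python
rewards = [1.0, 0.0, 1.0, 0.5]                 # e.g. unit-test scores for 4 sampled outputs
baseline = sum(rewards) / len(rewards)         # average reward = 0.625
advantages = [r - baseline for r in rewards]   # [0.375, -0.625, 0.375, -0.125]
# Outputs with positive advantage are made more likely, the others less likely.
```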
7. GRPO Training Objective
GRPO:
- Retains PPO’s clipping mechanism
- Changes only the advantage calculation

Training loop:
- Generate a group of outputs per input
- Compute rewards
- Compute the average reward (baseline)
- Calculate advantages
- Update the model using the clipped objective
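A high-level sketch of this loop, reusing the `grpo_advantages` and `ppo_clipped_loss` sketches above; `sample_group`, `reward_fn`, and `logprobs` are hypothetical placeholders, not real APIs:

```python
import torch

def grpo_step(model, ref_model, prompt, group_size: int = 4):
    # Placeholder helpers (sample_group, reward_fn, logprobs) are assumed for illustration.
    outputs = sample_group(model, prompt, group_size)           # 1. generate a group of outputs
    rewards = torch.tensor([reward_fn(o) for o in outputs])     # 2. score each output
    advantages = grpo_advantages(rewards)                        # 3-4. group baseline + advantages (see 5.2)
    logp_new = logprobs(model, prompt, outputs)                  # per-output log-probabilities
    logp_old = logprobs(ref_model, prompt, outputs).detach()     # frozen reference model
    loss = ppo_clipped_loss(logp_new, logp_old, advantages)      # 5. clipped update (see 3.3)
    loss.backward()
    return loss
```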
8. Comparison: PPO vs GRPO
Similarities
- Both maximize expected reward
- Both use probability ratio clipping
- Both operate in the same RL training loop
Key Differences
| Aspect | PPO | GRPO |
|---|---|---|
| Baseline model | Separate value model | None (group-average reward used instead) |
| Advantage | Token-level | Sequence-level |
| Resource cost | High | Lower |
| Complexity | High | Simpler |
9. Historical Context
Evolution of methods:
- RLHF with PPO (used before ChatGPT launch)
- RL-AIF (AI feedback instead of humans)
- GRPO with verifiers and reward models
10. Final Notes
- PPO and GRPO differ only in how the advantage is computed
- GRPO significantly reduces model count and instability
- Both plug into the same RL training loop