PPO vs GRPO: Reinforcement Learning Training Objectives


1. Overview

Two popular reinforcement learning training algorithms are:

  • PPO — Proximal Policy Optimization

  • GRPO — Group Relative Policy Optimization

Both optimize a policy by maximizing expected reward, but they differ primarily in how the baseline used to compute the advantage is estimated.


2. Revisiting the RL Training Objective

The general RL objective aims to maximize:

  • A probability ratio term

    • Ratio between the current model and the old policy snapshot that generated the samples (a frozen reference model typically enters separately, via a KL penalty)

  • A reward-related term

    • The advantage, which is the reward centered by a baseline
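
Putting the two terms together, a minimal sketch of the (unclipped) per-token objective in Python; variable names are illustrative, and log-probabilities are assumed to be already gathered for the sampled tokens:

```python
import torch

# Sketch of the unclipped policy objective for one batch of sampled tokens.
# logp_new:  log-probs of the tokens under the model being trained
# logp_old:  log-probs of the same tokens under the policy that sampled them
# advantage: reward centered by a baseline (the topic of the rest of this note)
def policy_objective(logp_new, logp_old, advantage):
    ratio = torch.exp(logp_new - logp_old)  # probability ratio term
    return (ratio * advantage).mean()       # quantity to maximize
```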

Key question:

How do we compute the baseline used to calculate the advantage?


3. PPO (Proximal Policy Optimization)

3.1 Baseline Estimation in PPO

  • PPO uses a separate baseline (value) model

  • This model predicts the expected future reward (the value) at each token position

  • It is trained with a regression objective, using actually observed rewards as targets

  • Often implemented as:

    • The same LLM with an additional value head

This allows:

  • Advantage computation per token

  • Different tokens to receive different credit or blame
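
A minimal sketch of the value-head setup, assuming a Hugging Face-style causal LM that can return hidden states (all names here are illustrative):

```python
import torch
import torch.nn as nn

class ValueHeadModel(nn.Module):
    """Wraps a causal LM with a scalar value head (illustrative sketch)."""
    def __init__(self, base_lm, hidden_size):
        super().__init__()
        self.base_lm = base_lm                        # the LLM being trained
        self.value_head = nn.Linear(hidden_size, 1)   # one scalar per token

    def forward(self, input_ids, attention_mask=None):
        out = self.base_lm(input_ids, attention_mask=attention_mask,
                           output_hidden_states=True)
        hidden = out.hidden_states[-1]                # (batch, seq, hidden)
        values = self.value_head(hidden).squeeze(-1)  # (batch, seq)
        return out.logits, values
```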


3.2 Generalized Advantage Estimation (GAE)

  • PPO commonly uses GAE (Generalized Advantage Estimation)

  • GAE:

    • Propagates the final reward backward to earlier tokens

    • Uses a discount factor and a smoothing factor (hyperparameters)

    • Smooths reward signals across the sequence

Exact math is complex but not essential for intuition.
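
For reference, a compact sketch of the recursion, with gamma as the discount factor and lam as the smoothing factor (a minimal illustration, not a production implementation):

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one sequence (sketch).

    rewards: per-token rewards (often zero except at the final token)
    values:  per-token predictions from the baseline (value) model
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae  # discounted, smoothed credit
        advantages[t] = gae
    return advantages
```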


3.3 PPO Clipping Mechanism

Problem:

  • Probability ratios can become very large → unstable updates

Solution:

  • PPO clips the probability ratio

  • Prevents excessively large parameter updates

  • Ensures small, stable training steps

This clipping mechanism is the core idea of PPO.
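
In code, the clipping can be sketched as follows (eps is the clip range, commonly around 0.2; names are illustrative):

```python
import torch

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate objective (sketch)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the minimum keeps the update conservative in both directions
    return torch.min(unclipped, clipped).mean()
```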


3.4 PPO Training Setup

PPO requires:

  • Main LLM being trained

  • Frozen reference model

  • Baseline (value) estimation model

Downside:

  • Computationally expensive

  • Complex to manage multiple models


4. Motivation for GRPO

To reduce complexity and cost, newer algorithms were developed.

GRPO (Group Relative Policy Optimization) was introduced by DeepSeek with a key goal:

Eliminate the separate baseline estimation model.


5. GRPO (Group Relative Policy Optimization)

5.1 Core Idea

Instead of:

  • Training a model to predict expected rewards

GRPO:

  • Generates a group of multiple outputs for the same input

  • Computes the average reward of the group

  • Uses this average as the baseline

This baseline is computed on the fly.


5.2 Advantage Calculation in GRPO

Steps:

  1. Sample multiple outputs for the same input

  2. Compute rewards for each output

  3. Compute:

    • Mean reward

    • Standard deviation

  4. Normalize rewards:

    • Subtract mean

    • Divide by standard deviation

Result:

  • Advantages are centered around zero

  • High-performing outputs get positive advantage

  • Low-performing outputs get negative advantage
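
These steps fit in a few lines; a minimal sketch (the small eps guards against a zero standard deviation when all rewards in the group are identical):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one group of outputs
    sampled from the same input (sketch)."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)  # centered, unit-scale
```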


5.3 Sequence-Level Advantages

Key difference from PPO:

  • PPO computes token-level advantages

  • GRPO computes sequence-level advantages


6. Concrete Example (GRPO)

  • Sample 4 outputs for one input

  • Each output receives a reward (e.g., via unit tests or graders)

  • Baseline = average reward

  • Advantage = reward − baseline

Effects:

  • Outputs above average → probability increased

  • Outputs below average → probability decreased
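
Running the grpo_advantages sketch from Section 5.2 on concrete numbers (reward 1 for a passing output, 0 otherwise):

```python
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])  # e.g., 3 of 4 samples pass the unit tests
# baseline = mean = 0.75; raw advantages = [0.25, -0.75, 0.25, 0.25]
print(grpo_advantages(rewards))  # ≈ [0.5, -1.5, 0.5, 0.5] after dividing by std
```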


7. GRPO Training Objective

GRPO:

  • Retains PPO’s clipping mechanism

  • Changes only the advantage calculation

Training loop:

  1. Generate a group of outputs per input

  2. Compute rewards

  3. Compute average reward (baseline)

  4. Calculate advantages

  5. Update model using clipped objective
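
Tying the pieces together, a high-level sketch of one GRPO update, reusing ppo_clipped_objective and grpo_advantages from the sketches above; sample() and sequence_logprob() are hypothetical helpers standing in for generation and log-probability scoring:

```python
import torch

def grpo_step(model, old_model, prompts, reward_fn, group_size=4):
    """One GRPO update (sketch; sample and sequence_logprob are hypothetical)."""
    for prompt in prompts:
        # 1. Generate a group of outputs per input
        outputs = [sample(old_model, prompt) for _ in range(group_size)]
        # 2.-4. Compute rewards and turn them into group-relative advantages
        rewards = torch.tensor([reward_fn(prompt, o) for o in outputs])
        advantages = grpo_advantages(rewards)  # one advantage per sequence
        # 5. Update the model with the clipped objective (maximize -> negate)
        for output, adv in zip(outputs, advantages):
            logp_new = sequence_logprob(model, prompt, output)
            logp_old = sequence_logprob(old_model, prompt, output)
            loss = -ppo_clipped_objective(logp_new, logp_old, adv)
            loss.backward()  # gradients accumulate; optimizer step follows
```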


8. Comparison: PPO vs GRPO

Similarities

  • Both maximize expected reward

  • Both use probability ratio clipping

  • Both operate in the same RL training loop

Key Differences

Aspect         | PPO                  | GRPO
Baseline model | Separate value model | No baseline model
Advantage      | Token-level          | Sequence-level
Resource cost  | High                 | Lower
Complexity     | High                 | Simpler

9. Historical Context

Evolution of methods:

  • RLHF with PPO (in use before the launch of ChatGPT)

  • RL-AIF (AI feedback instead of humans)

  • GRPO with verifiers and reward models


10. Final Notes

  • PPO and GRPO differ only in how advantage is computed

  • GRPO eliminates the separate value model, significantly reducing model count, cost, and complexity

  • Both plug into the same RL training loop