Post-Training in the Wild

How Fine-Tuning & Reinforcement Learning Enable Reasoning in LLMs


Reasoning ability has rapidly advanced in frontier LLMs such as ChatGPT, Claude, and DeepSeek.
Two post-training methods make this possible: fine-tuning and reinforcement learning (RL).
Both techniques teach models to think before answering, improving accuracy on complex tasks.


🤔 What "Reasoning Mode" Really Means

Frontier models often show a “thinking…” indicator before answering.
Under the hood, they generate internal reasoning tokens—multiple hypothesized paths—before responding.

This hidden reasoning helps models:

  • evaluate intermediate steps

  • converge on a more accurate answer

  • reduce brittle, pattern-matching errors
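
To make this concrete, here is a minimal sketch of splitting hidden reasoning from the visible answer, assuming the <think>/<answer> tag convention used in the examples below (the tag names and the sample string are illustrative, not any specific model's output format):

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate hidden reasoning tokens from the final answer.

    Assumes the <think>...</think> / <answer>...</answer> format used
    in the examples below; real models expose this differently.
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else response.strip(),
    )

reasoning, answer = split_reasoning(
    "<think> 6 * 7 = 42, check: 42 / 7 = 6 </think> <answer> 42 </answer>"
)
print(reasoning)  # 6 * 7 = 42, check: 42 / 7 = 6
print(answer)     # 42
```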


1️⃣ Fine-Tuning for Reasoning — Chain-of-Thought

The problem with simple answer training

Training only on (question → answer) pairs makes models:

  • output correct answers for easy problems

  • guess or fail on multi-step problems

The solution: train on step-by-step reasoning

Instead of:

<answer> 42 </answer>

We use:

<think> step-by-step reasoning... </think> <answer> 42 </answer>

This structured reasoning format is called Chain of Thought (CoT).
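
A single CoT training record could be assembled roughly like this; the prompt/completion field names and the helper itself are assumptions for illustration, not a particular framework's schema:

```python
import json

def make_cot_record(question: str, reasoning: str, answer: str) -> dict:
    """Build one supervised fine-tuning example in the CoT format.

    The model is trained to produce the reasoning *and* the answer,
    not just the answer. Tag names follow the examples above.
    """
    target = f"<think> {reasoning} </think> <answer> {answer} </answer>"
    return {"prompt": question, "completion": target}

record = make_cot_record(
    question="A train travels 120 km in 2 hours. What is its average speed?",
    reasoning="Average speed = distance / time = 120 km / 2 h = 60 km/h.",
    answer="60 km/h",
)
print(json.dumps(record, indent=2))
```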

Why it works

  • Models learn process patterns, not just outputs

  • Models generalize better to harder problems

  • LLMs can even generate reasoning templates to scale data creation (see the sketch below)

Outcome:
Models answer more multi-step math questions correctly after CoT fine-tuning.
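
One way to scale data creation, as the last bullet above suggests, is to let an existing LLM draft the reasoning and keep only drafts whose final answer matches the known ground truth. A rejection-sampling-style sketch, where draft_reasoning is a hypothetical stand-in for whatever model call you use:

```python
def build_cot_dataset(pairs, draft_reasoning, n_tries=4):
    """Expand plain (question, answer) pairs into CoT training records.

    `draft_reasoning(question)` is a placeholder for an LLM call that
    returns (reasoning_text, final_answer). Only drafts whose answer
    matches the known ground truth are kept, so incorrect reasoning
    never enters the training set.
    """
    records = []
    for question, gold in pairs:
        for _ in range(n_tries):
            reasoning, answer = draft_reasoning(question)
            if answer.strip() == gold.strip():
                records.append({
                    "prompt": question,
                    "completion": f"<think> {reasoning} </think> <answer> {gold} </answer>",
                })
                break
    return records
```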


2️⃣ Reinforcement Learning for Reasoning — Reward the Answer Only

With RL:

  • the format or quality of reasoning is irrelevant

  • the final answer’s correctness is what matters

Example:

<think> gibberish or mixed languages </think> <answer> 5 </answer>

If 5 is correct → rewarded.
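
A minimal sketch of such an answer-only reward, assuming the same tag format and a plain exact-match check (real pipelines typically use more forgiving, math-aware comparisons):

```python
import re

def answer_only_reward(response: str, gold: str) -> float:
    """Score a rollout by its final answer alone.

    Whatever appears inside <think>...</think> is ignored entirely;
    the reward is 1.0 if the extracted answer matches the reference,
    else 0.0.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # malformed output gets no reward
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0

print(answer_only_reward("<think> gibberish </think> <answer> 5 </answer>", "5"))      # 1.0
print(answer_only_reward("<think> careful steps </think> <answer> 7 </answer>", "5"))  # 0.0
```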

What RL enables

  • freedom to invent new reasoning shortcuts

  • improved efficiency and correctness

  • answers that surpass fine-tuning-only approaches

But RL can also cause:

  • unreadable reasoning traces

  • endless repetitive loops

  • mixed or random languages in reasoning paths


3️⃣ Case Study: DeepSeek R1-Zero

DeepSeek R1-Zero learned reasoning through RL alone, with no supervised fine-tuning stage.

Strengths

  • surprisingly capable reasoning from simple reward checks

Weaknesses

  • repetition problems

  • unreadable reasoning

  • heavy language mixing

Conclusion:
RL alone works, but lacks consistency and readability.


4️⃣ Frontier Recipe: Combine Fine-Tuning + RL

Frontier models typically train in multiple rounds:

Phase | Purpose
Fine-Tuning w/ CoT | Learn structured, readable reasoning
Reinforcement Learning | Improve correctness & efficiency
Repeat cycles | Blend reasoning + general dialog skills
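
Sketched as a loop, the recipe simply alternates these phases. The phase functions below are illustrative stubs standing in for full SFT and RL training runs, not real library calls:

```python
# Placeholder phase functions: in a real pipeline these would be full
# fine-tuning and RL runs; here they only mark where each phase goes.
def supervised_finetune(model, data):
    return model

def rl_optimize(model, prompts):
    return model

def post_train(model, cot_data, rl_prompts, dialog_data, rounds=2):
    """Alternate CoT fine-tuning with answer-rewarded RL, then blend
    general dialog data back in, repeating for several rounds."""
    for _ in range(rounds):
        model = supervised_finetune(model, cot_data)     # learn readable <think> steps
        model = rl_optimize(model, rl_prompts)           # reward correct final answers
        model = supervised_finetune(model, dialog_data)  # retain general conversation skill
    return model
```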

By the final rounds, models can:

  • reason deeply

  • speak naturally

  • handle both math/coding and general conversation


📌 Key Takeaways

Method | Strength | Weakness
Fine-Tuning w/ CoT | readable, structured reasoning | may be verbose, less efficient
Reinforcement Learning | discovers better strategies | unstable, unreadable reasoning
Combined approach | best performance | multi-stage training required

✔ In Summary

  • Fine-tuning teaches how to think → structured steps

  • RL teaches what works → correctness above all

  • Together, they create models that reason accurately and interact naturally