Towards the end of last year, I wrote the article below, where I spoke about the emergence of reasoners and the thinker/actor pattern. In the early part of this year, I am seeing this prediction come true.
However, I was not paying the attention I should have to DeepSeek until one of my Bay Area friends pinged me and said, “Have you looked at the DeepSeek reasoner? It is giving me good results.” My friend is deep into research on thought prompting. That led me to read the paper below.
https://arxiv.org/pdf/2501.12948
In this article, I am going to share a non-mathematical explanation of the paper. I am doing some more deep reading of it to also understand the math behind the reward model.
While reading the paper, I felt that DeepSeek has introduced a unique approach that leverages reinforcement learning (RL) to enhance reasoning without an initial reliance on supervised fine-tuning (SFT). Here’s how they did it:
The Foundations: DeepSeek-R1-Zero
DeepSeek-R1-Zero is the first step in their exploration. The model was trained using a pure RL framework, starting from a base model without any prior SFT. The training used Group Relative Policy Optimization (GRPO), an RL algorithm that uses group-based scoring to guide policy optimization (this is the part I need to dig into more). This approach enabled the model to autonomously develop reasoning patterns like self-verification, reflection, and generating longer chains of thought (CoT).
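A minimal sketch of the group-relative scoring idea as I currently understand it: for each prompt, the policy samples a group of responses, scores them, and normalizes each reward against the group’s mean and standard deviation to get an advantage, which removes the need for a separate critic model. The function below is an illustration, not the paper’s implementation.

```python
import numpy as np

def group_relative_advantages(group_rewards):
    """Normalize each sampled response's reward against its own group:
    advantage_i = (r_i - mean(group)) / std(group).
    Responses that beat their group get positive advantages and are
    reinforced; below-average responses are discouraged."""
    rewards = np.asarray(group_rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: rewards for 4 responses sampled for the same prompt
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
```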
This phase included:
Reward Modeling: The reward model acts as the source of the training signal and guides the optimization direction of the RL process. Two types of rule-based rewards guided training: accuracy rewards, which evaluated the correctness of responses, and format rewards, which encouraged the model to organize its reasoning within structured <think> tags (a toy illustration follows after this list).
Self-Evolution: The model progressively improved through thousands of RL steps, with performance on benchmarks like AIME 2024 jumping from 15.6% to an impressive 71.0%.
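Here is a toy illustration of what rule-based accuracy and format rewards could look like. The exact rules and scoring values are my assumptions, not taken from the paper.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> tags, else 0.0."""
    return 1.0 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, expected_answer: str) -> float:
    """1.0 if the final answer (the text after the closing </think> tag)
    matches the expected answer exactly, else 0.0."""
    final_answer = response.split("</think>")[-1].strip()
    return 1.0 if final_answer == expected_answer.strip() else 0.0

def total_reward(response: str, expected_answer: str) -> float:
    """Combined training signal used to score each sampled response."""
    return accuracy_reward(response, expected_answer) + format_reward(response)

# Example
resp = "<think>12 * 13 = 120 + 36 = 156</think>156"
print(total_reward(resp, "156"))  # 2.0
```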
Despite these achievements, DeepSeek-R1-Zero faced challenges like poor readability and language mixing, prompting the team to refine their approach.
Advancing the Model: DeepSeek-R1
To overcome the limitations of DeepSeek-R1-Zero, the team introduced DeepSeek-R1, a model trained through a multi-stage pipeline that combined cold-start data with iterative RL and supervised fine-tuning.
The Multi-Stage Training Pipeline
Cold Start: Thousands of examples featuring long CoTs were curated to fine-tune the base model. This improved readability by incorporating structured formats and summaries in outputs.
Reasoning-Oriented RL: Following the cold start, the model underwent another round of large-scale RL, focusing on reasoning-intensive tasks like coding and mathematics.
Rejection Sampling & Fine-Tuning: After RL convergence, new supervised data was generated via rejection sampling(another area where I need to do some more digging). This data, combined with general-purpose supervised tasks, was used to retrain the model, further enhancing its versatility.
Final RL Phase: The model was aligned with human preferences using a combination of reasoning rewards and diverse prompt distributions.
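As far as I understand it, rejection sampling here means: sample many candidate responses from the RL-trained checkpoint, keep only the ones that pass the correctness check, and reuse those as supervised fine-tuning data. Below is a rough sketch under that assumption; `generate` and `reward_fn` are placeholders for the trained model and the rule-based reward discussed earlier, not anything from the paper’s code.

```python
def rejection_sample(prompt, expected_answer, generate, reward_fn, n_samples=16):
    """Sample several candidate responses for one prompt and keep only the
    ones the reward function accepts; the kept pairs become new SFT data."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)  # placeholder: call the RL-trained model
        if reward_fn(response, expected_answer) > 0:
            kept.append({"prompt": prompt, "response": response})
    return kept
```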
This iterative process yielded DeepSeek-R1, a model that excelled across a range of benchmarks, achieving performance comparable to OpenAI-o1-1217 in reasoning tasks and setting new records for open-source models on benchmarks like MATH-500 and AIME 2024.
Scaling Down: Distillation to Smaller Dense Models
DeepSeek didn’t stop with high-performing large models. Recognizing the importance of accessibility, they distilled DeepSeek-R1 into smaller, efficient models. Using the reasoning data generated by DeepSeek-R1, they fine-tuned open-source models like Qwen and Llama. The results were remarkable, with the 14B and 32B distilled models outperforming state-of-the-art open-source counterparts on benchmarks like LiveCodeBench and GPQA Diamond.
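My mental model of this distillation step is that it is plain supervised fine-tuning of a smaller open model on reasoning traces generated by DeepSeek-R1. The sketch below uses `distilgpt2` purely as a lightweight stand-in for a Qwen or Llama student, and the example trace is made up; it only shows the shape of one training step.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Lightweight stand-in for a smaller dense student model (Qwen/Llama in the paper).
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# One made-up reasoning trace of the kind the teacher (DeepSeek-R1) would generate.
example = {
    "prompt": "Solve: 12 * 13 = ?",
    "response": "<think>12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156</think>156",
}

# Distillation here is ordinary next-token prediction on prompt + teacher response.
text = example["prompt"] + "\n" + example["response"]
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # one gradient step of the supervised fine-tune
print(float(outputs.loss))
```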
Key Takeaways
While reading the paper, it looked to me like DeepSeek is following an approach that we usually see in humans. When we are kids, we learn a model of the world through reinforcement learning. In his book “A Brief History of Intelligence”, Max Bennett calls this “breakthrough #2” in human intelligence.
Animals learn by first performing random exploratory actions and then adjusting future actions based on the valence of the outcomes.
Once the kid grows into an adult, they teach other kids to hone their base skills and develop niche skills as architects, mathematicians, or physicists. The distillation process used to create the smaller dense models looked similar to this phenomenon.
I end this article with a thought that has been on my mind and on which I am doing some further research.
Are thoughts and language connected? Do we need language to be able to think, or can we think without knowing any language?