Why I Write This Blog
Thinking models are wildly popular nowadays. I first delved into this area in September 2023. I then gradually lost track of it, until DeepSeek arrived. I want to keep collecting information about LLM reasoning (as well as post-training) and share my thoughts here.
💡 This post focuses mainly on reasoning RL. For agentic RL, please refer to this post.
Reinforcement Learning
Blogs
RL algorithms
blogs
general algorithms
- (GAE) High-Dimensional Continuous Control Using Generalized Advantage Estimation
- (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- From r to Q∗: Your Language Model is Secretly a Q-Function
- (PPO) Proximal Policy Optimization Algorithms
- (REINFORCE++) REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
- (GRPO) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- (DAPO) DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- (GSPO) Group Sequence Policy Optimization
- (CISPO) MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- RL optimization tricks
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
- DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
- SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
- (SPO1) Simple Policy Optimization
- (SPO2) Single-stream Policy Optimization
- (SPRO) Self-Guided Process Reward Optimization
- for training-inference matching
reward modeling
- text
- reward model for generative models
- learn from RMs
mix-training
Distillation
- On-Policy Distillation
- Self-Distillation
Engineering
- environment
- infra
- (verl) HybridFlow: A Flexible and Efficient RLHF Framework
- slime
- blogs
- training
- inference
Analyses
- RL training
- entropy
- KL divergence
- RL vs. SFT
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- 3.2 Understanding the Path from SFT to RL from a Unified Perspective
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
- Generalist Reward Models: Found Inside Large Language Models
- (DFT) On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- (NFT) Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
- others (training dynamics, mechanisms, …)
Thinking Models
text-based
- explicit reasoning
- DeepSeek-V3 Technical Report
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
- DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
- Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- Kimi K2: Open Agentic Intelligence
- Kimi K2.5: Visual Agentic Intelligence
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
- GLM-5: from Vibe Coding to Agentic Engineering
- MAI-Thinking-1: Building a Hill-Climbing Machine
- (Ring-1T) Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
- Skywork Open Reasoner 1 Technical Report
- implicit reasoning
- others
- fancy
- blogs
overthinking
- survey
- papers
- blogs
parallel thinking
visual reasoning
- survey
- papers
- $V^{*}$: Guided Visual Search as a Core Mechanism in Multimodal LLMs
- active perception
- tool use
- VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
- VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
- Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
- imagination
- blogs
Long Context
others
Evaluation
dataset
Analyses
implicit reasoning
interpretability
- https://arxiv.org/pdf/2512.15605
- How Reinforcement Learning After Next-Token Prediction Facilitates Learning
- Base Models Know How to Reason, Thinking Models Learn When
- Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Thought Anchors: Which LLM Reasoning Steps Matter?
- Understanding Reasoning in Thinking Language Models via Steering Vectors
- Chain-of-Thought Is Not Explainability
- Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization
- How Do LLMs Perform Two-Hop Reasoning in Context?