Thinking and Reasoning

The Purpose I Write This Blog

Thinking models are wildly popular nowadays. I first delved into this area in September 2023. I later gradually forgot about it, until DeepSeek came to life. I want to keep collecting information about LLM reasoning (as well as post-training) and share my thoughts here.

💡 This post is mainly focused on reasoning RL. For agentic RL, please refer to this post.

Reinforcement Learning

Blogs

- algorithms
  - Reinforcement Learning: From Policy Gradients to TRPO, PPO, DPO, and GRPO [强化学习：从策略梯度到TRPO、PPO、DPO、GRPO]
  - RL-PPO Theory That Everyone Can Understand [人人都能看懂的RL-PPO理论知识]
  - Reasoning LLM (3): LLM+RL [Reasoning LLM（三）：LLM+RL]
  - Common Misconceptions About RLHF [RLHF 常见的思维误区]

RL algorithms

- blogs
  - From GRPO to DAPO and GSPO: What, Why, and How
  - meta-RL
    - [RL Q&A] What Is Meta-Reinforcement Learning (Meta-RL), and How Does It Differ from Ordinary Reinforcement Learning? [【强化学习解惑】元强化学习（Meta-RL）是什么？它与普通强化学习有什么区别？]
- general algorithms
  - (GAE) High-Dimensional Continuous Control Using Generalized Advantage Estimation
  - (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
    - derivation [推导]
    - From r to Q*: Your Language Model is Secretly a Q-Function
    - An Interpretation of the DPO Follow-up "Your Language Model is Secretly a Q-Function": Is There a Link to OpenAI Q*? [DPO新作Your Language Model is Secretly a Q-Function解读，与OPENAI Q* 的联系？]
  - (PPO) Proximal Policy Optimization Algorithms
  - (REINFORCE++) REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
  - (GRPO) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  - (DAPO) DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  - (GSPO) Group Sequence Policy Optimization
  - (CISPO) MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- RL optimization tricks
  - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
  - DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
  - SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
  - (SPO1) Simple Policy Optimization
  - (SPO2) Single-stream Policy Optimization
  - (SPRO) Self-Guided Process Reward Optimization
- for training-inference matching
  - Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
  - (IcePop) Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
  - Small Leak Can Sink a Great Ship–Boost RL Training on MoE with IcePop!
  - IcePop: Stabilizing RL in MoE Models
- reward modeling
  - text
    - (PRM) Let’s Verify Step by Step
    - (POLAR) Pre-Trained Policy Discriminators are General Reward Models
  - reward model for generative models
    - Improving Video Generation with Human Feedback
    - VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
    - Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
- learn from RMs
  - Introducing Composer 2.5 [介绍 Composer 2.5]
  - (SDPO) Reinforcement Learning via Self-Distillation
  - (Critique-GRPO) Advancing LLM Reasoning with Natural Language and Numerical Feedback
- mix-training
  - SFT-then-RL
    - SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
  - Learning to Reason under Off-Policy Guidance
- Distillation
  - On-Policy Distillation
    - blogs
      - On-Policy Distillation
      - A Systematic Look at the Principles of On-Policy Distillation [系统聊聊 On-Policy Distillation 的原理]
      - Unpacking How LLM Capabilities Are Merged, and OPD [扒一扒大模型能力合版的门道和OPD]
      - SFT, RL, and On-Policy Distillation Through a Distributional Lens
      - Some Recent Notes on Off-Policy & On-Policy Distillation [近期关于Off Policy&On-Policy Distillation的一些总结]
      - Three Mainstream Directions in On-Policy Distillation over the Past Six Months: One Method Solving Two Problems [近半年 On-Policy Distillation 的三大主流方向：一个方法解决两道难题]
    - papers
      - OPRD: On-Policy Representation Distillation
  - Self-Distillation
    - papers
      - (SDPO) Reinforcement Learning via Self-Distillation
      - Self-Distillation Enables Continual Learning
      - (SDAR) Self-Distilled Agentic Reinforcement Learning
      - (OPSD) Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Engineering

- environment
  - Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization
- infra
  - (verl) HybridFlow: A Flexible and Efficient RLHF Framework
    - doc (cn), doc (en), repo
  - slime
    - In the Era of RL Scaling, What Kind of RL Framework Do We Need? [RL Scaling 时代，我们需要什么样的 RL 框架呢？]
- blogs
  - training
    - A Brief Chat on the Thriving, Fast-Moving Landscape of RL Frameworks [浅聊RL框架的勃勃生机、万物竞发]
    - How we built our multi-agent research system
    - verl
      - [AI Infra] Introduction to the VeRL Framework & Code Walkthrough [[AI Infra] VeRL 框架入门&代码带读]
      - An Analysis of the verl Framework from Scratch [从零开始的verl框架解析]
      - Internship Retrospective on Supporting DeepSeek-V3 671B RL Training in verl (Personal Edition) [verl RL支持训练deepseek-v3 671B实习复盘(个人版)]
      - OpenRLHF & verl Parameter Conversion Guide [OpenRLHF&Verl参数转换指南]
      - A Beginner's Reading of verl [verl小白解读]
    - A Comprehensive, In-Depth Guide to Distributed Parallelism Strategies for Large Models: DP/TP/PP/CP/EP/SP [一文深度全面解析大模型分布式并行策略：DP/TP/PP/CP/EP/SP]
    - Understanding Megatron-LM in Depth (2): Principles [深入理解 Megatron-LM（2）原理介绍]
    - DeepSpeed zero1, zero2, zero3 and FSDP: Differences Explained [DeepSpeed zero1，zero2，zero3和FSDP区别详解]
  - inference
    - SGLang: A New Direction for LLM Inference Engines [SGLang：LLM推理引擎发展新方向]
    - Illustrated Series on Accelerating Large-Model Computation: FlashAttention V1, from Hardware to Computation Logic [图解大模型计算加速系列：FlashAttention V1，从硬件到计算逻辑]

Analyses

- RL training
  - Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
- entropy
  - Reasoning LLM (5): The Entropy Reduction Process and Capability Boundaries [Reasoning LLM（五）：熵缩过程与能力边界]
  - [LLMxRL] Entropy Collapse and Mitigation Strategies [【LLMxRL】熵坍缩与缓解策略]
  - (clip/kl-cov) The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
  - (forking tokens) Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- KL divergence
  - A Comedy of Estimators: On KL Regularization in RL Training of LLMs
  - Choosing a KL Estimator in RL: From Numerical Unbiasedness to Gradient Correctness [RL 中的 KL 估计器选型：从数值无偏到梯度正确]
- RL vs. SFT
  - SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  - RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
  - 3.2 Understanding SFT to RL from a Unified Perspective [3.2 统一视角理解从 SFT 到 RL]
  - All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
  - Generalist Reward Models: Found Inside Large Language Models
  - (DFT) On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
  - From SFT to RL: Seeing Their Connection Step by Step [从 SFT 到 RL：一步步看清它们的联系]
  - (NFT) Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
- others (training dynamics, mechanisms, …)
  - How’s it going? Reinforcement learning in language models recruits a functional welfare axis

Thinking Models

text-based

- explicit reasoning
  - DeepSeek-V3 Technical Report
  - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  - DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
  - DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
  - Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
  - Kimi k1.5: Scaling Reinforcement Learning with LLMs
  - Kimi K2: Open Agentic Intelligence
  - Kimi K2.5: Visual Agentic Intelligence
  - GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
  - GLM-5: from Vibe Coding to Agentic Engineering
  - MAI-Thinking-1: Building a Hill-Climbing Machine
  - (Ring-1T) Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
  - Skywork Open Reasoner 1 Technical Report
- implicit reasoning
  - (Coconut) Training Large Language Models to Reason in a Continuous Latent Space
- others
  - ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
- fancy
  - From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
- blogs
  - A Top-Down Deep Dive into DeepSeek-R1, with Plenty of Details [自顶向下方式深度解读 DeepSeek-R1，内含大量细节]
  - MLA (1): Learning and Thoroughly Understanding the DeepSeek MLA Algorithm from the Code [MLA(1)：从代码角度学习和彻底理解 DeepSeek MLA 算法]
  - Understanding Thinking Models (LLM-Based Reasoning Models) from Scratch: O1, DeepSeek R1, Kimi K1.5 [从头理解思考模型（LLM based Reasoning Model），O1，DeepSeek R1，Kimi K1.5]

overthinking

- survey
  - Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
  - (repo) Awesome-Efficient-Reasoning-LLMs
- papers
  - Qwen3 Technical Report
  - AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
  - AdaptThink: Reasoning Models Can Learn When to Think
- blogs
  - Adaptive Fast/Slow-Thinking Reasoning Models (Adaptive Reasoning Model): Qwen3 Hybrid Thinking -> ByteDance AdaCoT -> Tsinghua AdaptThinking [自适应快慢思考推理模型（Adaptive Reasoning Model）：Qwen3混合思考->字节AdaCoT->清华AdaptThinking]

parallel thinking

- Deep Think with Confidence

visual reasoning

- survey
  - Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
- papers
  - $V^{*}$: Guided Visual Search as a Core Mechanism in Multimodal LLMs
  - active perception
    - DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
    - Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
    - GRIT: Teaching MLLMs to Think with Images
  - tool use
    - VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
    - VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
    - Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
    - Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
  - imagination
    - Thinking with Generated Images
    - Visual Planning: Let’s Think Only with Images
- blogs
  - A Short Summary of Thinking with Images [Thinking with Images 小结]

Long Context

- GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

others

- [Monte Carlo Tree Search] The MCT Self-Refine (MCTSr) Algorithm (with Code Walkthrough) [[蒙特卡洛搜索树] MCT Self-Refine (MCTSr)的算法（包含代码理解）]
- On PRMs and MCTS in Reasoning Models [聊聊推理模型中的PRMs与MCTS]

Evaluation

dataset

Analyses

implicit reasoning

- Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

interpretability

- https://arxiv.org/pdf/2512.15605
- How Reinforcement Learning After Next-Token Prediction Facilitates Learning
- Base Models Know How to Reason, Thinking Models Learn When
- Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Thought Anchors: Which LLM Reasoning Steps Matter?
- Understanding Reasoning in Thinking Language Models via Steering Vectors
- Chain-of-Thought Is Not Explainability
- Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization
- How Do LLMs Perform Two-Hop Reasoning in Context?

theories

- Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought

