# Thinking and Reasoning
## Why I Write This Blog

Thinking models are wildly popular nowadays. I first delved into this area in September 2023, then gradually drifted away from it until DeepSeek came along. I want to keep collecting information about LLM reasoning and share my thoughts here.

## Thinking Models

### Text-based explicit reasoning

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
- Skywork Open Reasoner 1 Technical Report

### Implicit reasoning

- (Coconut) Training Large Language Models to Reason in a Continuous Latent Space (see the latent-thought sketch at the end of the Analyses section)

### Others

- ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline

### Blogs

- A Top-Down Deep Dive into DeepSeek-R1, with Plenty of Details
- MLA (1): Learning and Thoroughly Understanding the DeepSeek MLA Algorithm from the Code
- Understanding Thinking Models (LLM-Based Reasoning Models) from Scratch: O1, DeepSeek R1, Kimi K1.5

## Overthinking

### Survey

- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- (repo) Awesome-Efficient-Reasoning-LLMs

### Papers

- Qwen3 Technical Report
- AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
- AdaptThink: Reasoning Models Can Learn When to Think

### Blogs

- Adaptive Fast/Slow-Thinking Reasoning Models: Qwen3 Hybrid Thinking -> ByteDance AdaCoT -> Tsinghua AdaptThink

## Parallel Thinking

- Deep Think with Confidence (see the confidence-voting sketch at the end of the Analyses section)

## Visual Reasoning

### Survey

- Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

### Papers

- $V^{*}$: Guided Visual Search as a Core Mechanism in Multimodal LLMs

#### Active perception

- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
- GRIT: Teaching MLLMs to Think with Images

#### Tool use

- VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
- VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
- Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

#### Imagination

- Thinking with Generated Images
- Visual Planning: Let’s Think Only with Images

### Blogs

- A Short Summary of Thinking with Images

## Others

- [Monte Carlo Tree Search] The MCT Self-Refine (MCTSr) Algorithm, with a Code Walkthrough
- A Chat about PRMs and MCTS in Reasoning Models

## Evaluation

### Dataset

## Analyses

### Implicit reasoning

- Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

### Interpretability

- How Reinforcement Learning After Next-Token Prediction Facilitates Learning
- Base Models Know How to Reason, Thinking Models Learn When
- Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Thought Anchors: Which LLM Reasoning Steps Matter?
- Understanding Reasoning in Thinking Language Models via Steering Vectors
- Chain-of-Thought Is Not Explainability
- Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization
- How Do LLMs Perform Two-Hop Reasoning in Context?

### Theories

- Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
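Since the Coconut paper, the superposition analysis, and the theory entry above all revolve around the same mechanism (feeding the last hidden state back as the next input embedding instead of decoding a token), here is a minimal sketch of that latent-thought loop. It assumes a HuggingFace-style forward pass; the function name and the `num_latent_steps` knob are mine, not the paper's.

```python
import torch

def latent_thought_rollout(model, input_embeds, num_latent_steps):
    """Roll out continuous 'thoughts' in the spirit of Coconut: each step
    appends the final-layer hidden state of the last position directly to the
    input embeddings, skipping token decoding entirely."""
    embeds = input_embeds  # (batch, seq, hidden)
    for _ in range(num_latent_steps):
        # Assumes a HuggingFace-style model that can return hidden states.
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]  # (batch, 1, hidden)
        # The continuous thought lives in embedding space, not token space.
        embeds = torch.cat([embeds, last_hidden], dim=1)
    return embeds  # feed into a final pass that decodes the answer as text
```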
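And for the parallel-thinking entry (Deep Think with Confidence), a sketch of confidence-filtered majority voting over sampled reasoning traces. I am assuming one scalar confidence per trace, such as its mean token log-probability; the paper works with more local confidence measures and online early stopping, which this ignores.

```python
from collections import Counter

def confidence_filtered_vote(traces, keep_frac=0.9):
    """Majority-vote over final answers, but only among the most confident
    reasoning traces. `traces` is a list of (answer, confidence) pairs."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, int(keep_frac * len(ranked)))]
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]

# e.g. confidence_filtered_vote([("42", -0.3), ("41", -1.7), ("42", -0.5)]) -> "42"
```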
## Reinforcement Learning

### RL algorithms

- (GAE) High-Dimensional Continuous Control Using Generalized Advantage Estimation
- (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  - From r to Q*: Your Language Model is Secretly a Q-Function
  - An explainer of the newer DPO paper "Your Language Model is Secretly a Q-Function", and its connection to OpenAI's Q*?
- (PPO) Proximal Policy Optimization Algorithms
- (REINFORCE++) REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
- (GRPO) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- (DAPO) DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- (GSPO) Group Sequence Policy Optimization
- (CISPO) MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Quick sketches of GAE, the PPO/GRPO objectives, and the DPO loss appear in the appendix at the end of this post.

### Blogs

#### Algorithms

- RL-PPO Theory That Everyone Can Understand
- Reasoning LLM (3): LLM + RL
- Common Misconceptions about RLHF

### Reward modeling

#### Text

- (PRM) Let's Verify Step by Step
- (POLAR) Pre-Trained Policy Discriminators are General Reward Models

#### Reward models for generative models

- Improving Video Generation with Human Feedback
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
- Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

### Analyses

#### RL training

- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

#### Entropy

- Reasoning LLM (5): Entropy Collapse and Capability Boundaries
- [LLM x RL] Entropy Collapse and Mitigation Strategies
- (Clip-Cov / KL-Cov) The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- (forking tokens) Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (see the token-entropy sketch in the appendix)

#### RL vs. SFT

- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-Training
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Understanding the Path from SFT to RL under a Unified View
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
- Generalist Reward Models: Found Inside Large Language Models
- (DFT) On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
- From SFT to RL: Seeing Their Connection Step by Step
- (NFT) Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

## Resources

### RL infra

- (verl) HybridFlow: A Flexible and Efficient RLHF Framework (doc, repo)
- (slime) In the Era of RL Scaling, What Kind of RL Framework Do We Need?

### Blogs

#### Training

- A Quick Look at the Thriving, Fast-Evolving Landscape of RL Frameworks
- How we built our multi-agent research system
- verl
  - [AI Infra] An Introduction to the verl Framework, with a Guided Code Read
  - A From-Scratch Walkthrough of the verl Framework
  - A Personal Retrospective on an Internship: Adding verl RL Training Support for DeepSeek-V3 671B
  - OpenRLHF & verl Parameter Conversion Guide
  - verl Explained for Beginners
- A Deep and Comprehensive Guide to Distributed Parallelism Strategies for Large Models: DP/TP/PP/CP/EP/SP
- Understanding Megatron-LM in Depth (2): An Introduction to the Principles
- A Detailed Explanation of the Differences between DeepSpeed ZeRO-1/2/3 and FSDP

#### Inference

- SGLang: A New Direction for LLM Inference Engines
- Illustrated LLM Compute Acceleration Series: FlashAttention V1, from Hardware to Compute Logic
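## Appendix: Quick Sketches

These are quick references for the algorithm entries above: standard formulas and hedged sketches, not faithful reproductions of any single paper's setup.

GAE estimates the advantage as an exponentially weighted sum of TD residuals:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}$$

Setting $\lambda = 0$ recovers the one-step TD error, while $\lambda = 1$ gives the Monte Carlo return minus the value baseline, trading bias for variance.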
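PPO and GRPO share the clipped surrogate objective; GRPO's change is to drop the learned value baseline and instead normalize rewards within a group of $G$ responses sampled for the same prompt. A minimal sketch (sequence-level advantages and the `1e-8` stabilizer are my simplifications):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize the rewards of G responses to the
    same prompt by the group mean and std, so no value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate; GRPO plugs the group-relative
    advantages above into the same objective."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```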
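The DPO loss in code: the policy's implicit reward is $\beta(\log \pi_\theta - \log \pi_{\mathrm{ref}})$, and under a Bradley-Terry model the chosen response should out-score the rejected one. Inputs here are summed per-sequence log-probabilities:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: maximize the margin between the implicit rewards of the chosen
    and rejected responses, measured against a frozen reference policy."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```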
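Finally, for the entropy thread: "Beyond the 80/20 Rule" restricts policy-gradient updates to the minority of high-entropy "forking" tokens. A sketch of the per-token entropy and mask (selecting globally across the batch is my simplification, and `top_frac` is just a knob for that fraction):

```python
import torch

def high_entropy_token_mask(logits: torch.Tensor, top_frac: float = 0.2) -> torch.Tensor:
    """Compute per-token entropy of the next-token distribution and mask the
    highest-entropy ('forking') tokens, e.g. to restrict RL updates to them."""
    logp = torch.log_softmax(logits, dim=-1)    # (batch, seq, vocab)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (batch, seq)
    k = max(1, int(top_frac * entropy.numel()))
    threshold = entropy.flatten().topk(k).values.min()
    return entropy >= threshold                 # True on tokens kept for updates
```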