Why I Write This Blog
Thinking models are wildly popular these days. I first delved into this area in September 2023, then gradually drifted away from it until DeepSeek came along. I want to keep collecting information about LLM reasoning and share my thoughts here.
Thinking Models
text-based
- explicit reasoning
  - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  - Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
  - Kimi k1.5: Scaling Reinforcement Learning with LLMs
  - GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
  - Skywork Open Reasoner 1 Technical Report
- implicit reasoning
- others
- blogs
  - explicit reasoning
overthinking
- survey
- papers
- blogs
parallel thinking
visual reasoning
- survey
- papers
  - $V^{*}$: Guided Visual Search as a Core Mechanism in Multimodal LLMs
  - active perception
  - tool use
  - imagination
- blogs
others
Evaluation
- dataset
Analyses
- analyses
- interpretability
  - Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
  - Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
  - Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
  - Thought Anchors: Which LLM Reasoning Steps Matter?
  - Understanding Reasoning in Thinking Language Models via Steering Vectors
  - Chain-of-Thought Is Not Explainability
  - Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization
  - How Do LLMs Perform Two-Hop Reasoning in Context?
- theories
Reinforcement Learning
RL algorithms
- (GAE) High-Dimensional Continuous Control Using Generalized Advantage Estimation
- (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- From r to Q*: Your Language Model is Secretly a Q-Function
- (PPO) Proximal Policy Optimization Algorithms
- (REINFORCE++) REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
- (GRPO) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- (DAPO) DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- (GSPO) Group Sequence Policy Optimization
- (CISPO) MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
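
For quick reference, here is my shorthand for the core objectives behind several entries above (GAE, PPO, GRPO, DPO); notation is simplified and may differ from the papers.

$$
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l},
\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

$$
\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

$$
\hat{A}_i^{\mathrm{GRPO}} = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}
\quad \text{(rewards normalized within a group of $G$ sampled responses; no value network)}
$$

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
$$

As I understand them, DAPO, GSPO, and CISPO keep this clipped policy-gradient shape but change how the importance ratio, clipping, and aggregation are defined (token-level vs. sequence-level, clipping the IS weight itself, etc.).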
Blogs
reward modeling
analyses
- RL training
- entropy
- RL vs. SFT
  - 3.2 Understanding the Path from SFT to RL from a Unified Perspective
  - All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
  - Generalist Reward Models: Found Inside Large Language Models
  - (DFT) On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
  - (NFT) Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
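
A rough way to connect the RL vs. SFT entries above (my own shorthand, not a claim from any single paper): the SFT gradient looks like a policy gradient with the advantage pinned to 1 and samples drawn from the dataset rather than from the current policy, and papers such as DFT and NFT work in the space between the two objectives.

$$
\nabla_\theta \mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{(x,\,y^{*}) \sim \mathcal{D}}\big[\nabla_\theta \log \pi_\theta(y^{*} \mid x)\big],
\qquad
\nabla_\theta J_{\mathrm{RL}} = \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)}\big[A(x, y)\,\nabla_\theta \log \pi_\theta(y \mid x)\big]
$$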
Resource
- RL infra (see the training-loop sketch at the end of this section)
  - (verl) HybridFlow: A Flexible and Efficient RLHF Framework
  - slime
- blogs
  - training
  - inference
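
To make the RL infra entries concrete, below is a minimal, framework-agnostic sketch of the rollout → reward → group advantage → update loop that systems like verl and slime orchestrate across distributed workers. All function names and the toy reward are hypothetical placeholders, not the API of any real framework.

```python
# Minimal, framework-agnostic sketch of an RL-for-LLM training loop
# (rollout -> reward -> group advantage -> policy update).
# Every name and the toy reward are hypothetical placeholders,
# not the API of verl, slime, or any other framework.

import random
import statistics


def rollout(policy, prompt, group_size=4):
    """Sample a group of responses from the current policy (stubbed)."""
    return [policy(prompt) for _ in range(group_size)]


def verifiable_reward(prompt, response):
    """Rule-based reward, e.g. 1.0 if the final answer checks out (stubbed with noise)."""
    return random.random()


def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within the sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


def update_policy(policy, prompts, responses, advantages):
    """Placeholder for the actual policy-gradient step (PPO / GRPO / ...)."""
    pass


def train(policy, prompts, steps=3):
    for _ in range(steps):
        batch_responses, batch_advantages = [], []
        for prompt in prompts:
            responses = rollout(policy, prompt)
            rewards = [verifiable_reward(prompt, r) for r in responses]
            batch_responses.append(responses)
            batch_advantages.append(group_advantages(rewards))
        update_policy(policy, prompts, batch_responses, batch_advantages)


if __name__ == "__main__":
    toy_policy = lambda prompt: prompt + " <think> ... </think> 4"
    train(toy_policy, ["What is 2 + 2?"])
```

The engineering value of frameworks like verl and slime lies largely in what this sketch omits: scheduling rollout and training engines, synchronizing weights, and scaling generation.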