Why I Write This Blog
Thinking models are wildly popular these days. I first delved into this area in September 2023, then gradually drifted away from it until DeepSeek came along. I want to keep collecting information about LLM reasoning and share my thoughts here.
Thinking Models
text-based
- explicit reasoning
  - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  - Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
  - Kimi k1.5: Scaling Reinforcement Learning with LLMs
  - GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
  - Skywork Open Reasoner 1 Technical Report
- implicit reasoning
- others
- blogs
  - explicit reasoning
overthinking
- survey
- papers
- blogs
parallel thinking
visual reasoning
- survey
- papers
  - $V^{*}$: Guided Visual Search as a Core Mechanism in Multimodal LLMs
  - active perception
  - tool use
  - imagination
- blogs
others
Evaluation
- dataset
Analyses
- analyses
- interpretability
  - Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
  - Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
  - Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
  - Thought Anchors: Which LLM Reasoning Steps Matter?
  - Understanding Reasoning in Thinking Language Models via Steering Vectors
  - Chain-of-Thought Is Not Explainability
  - Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization
  - How Do LLMs Perform Two-Hop Reasoning in Context?
- theories
Reinforcement Learning
RL algorithms
- (GAE) High-Dimensional Continuous Control Using Generalized Advantage Estimation
- (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- From r to Q*: Your Language Model is Secretly a Q-Function
- (PPO) Proximal Policy Optimization Algorithms
- (REINFORCE++) REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
- (GRPO) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- (DAPO) DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- (GSPO) Group Sequence Policy Optimization
- (CISPO) MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
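
For quick reference, here is my shorthand for the core objectives behind several entries above (GAE, PPO, GRPO, DPO); notation is simplified and may differ from the papers.

$$
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l},
\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

$$
\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

$$
\hat{A}_i^{\mathrm{GRPO}} = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}
\quad \text{(rewards normalized within a group of $G$ sampled responses; no value network)}
$$

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
$$

As I understand them, DAPO, GSPO, and CISPO keep this clipped policy-gradient shape but change how the importance ratio, clipping, and aggregation are defined (token-level vs. sequence-level, clipping the IS weight itself, etc.).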
Blogs
reward modeling
analyses
- RL training
- entropy
- RL vs. SFT
  - 3.2 Understanding the Path from SFT to RL from a Unified Perspective
  - All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
  - Generalist Reward Models: Found Inside Large Language Models
  - (DFT) On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
  - (NFT) Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
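
A rough way to connect the RL vs. SFT entries above (my own shorthand, not a claim from any single paper): the SFT gradient looks like a policy gradient with the advantage pinned to 1 and samples drawn from the dataset rather than from the current policy, and papers such as DFT and NFT work in the space between the two objectives.

$$
\nabla_\theta \mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{(x,\,y^{*}) \sim \mathcal{D}}\big[\nabla_\theta \log \pi_\theta(y^{*} \mid x)\big],
\qquad
\nabla_\theta J_{\mathrm{RL}} = \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)}\big[A(x, y)\,\nabla_\theta \log \pi_\theta(y \mid x)\big]
$$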
Resource
- RL infra (see the training-loop sketch at the end of this section)
  - (verl) HybridFlow: A Flexible and Efficient RLHF Framework
  - slime
- blogs
  - training
  - inference
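
To make the RL infra entries concrete, below is a minimal, framework-agnostic sketch of the rollout → reward → group advantage → update loop that systems like verl and slime orchestrate across distributed workers. All function names and the toy reward are hypothetical placeholders, not the API of any real framework.

```python
# Minimal, framework-agnostic sketch of an RL-for-LLM training loop
# (rollout -> reward -> group advantage -> policy update).
# Every name and the toy reward are hypothetical placeholders,
# not the API of verl, slime, or any other framework.

import random
import statistics


def rollout(policy, prompt, group_size=4):
    """Sample a group of responses from the current policy (stubbed)."""
    return [policy(prompt) for _ in range(group_size)]


def verifiable_reward(prompt, response):
    """Rule-based reward, e.g. 1.0 if the final answer checks out (stubbed with noise)."""
    return random.random()


def group_advantages(rewards):
    """GRPO-style advantages: normalize rewards within the sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


def update_policy(policy, prompts, responses, advantages):
    """Placeholder for the actual policy-gradient step (PPO / GRPO / ...)."""
    pass


def train(policy, prompts, steps=3):
    for _ in range(steps):
        batch_responses, batch_advantages = [], []
        for prompt in prompts:
            responses = rollout(policy, prompt)
            rewards = [verifiable_reward(prompt, r) for r in responses]
            batch_responses.append(responses)
            batch_advantages.append(group_advantages(rewards))
        update_policy(policy, prompts, batch_responses, batch_advantages)


if __name__ == "__main__":
    toy_policy = lambda prompt: prompt + " <think> ... </think> 4"
    train(toy_policy, ["What is 2 + 2?"])
```

The engineering value of frameworks like verl and slime lies largely in what this sketch omits: scheduling rollout and training engines, synchronizing weights, and scaling generation.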