Why I Write This Blog
To get started in mech interp research, we need a big-picture understanding of the area, so I wrote this blog as a summary of the field to help you (and me) choose a research topic.
Circuit Discovery
Methods
- basic
- activation patching (causal mediation / interchange interventions…; see the sketch after this list)
- path patching
- scaling techniques: attribution patching (first-order sketch after the resources below)
- DAS (distributed alignment search): roughly, activation patching along learned directions?
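As a concrete anchor for the patching family above, here is a minimal, hedged sketch of activation patching on a toy PyTorch model. The `nn.Sequential` stand-in, the hook site, and the normalized metric are my own illustrative choices, not any particular paper's setup; in practice you would hook individual attention heads or MLP outputs of a real transformer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for a transformer; in practice you would hook a real component
# (an attention head's output, an MLP layer, a residual-stream position, ...).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

clean_x = torch.randn(1, 8)    # prompt where the behavior of interest occurs
corrupt_x = torch.randn(1, 8)  # counterfactual prompt that breaks the behavior

def run_with_cache(x):
    """Run the model and cache the activation at the site we want to patch."""
    cache = {}
    handle = model[0].register_forward_hook(lambda m, i, o: cache.update(site=o.detach()))
    out = model(x)
    handle.remove()
    return out, cache

clean_out, clean_cache = run_with_cache(clean_x)
corrupt_out, _ = run_with_cache(corrupt_x)

# Activation patching: rerun the corrupted prompt, but overwrite the hooked site
# with the activation recorded on the clean prompt.
handle = model[0].register_forward_hook(lambda m, i, o: clean_cache["site"])
patched_out = model(corrupt_x)
handle.remove()

# Normalized restoration: 1.0 means the patch fully recovers the clean behavior.
restored = (patched_out - corrupt_out) / (clean_out - corrupt_out + 1e-9)
print(restored.item())
```

In this toy the patched site feeds everything downstream, so restoration is trivially close to 1; on a real model you sweep the patch over components and the ones whose patching restores the behavior form the candidate circuit (path patching additionally restricts the patch to specific downstream routes).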
🔭 resources
- inspiration
- (ROME) Locating and Editing Factual Associations in GPT
- Attribution patching: Activation patching at industrial scale
- (ACDC) Towards Automated Circuit Discovery for Mechanistic Interpretability
- Attribution Patching Outperforms Automated Circuit Discovery
- AtP*: An efficient and scalable method for localizing llm behaviour to components
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses
- new
- Using SAE
- Contextual Decomposition
- Edge Pruning?
- Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning
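For the "scaling techniques: attribution patching" item above, a hedged sketch of the first-order idea: instead of one forward pass per patch site, take gradients on the corrupted run and estimate the effect of patching as (a_clean - a_corrupt) · ∂metric/∂a. The toy model and the `.sum()` metric are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

def cache_act(x, with_grad=False):
    """Cache the activation at the patch site, and optionally its gradient
    with respect to a scalar metric (here simply the model output)."""
    store = {}
    def hook(module, inp, out):
        store["act"] = out
        if with_grad:
            out.retain_grad()
    handle = model[0].register_forward_hook(hook)
    metric = model(x).sum()
    handle.remove()
    if with_grad:
        metric.backward()
        store["grad"] = store["act"].grad
    return store

clean = cache_act(clean_x)
corrupt = cache_act(corrupt_x, with_grad=True)

# First-order estimate of what activation patching would do at this site:
#   delta_metric ≈ (a_clean - a_corrupt) · d(metric)/d(a), gradient on the corrupted run.
# One backward pass gives this estimate for every site at once, which is what
# makes attribution patching cheap at scale.
attribution = ((clean["act"] - corrupt["act"]) * corrupt["grad"]).sum()
print(attribution.item())
```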
Evaluation
lack of ground truth
- human
- interpretability
- automatic
faithfulness (see the sketch below)
how much of the full model's performance the discovered circuit can account for.
completeness
computational efficiency
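A hedged sketch of the faithfulness metric described above, written as a small helper. Normalizing against a fully-ablated baseline is one common convention (the exact recipe varies across papers), and the numbers below are made up for illustration.

```python
def faithfulness(perf_circuit: float, perf_full: float, perf_ablated: float) -> float:
    """Fraction of the full model's performance (above a fully-ablated baseline)
    recovered when only the circuit is kept and everything else is ablated."""
    return (perf_circuit - perf_ablated) / (perf_full - perf_ablated)

# Made-up numbers: full model 0.90, fully-ablated baseline 0.10, circuit-only 0.74.
print(faithfulness(0.74, 0.90, 0.10))  # 0.8: the circuit accounts for 80% of performance
```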
Issues
- ablation methods: dropout is itself a form of ablation, so does zero ablation make sense?
- superposition: do we need the help of SAEs?
Dictionary Learning
SAE
Training and optimization
- proper SAE width
- dead neurons
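A minimal training sketch for the items above (dictionary width, dead neurons), assuming a vanilla ReLU SAE with an L1 penalty on random stand-in activations. The width, learning rate, and L1 coefficient are arbitrary placeholders, and real setups add details such as decoder-norm constraints or resampling of dead features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE: overcomplete dictionary, ReLU features, L1 sparsity penalty."""
    def __init__(self, d_model: int = 64, d_dict: int = 512):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f         # reconstruction, features

torch.manual_seed(0)
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3
fires = torch.zeros(512)              # how often each dictionary feature fires

for step in range(100):               # stand-in loop over streamed activations
    acts = torch.randn(256, 64)       # would be real residual-stream activations
    recon, f = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    fires += (f > 0).float().sum(dim=0)

dead = int((fires == 0).sum())        # "dead neurons": features that never fire
print(f"dead features: {dead} / 512")
```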
🔭 resources
- A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
- Circuit Tracing: Revealing Computational Graphs in Language Models
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
- Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Transcoders Find Interpretable LLM Feature Circuits
- Scaling and evaluating sparse autoencoders
- Improving Dictionary Learning with Gated Sparse Autoencoders
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Evaluation
- human
- auto
🔭 resources
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
- (Anthropic, 2024-8, contrastive eval & sort eval) Interpretability Evals for Dictionary Learning
- (RAVEL) RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- Language models can explain neurons in language models
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Analysis
- A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Applications
- SAE + feature discovery
🔭 resources
- SAE + circuit discovery
🔭 resources
- SAE + explain model components
- SAE + explain model behaviors
- SAE + model steering
Steering vectors
activation steering
- Steering Language Models With Activation Engineering
- Steering Llama 2 via Contrastive Activation Addition
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- Function vectors in large language models
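A hedged sketch of the contrastive-activation-addition idea from the list above: build a steering vector as the difference of mean activations on two contrasting prompt sets, then add a scaled copy at a chosen layer via a forward hook. The toy model, the random "activations", and the value of `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

# Stand-ins for activations collected at one layer on two contrasting prompt sets
# (e.g. sycophantic vs. non-sycophantic completions).
pos_acts = torch.randn(32, 16) + 1.0
neg_acts = torch.randn(32, 16) - 1.0

steering_vec = pos_acts.mean(0) - neg_acts.mean(0)  # difference of means
alpha = 4.0                                         # steering strength (arbitrary)

def steer(module, inp, out):
    # Add the scaled steering vector to this layer's output at inference time.
    return out + alpha * steering_vec

x = torch.randn(1, 16)
baseline = model(x)
handle = model[0].register_forward_hook(steer)
steered = model(x)
handle.remove()
print((steered - baseline).norm().item())  # the intervention changes the output
```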
feature steering
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Evaluating feature steering: A case study in mitigating social biases
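A hedged sketch of feature steering: clamp a chosen SAE feature to a fixed value and add the decoded difference back into the residual stream. The untrained SAE weights, the feature index, and the clamp value are placeholders; with a trained SAE the clamped feature corresponds to an interpretable concept.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict = 64, 512
enc, dec = nn.Linear(d_model, d_dict), nn.Linear(d_dict, d_model)  # stand-in SAE
feature_idx, clamp_value = 137, 10.0                               # placeholders

def feature_steer(resid: torch.Tensor) -> torch.Tensor:
    """Clamp one SAE feature and add the decoded *difference* back to the
    residual stream, so the SAE's reconstruction error is left untouched."""
    f = torch.relu(enc(resid))
    f_edit = f.clone()
    f_edit[..., feature_idx] = clamp_value
    return resid + (dec(f_edit) - dec(f))

resid = torch.randn(1, d_model)
print((feature_steer(resid) - resid).norm().item())
```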
representation engineering
Others
- LUNAR: LLM Unlearning via Neural Activation Redirection
- Mechanistically Eliciting Latent Behaviors in Language Models
Model Diffing
Stage-wise
- fine-tuning
- fine-tuning interp
- scaling
- Stage-Wise Model Diffing
- Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- Understanding Catastrophic Forgetting in Language Models via Implicit Inference
- Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- intrinsic dimension
Dataset-wise
Algorithm-wise
Representation equivariance
- meta-SNE
- model stitching
- SVCCA and similar methods
- other analyses
- theory
Explain Model Components
explain neurons, attention heads and circuits
Explain neurons
🔭 resources
- Finding Neurons In A Haystack
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language
- Language models can explain neurons in language models
- Multimodal Neurons in Artificial Neural Networks
- Finding Safety Neurons in Large Language Models
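A common first step behind the "explain neurons" resources above is to collect a neuron's maximally-activating examples and then write (or have an LLM write) an explanation of what they share. A hedged toy sketch, with random inputs standing in for a token corpus and a made-up neuron index:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(32, 128), nn.GELU())  # toy stand-in for an MLP layer
neuron_idx = 7                                      # made-up neuron of interest

tokens = [f"tok_{i}" for i in range(1000)]          # stand-in for corpus tokens
inputs = torch.randn(1000, 32)                      # stand-in for their inputs
with torch.no_grad():
    acts = mlp(inputs)[:, neuron_idx]

# Top-k max-activating examples: the raw material for a human- or LLM-written
# natural-language explanation of what the neuron responds to.
top = torch.topk(acts, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{tokens[idx.item()]}: {score.item():.3f}")
```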
Explain attention heads
different heads in one layer vs. heads in different layers -> grammatical vs. semantic features
attention pattern
special heads
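A hedged sketch of the attention-pattern analyses above: score a head against simple templates, e.g. a previous-token head puts its mass on position i-1 and an attention-sink head on the first token. The random pattern below stands in for a head's cached attention probabilities.

```python
import torch

torch.manual_seed(0)
seq_len = 12
# Stand-in for one head's attention probabilities (rows: queries, cols: keys);
# in practice this comes from the model's cached attention patterns.
pattern = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
pattern = torch.tril(pattern)                       # apply the causal mask
pattern = pattern / pattern.sum(-1, keepdim=True)   # renormalize each row

# Previous-token head score: average attention from position i to position i-1.
prev_token_score = pattern.diagonal(-1).mean().item()
# Attention-sink score: average attention paid to the first token.
sink_score = pattern[:, 0].mean().item()
print(f"prev-token: {prev_token_score:.3f}, first-token sink: {sink_score:.3f}")
```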
Explain circuits
understand specific circuits at the subspace level
- (IOI) Interpretability in The Wild: A Circuit For Indirect Object Identification in GPT-2 Small
- What Do the Circuits Mean? A Knowledge Edit View
- (DAS) Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
- LLM Circuit Analyses Are Consistent Across Training and Scale
Explain layernorm
Others
Explain Model Behaviors
Feature representations
- linear representations (a linear-probe sketch follows at the end of this block)
- theory
- multilingual representations
- Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs
- Emerging Cross-lingual Structure in Pretrained Language Models
- Probing the Emergence of Cross-lingual Alignment during LLM Training
- Exploring Alignment in Shared Cross-lingual Spaces
- mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?
- Probing LLMs for Joint Encoding of Linguistic Categories
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure
- multimodal representations
- safety reprs
🔭 resources
- nonlinear representations
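The linear-probe sketch referenced under "linear representations" above: fit a linear classifier on frozen activations and check whether a concept is linearly decodable. The synthetic activations and labels below are constructed so the probe succeeds by design; on real activations you would train on one split and report held-out accuracy.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n = 64, 2000
# Synthetic activations whose label is, by construction, a linear function of them.
concept_dir = torch.randn(d_model)
acts = torch.randn(n, d_model)
labels = (acts @ concept_dir > 0).float()

probe = nn.Linear(d_model, 1)                 # the linear probe
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss_fn(probe(acts).squeeze(-1), labels).backward()
    opt.step()

acc = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean().item()
print(f"probe accuracy: {acc:.2f}")  # ~1.0 here by design; use a held-out split on real data
```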
Model capabilities
training (learning) dynamics
in-context learning
chain of thought (CoT)
how and why step by step?
zero-shot CoT ???
- analyses
- Iteration Head: A Mechanistic Study of Chain-of-Thought
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
- An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs
- Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning
- A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning
- Towards Understanding How Transformer Perform Multi-step Reasoning with Matching Operation
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- unfaithful CoT
- Does the model already know the answer while it is reasoning, or does it really pursue a goal?
- analyses
reasoning
planning
instruction following
- how does reinforcement learning change the inside of a model?
- understand RL at a mechanistic level
- more efficient RLxF
- SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
knowledge
- *Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
- *(Entity Recognition and Hallucinations) On the Biology of a Large Language Model
- *Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
- Dissecting Recall of Factual Associations in Auto-Regressive Language Models
- Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts
- Locating and Editing Factual Associations in GPT
- Characterizing Mechanisms for Factual Recall in Language Models
- Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans
- The Geometry of Concepts: Sparse Autoencoder Feature Structure
memorization & generalization phase transition
learning dynamics
duplication
self-repair
massive activations
Narrow tasks
- counting
- greater-than
- Indirect Object Identification
- gender
Interpretable model structure
also called intrinsic interpretability
Modifying Model Components
Reengineering Model Architecture
(Besides interpretable model architectures, I also list some brand-new architectures here.)
CBM (Concept Bottleneck Models; a minimal sketch follows this list)
Backpack Language Models
others
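The CBM sketch referenced above, as a minimal hedged example: the label head only sees a small vector of predicted concepts, so those concepts can be inspected and intervened on at test time. Layer sizes and names are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Input -> predicted concepts -> label. The label head sees only the small
    concept vector, so concepts can be inspected and edited at test time."""
    def __init__(self, d_in: int = 32, n_concepts: int = 8, n_classes: int = 3):
        super().__init__()
        self.concept_net = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(),
            nn.Linear(64, n_concepts), nn.Sigmoid(),
        )
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        concepts = self.concept_net(x)        # interpretable bottleneck
        return self.label_net(concepts), concepts

model = ConceptBottleneckModel()
logits, concepts = model(torch.randn(4, 32))
print(concepts.shape, logits.shape)
```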
new architectures
- A Path Towards Autonomous Machine Intelligence
- Large Concept Models: Language Modeling in a Sentence Representation Space
- Large Language Diffusion Models
- Fractal Generative Models
- (VLM single transformer) The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
- JEPA
Brain Inspired
- Findings
- Creations
Other Methods & Analysis
decomposing a model
representation
- SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
- (CKA) Similarity of Neural Network Representations Revisited
- Neural Network Representation Metrics, Part 1 (神经网络表征度量(一); Chinese blog post)
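A hedged sketch of linear CKA (from "Similarity of Neural Network Representations Revisited"), which compares two activation matrices after centering; the random matrices below are stand-ins for activations from two layers or two models.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_examples, d)."""
    X = X - X.mean(0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2      # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

torch.manual_seed(0)
A = torch.randn(500, 64)                    # stand-in activations from one layer
Q, _ = torch.linalg.qr(torch.randn(64, 64)) # random orthogonal matrix
print(linear_cka(A, A @ Q))                 # = 1: CKA ignores rotations of the basis
print(linear_cka(A, torch.randn(500, 32)))  # near 0: unrelated representations
```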
Application
AI alignment
3H (helpful, honest, harmless)
Avoid bias and harmful behaviors
concept-based interpretability
representation-based interpretability
red-teaming
perturbations
backdoor detection, red-teaming, capability discovery
anomaly detection
backdoor detection
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks
- Mechanistic anomaly detection and ELK
- A gentle introduction to mechanistic anomaly detection
- Concrete empirical research projects in mechanistic anomaly detection
refuse to request & jailbreak
circuit; SAE; steering vector (anti-refusal)
- measures
- (repr engineering) Improving Alignment and Robustness with Circuit Breakers
- (steering) Refusal in Language Models Is Mediated by a Single Direction
- (steering) Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
- (training) Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
- analyses
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Many-shot jailbreaking
- Jailbroken: How Does LLM Safety Training Fail?
- Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
- (vlm) Visual Adversarial Examples Jailbreak Aligned Large Language Models
- (vlm) Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- (vlm) Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
- evaluation & benchmark
- Jailbreak prompts finding on Twitter
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
power-seeking
social injustice
prejudice and discrimination, e.g. gender bias (doctor vs. nurse)
training dynamics; dataset; gradient descent; SAE circuits
deception
dishonesty
reward hacking
measurement tampering
The AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome.
persona drift
other human values
Agency
Alignment theory
Both alignment and interpretability are related to AI safety, so mech interp tools are widely used in alignment research. I'll put some good alignment resources here.
- qualitative work (findings, analyses, concepts, …) *
- alignment representation *
- instrumental convergence *
- shard theory
Proposed by Alex Turner (TurnTrout) and Quintin Pope
🔭 resources
- AI Alignment: A Comprehensive Survey
- Representation Engineering: A Top-Down Approach to AI Transparency
- Mechanistic Interpretability for AI Safety: A Review
Improved Algorithms
Research Limitations
- Current work mainly focuses on Transformer-based models. Is the Transformer an inevitable architecture for generative language models?
- How can we use post-hoc methods as a guide for training a more interpretable and controllable model?
Other Interpretability Fields
Neural network interpretability
Not necessarily a Transformer-based model; it may be an LSTM or simply a toy model.
- Theories for DL
- Feature Learning
- game
- geometry
- BERTology
Other Surveys
- Open Problems in Mechanistic Interpretability
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
- Mechanistic Interpretability for AI Safety: A Review
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
- Towards Uncovering How Large Language Model Works: An Explainability Perspective