💡 This post is mainly focused on text models. For multi-modal models, please refer to this post.

Why I Wrote This Post

   To get started in mech interp research, we first need a high-level picture of the field. I wrote this post as a summary of the area to help both you and me choose a research topic.

Circuit Discovery

Methods

Evaluation

lack of ground truth

Issues

  • ablation methods: dropout is itself a kind of ablation, so is zero ablation still a valid test?
  • superposition: do we need the help of SAEs?
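
To make the ablation question concrete, here is a minimal sketch on a toy linear layer. Everything in it is an assumption for illustration: the data is synthetic, the names `ablate` and `effect` are hypothetical, and mean ablation is swapped in as one common alternative to zero ablation rather than something this outline prescribes. It shows that when a unit sits far from zero, zero ablation mostly measures how far off-distribution we pushed the model, not how much information the unit carried.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layer": 256 inputs, 4 hidden units, linear readout to a scalar.
# Unit 0 has a large constant offset plus a small input-dependent signal.
n, d_hidden = 256, 4
hidden = rng.normal(size=(n, d_hidden))
hidden[:, 0] = 5.0 + 0.1 * hidden[:, 0]
readout = np.array([1.0, 0.5, -0.5, 0.25])


def ablate(acts, unit, mode):
    """Return a copy of acts with one unit set to 0 or to its dataset mean."""
    out = acts.copy()
    if mode == "zero":
        out[:, unit] = 0.0
    elif mode == "mean":
        out[:, unit] = acts[:, unit].mean()
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


def effect(acts, unit, mode):
    """Mean absolute change in the readout caused by the ablation."""
    clean = acts @ readout
    ablated = ablate(acts, unit, mode) @ readout
    return float(np.abs(clean - ablated).mean())


zero_effect = effect(hidden, unit=0, mode="zero")
mean_effect = effect(hidden, unit=0, mode="mean")
print(f"zero ablation effect: {zero_effect:.3f}")
print(f"mean ablation effect: {mean_effect:.3f}")
```

In this toy setup the zero-ablation effect is dominated by the constant offset, while the mean-ablation effect reflects only the input-dependent part of the unit. Real models are nonlinear, so treat this as intuition only.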

Resources

Dictionary Learning

SAE

Training and optimization

  • proper SAE width
  • dead neurons
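
As a rough illustration of the two issues above, the sketch below builds an untrained ReLU sparse autoencoder in NumPy and measures the fraction of dead latents, meaning latents that never fire on a batch of activations. The expansion factor, the L1 coefficient, the function names, and the trick of forcing some latents dead with a very negative bias are all assumptions made for the example, not settings taken from any particular paper or library.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16
expansion = 4                # hypothetical expansion factor
d_sae = expansion * d_model  # SAE width (number of latents)

# Randomly initialised, untrained SAE parameters.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

# Assumption for the demo: make the last 8 latents dead on purpose
# by giving them a bias they can never overcome.
b_enc[-8:] = -100.0


def sae_forward(x):
    """ReLU encoder followed by a linear decoder."""
    latents = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    recon = latents @ W_dec + b_dec
    return latents, recon


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the latents."""
    latents, recon = sae_forward(x)
    mse = ((recon - x) ** 2).sum(axis=-1).mean()
    l1 = np.abs(latents).sum(axis=-1).mean()
    return float(mse + l1_coeff * l1)


def dead_fraction(x):
    """Fraction of latents that are zero on every example in x."""
    latents, _ = sae_forward(x)
    fired = (latents > 0).any(axis=0)
    return float(1.0 - fired.mean())


acts = rng.normal(size=(1024, d_model))  # stand-in for model activations
print(f"width: {d_sae}, loss: {sae_loss(acts):.3f}")
print(f"dead latent fraction: {dead_fraction(acts):.3f}")
```

When comparing widths, one would track this dead fraction alongside reconstruction loss during training; a wider SAE that mostly adds dead latents is not buying much.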

🔭 resources

Evaluation

  • human
  • auto

🔭 resources

Analysis

Applications

Steering vectors

activation steering
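
A minimal sketch of one common recipe, assuming the steering vector is the difference of mean activations between two contrastive prompt sets. The activations here are synthetic stand-ins, and `steering_vector`, `steer`, and `alpha` are hypothetical names for this example; in practice the activations would come from a chosen layer of a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Synthetic stand-ins for activations at one layer.
# Assumption: the concept of interest lies along the first coordinate.
concept_dir = np.zeros(d_model)
concept_dir[0] = 1.0
pos_acts = rng.normal(size=(200, d_model)) + 2.0 * concept_dir  # concept present
neg_acts = rng.normal(size=(200, d_model)) - 2.0 * concept_dir  # concept absent


def steering_vector(pos, neg):
    """Difference of mean activations between the two prompt sets."""
    return pos.mean(axis=0) - neg.mean(axis=0)


def steer(acts, vector, alpha=1.0):
    """Add the scaled steering vector to every activation."""
    return acts + alpha * vector


v = steering_vector(pos_acts, neg_acts)
steered = steer(neg_acts, v, alpha=1.0)

before = float((neg_acts @ concept_dir).mean())
after = float((steered @ concept_dir).mean())
print(f"mean projection on concept direction: {before:.2f} -> {after:.2f}")
```

The scale `alpha` is the main knob: too small and nothing changes, too large and the activations leave the range the model was trained on.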

feature steering

representation engineering

Others

Model Diffing

Stage-wise

Dataset-wise

Algorithm-wise

Representation equivariance

Explain Model Components

explain neurons, attention heads and circuits

Explain neurons

🔭 resources

Explain attention heads

different heads in one layer, or heads in different layers -> grammatical/semantic features

Explain circuits

understand specific circuits on the subspace level

Explain layernorm

Others

Explain Model Behaviors

Representations

Model capabilities

training (learning) dynamics

Narrow tasks

Concept-based interpretability

Intrinsic Interpretability

Modifying Model Components

Reengineering Model Architecture

(Besides interpretable model architectures, I also list some brand-new architectures.)

Brain Inspired

Other Methods & Analysis

decomposing a model

Interesting

Applications

AI alignment

  • 3H (helpful, honest, harmless): avoid bias and harmful behaviors
  • concept-based interpretability, representation-based interpretability
  • red-teaming perturbations
  • backdoor detection, red-teaming, capability discovery

🔭 resources

Improved Algorithms

Research Limitations

Other Interpretability Fields

Neural network interpretability

Not necessarily a transformer-based model; it may be an LSTM or simply a toy model.

Other Surveys

Math

Blogs