Why I Write This Blog

   To get started in mech interp research, we first need a big-picture understanding of the area. I wrote this blog as a summary of the field, to help you (and me) choose a research topic.

Circuit Discovery

Methods

Evaluation

Lack of ground truth: discovered circuits cannot be checked against a known correct answer, so evaluation has to rely on proxies (e.g., how faithfully the model's behavior is preserved when everything outside the circuit is ablated).

Issues

  • ablation methods: dropout is itself a form of ablation, so does zero ablation produce faithful results? (See the sketch after this list.)
  • superposition: features are packed into overlapping directions, so do we need the help of SAEs to pull them apart?
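
To make the ablation question concrete, here is a minimal sketch of zero vs. mean ablation implemented as a PyTorch forward hook; `model`, `layer`, and `mean_act` are illustrative placeholders, not a fixed API.

```python
import torch

def make_ablation_hook(mode, mean_act=None):
    """Return a forward hook that overwrites a component's output."""
    def hook(module, inputs, output):
        if mode == "zero":
            # zero ablation: knock the component out entirely
            return torch.zeros_like(output)
        if mode == "mean":
            # mean ablation: replace with a dataset-mean activation,
            # which keeps the model closer to its usual operating regime
            return mean_act.expand_as(output)
        return output
    return hook

# Illustrative usage: compare model behavior under each ablation mode.
# handle = layer.register_forward_hook(make_ablation_hook("zero"))
# logits_zero = model(input_ids).logits
# handle.remove()
```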

Dictionary Learning

SAE (Sparse Autoencoder)

Training and optimization

  • choosing a proper SAE width (dictionary size); see the sketch after this list
  • dead neurons (latents that stop activating during training)
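
A minimal SAE sketch in PyTorch, showing where width and dead neurons enter the picture; the expansion factor and L1 coefficient are illustrative hyperparameters, not recommended values.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary trained with an L1 sparsity penalty."""
    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        d_sae = d_model * expansion          # SAE width: a key tunable choice
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coef: float = 1e-3):
    # reconstruction error + sparsity penalty (l1_coef is illustrative)
    return (x - x_hat).pow(2).mean() + l1_coef * f.abs().sum(-1).mean()

# Dead-neuron check: features that never fire on a batch of activations.
# dead = ~(f > 0).any(dim=0)
```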

🔭 resources

Evaluation

  • human evaluation
  • automated evaluation (e.g., using an LLM to score feature explanations)

🔭 resources

Analysis

Applications

Steering vectors

activation steering
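
A minimal sketch of activation steering, assuming PyTorch hooks; `layer`, `steer_vec`, and the scale `alpha` are illustrative. A common recipe derives `steer_vec` from the difference of mean activations on contrastive prompt pairs.

```python
import torch

def make_steering_hook(steer_vec: torch.Tensor, alpha: float = 4.0):
    """Add a fixed direction to a layer's output at every position."""
    def hook(module, inputs, output):
        # note: for layers that return tuples, steer output[0] instead
        return output + alpha * steer_vec
    return hook

# Illustrative usage during generation:
# handle = layer.register_forward_hook(make_steering_hook(steer_vec))
# out = model.generate(prompt_ids)
# handle.remove()
```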

feature steering

representation engineering

Others

Model Diffing

Stage-wise

Dataset-wise

Algorithm-wise

Representation equivariance

Explain Model Components

i.e., explaining neurons, attention heads, and circuits

Explain neurons

🔭 resources

Explain attention heads

Different heads within one layer, and heads at different layers, tend to capture different kinds of features (e.g., grammatical vs. semantic); a quick way to eyeball this is shown below.
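
A quick, hedged sketch for inspecting per-head attention patterns with a HuggingFace model; the model name and the layer/head indices are arbitrary examples.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

ids = tok("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_attentions=True)

# out.attentions: one tensor per layer, each (batch, heads, seq, seq)
layer, head = 0, 3                      # arbitrary example indices
pattern = out.attentions[layer][0, head]
print(tok.convert_ids_to_tokens(ids["input_ids"][0]))
print(pattern.round(decimals=2))
```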

Explain circuits

understanding specific circuits at the subspace level, i.e., in terms of the residual-stream directions they read from and write to

Explain layernorm

Others

Explain Model Behaviors

Feature representations

Model capabilities

training (learning) dynamics

Narrow tasks

Interpretable model structure

also called intrinsic interpretability

Modifying Model Components

Reengineering Model Architecture

(Besides interpretable model architectures, I also list some brand-new architectures.)

Brain Inspired

Other Methods & Analysis

decomposing a model

representation

Application

AI alignment

  • 3H (helpful, honest, harmless): avoiding bias and harmful behaviors
  • concept-based interpretability / representation-based interpretability (see the probe sketch after this list)
  • red-teaming perturbations
  • backdoor detection, red-teaming, capability discovery
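
A minimal sketch of a concept probe, one flavor of concept-/representation-based interpretability: fit a linear classifier on cached hidden states to test whether a concept (e.g., "harmful request") is linearly decodable. `acts` and `labels` are assumed to be pre-extracted activations and binary labels.

```python
import torch
import torch.nn as nn

def train_probe(acts: torch.Tensor, labels: torch.Tensor, epochs: int = 200):
    """Logistic-regression probe on hidden activations (illustrative)."""
    probe = nn.Linear(acts.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

# Accuracy check (illustrative):
# preds = probe(acts).squeeze(-1) > 0
# acc = (preds == labels.bool()).float().mean()
```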

🔭 resources

Improved Algorithms

Research Limitations

  • Current work mainly focuses on Transformer-based models. Is the Transformer an inevitable architecture for generative language models?
  • How can we use post-hoc methods as a guide for training more interpretable and controllable models?

Other Interpretability Fields

Neural network interpretability

Not necessarily a Transformer-based model; it might be an LSTM or simply a toy model.

Other Surveys