Why I Write This Blog
To get started in mech interp research, we need a big-picture understanding of the area, so I wrote this blog as a summary of the field to help you (and me) choose a research topic.
Circuit Discovery
Methods
- basic
- activation patching (causal mediation / interchange interventions…; see the sketch after this list)
- path patching
- scaling techniques: attribution patching (first-order sketch after the resources below)
- DAS (distributed alignment search): roughly, activation patching along learned directions?
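As a concrete anchor for the patching family above, here is a minimal, hedged sketch of activation patching on a toy PyTorch model. The `nn.Sequential` stand-in, the hook site, and the normalized metric are my own illustrative choices, not any particular paper's setup; in practice you would hook individual attention heads or MLP outputs of a real transformer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for a transformer; in practice you would hook a real component
# (an attention head's output, an MLP layer, a residual-stream position, ...).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

clean_x = torch.randn(1, 8)    # prompt where the behavior of interest occurs
corrupt_x = torch.randn(1, 8)  # counterfactual prompt that breaks the behavior

def run_with_cache(x):
    """Run the model and cache the activation at the site we want to patch."""
    cache = {}
    handle = model[0].register_forward_hook(lambda m, i, o: cache.update(site=o.detach()))
    out = model(x)
    handle.remove()
    return out, cache

clean_out, clean_cache = run_with_cache(clean_x)
corrupt_out, _ = run_with_cache(corrupt_x)

# Activation patching: rerun the corrupted prompt, but overwrite the hooked site
# with the activation recorded on the clean prompt.
handle = model[0].register_forward_hook(lambda m, i, o: clean_cache["site"])
patched_out = model(corrupt_x)
handle.remove()

# Normalized restoration: 1.0 means the patch fully recovers the clean behavior.
restored = (patched_out - corrupt_out) / (clean_out - corrupt_out + 1e-9)
print(restored.item())
```

In this toy the patched site feeds everything downstream, so restoration is trivially close to 1; on a real model you sweep the patch over components and the ones whose patching restores the behavior form the candidate circuit (path patching additionally restricts the patch to specific downstream routes).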
🔭 resources
- inspiration
- (ROME) Locating and Editing Factual Associations in GPT
- Attribution patching: Activation patching at industrial scale
- (ACDC) Towards Automated Circuit Discovery for Mechanistic Interpretability
- Attribution Patching Outperforms Automated Circuit Discovery
- AtP*: An efficient and scalable method for localizing llm behaviour to components
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses
- new
- Using SAE
- Contextual Decomposition
- Edge Pruning?
- Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning
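For the "scaling techniques: attribution patching" item above, a hedged sketch of the first-order idea: instead of one forward pass per patch site, take gradients on the corrupted run and estimate the effect of patching as (a_clean - a_corrupt) · ∂metric/∂a. The toy model and the `.sum()` metric are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

def cache_act(x, with_grad=False):
    """Cache the activation at the patch site, and optionally its gradient
    with respect to a scalar metric (here simply the model output)."""
    store = {}
    def hook(module, inp, out):
        store["act"] = out
        if with_grad:
            out.retain_grad()
    handle = model[0].register_forward_hook(hook)
    metric = model(x).sum()
    handle.remove()
    if with_grad:
        metric.backward()
        store["grad"] = store["act"].grad
    return store

clean = cache_act(clean_x)
corrupt = cache_act(corrupt_x, with_grad=True)

# First-order estimate of what activation patching would do at this site:
#   delta_metric ≈ (a_clean - a_corrupt) · d(metric)/d(a), gradient on the corrupted run.
# One backward pass gives this estimate for every site at once, which is what
# makes attribution patching cheap at scale.
attribution = ((clean["act"] - corrupt["act"]) * corrupt["grad"]).sum()
print(attribution.item())
```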
Evaluation
lack of ground truth
- human
- interpretability
- automatic
faithfulness (see the sketch below)
how much of the full model's performance the discovered circuit can account for.
completeness
computational efficiency
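A hedged sketch of the faithfulness metric described above, written as a small helper. Normalizing against a fully-ablated baseline is one common convention (the exact recipe varies across papers), and the numbers below are made up for illustration.

```python
def faithfulness(perf_circuit: float, perf_full: float, perf_ablated: float) -> float:
    """Fraction of the full model's performance (above a fully-ablated baseline)
    recovered when only the circuit is kept and everything else is ablated."""
    return (perf_circuit - perf_ablated) / (perf_full - perf_ablated)

# Made-up numbers: full model 0.90, fully-ablated baseline 0.10, circuit-only 0.74.
print(faithfulness(0.74, 0.90, 0.10))  # 0.8: the circuit accounts for 80% of performance
```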
Issues
- ablation methods: dropout is itself a form of ablation, so does zero ablation make sense?
- superposition: do we need the help of SAEs?
Dictionary Learning
SAE
Training and optimization
- proper SAE width
- dead neurons
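A minimal training sketch for the items above (dictionary width, dead neurons), assuming a vanilla ReLU SAE with an L1 penalty on random stand-in activations. The width, learning rate, and L1 coefficient are arbitrary placeholders, and real setups add details such as decoder-norm constraints or resampling of dead features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE: overcomplete dictionary, ReLU features, L1 sparsity penalty."""
    def __init__(self, d_model: int = 64, d_dict: int = 512):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f         # reconstruction, features

torch.manual_seed(0)
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3
fires = torch.zeros(512)              # how often each dictionary feature fires

for step in range(100):               # stand-in loop over streamed activations
    acts = torch.randn(256, 64)       # would be real residual-stream activations
    recon, f = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    fires += (f > 0).float().sum(dim=0)

dead = int((fires == 0).sum())        # "dead neurons": features that never fire
print(f"dead features: {dead} / 512")
```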
🔭 resources
- A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
- Circuit Tracing: Revealing Computational Graphs in Language Models
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
- Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Transcoders Find Interpretable LLM Feature Circuits
- Scaling and evaluating sparse autoencoders
- Improving Dictionary Learning with Gated Sparse Autoencoders
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Evaluation
- human
- auto
🔭 resources
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
- (Anthropic, 2024-8, contrastive eval & sort eval) Interpretability Evals for Dictionary Learning
- (RAVEL) RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- Language models can explain neurons in language models
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Analysis
- A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Applications
- SAE + feature discovery
🔭 resources
- SAE + circuit discovery
🔭 resources
- SAE + explain model components
- SAE + explain model behaviors
- SAE + model steering
Steering vectors
activation steering
- Steering Language Models With Activation Engineering
- Steering Llama 2 via Contrastive Activation Addition
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
- Function vectors in large language models
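A hedged sketch of the contrastive-activation-addition idea from the list above: build a steering vector as the difference of mean activations on two contrasting prompt sets, then add a scaled copy at a chosen layer via a forward hook. The toy model, the random "activations", and the value of `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

# Stand-ins for activations collected at one layer on two contrasting prompt sets
# (e.g. sycophantic vs. non-sycophantic completions).
pos_acts = torch.randn(32, 16) + 1.0
neg_acts = torch.randn(32, 16) - 1.0

steering_vec = pos_acts.mean(0) - neg_acts.mean(0)  # difference of means
alpha = 4.0                                         # steering strength (arbitrary)

def steer(module, inp, out):
    # Add the scaled steering vector to this layer's output at inference time.
    return out + alpha * steering_vec

x = torch.randn(1, 16)
baseline = model(x)
handle = model[0].register_forward_hook(steer)
steered = model(x)
handle.remove()
print((steered - baseline).norm().item())  # the intervention changes the output
```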
feature steering
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Evaluating feature steering: A case study in mitigating social biases
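A hedged sketch of feature steering: clamp a chosen SAE feature to a fixed value and add the decoded difference back into the residual stream. The untrained SAE weights, the feature index, and the clamp value are placeholders; with a trained SAE the clamped feature corresponds to an interpretable concept.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict = 64, 512
enc, dec = nn.Linear(d_model, d_dict), nn.Linear(d_dict, d_model)  # stand-in SAE
feature_idx, clamp_value = 137, 10.0                               # placeholders

def feature_steer(resid: torch.Tensor) -> torch.Tensor:
    """Clamp one SAE feature and add the decoded *difference* back to the
    residual stream, so the SAE's reconstruction error is left untouched."""
    f = torch.relu(enc(resid))
    f_edit = f.clone()
    f_edit[..., feature_idx] = clamp_value
    return resid + (dec(f_edit) - dec(f))

resid = torch.randn(1, d_model)
print((feature_steer(resid) - resid).norm().item())
```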
representation engineering
Others
- LUNAR: LLM Unlearning via Neural Activation Redirection
- Mechanistically Eliciting Latent Behaviors in Language Models
Model Diffing
Stage-wise
- fine-tuning
- fine-tuning interp
- scaling
- Stage-Wise Model Diffing
- Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- Understanding Catastrophic Forgetting in Language Models via Implicit Inference
- Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- intrinsic dimension
Dataset-wise
Algorithm-wise
Representation equivariance
- meta-SNE
- model stitching
- SVCCA and similar methods
- other analyses
- theory
Explain Model Components
explain neurons, attention heads and circuits
Explain neurons
🔭 resources
- Finding Neurons In A Haystack
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language
- Language models can explain neurons in language models
- Multimodal Neurons in Artificial Neural Networks
- Finding Safety Neurons in Large Language Models
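A common first step behind the "explain neurons" resources above is to collect a neuron's maximally-activating examples and then write (or have an LLM write) an explanation of what they share. A hedged toy sketch, with random inputs standing in for a token corpus and a made-up neuron index:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(32, 128), nn.GELU())  # toy stand-in for an MLP layer
neuron_idx = 7                                      # made-up neuron of interest

tokens = [f"tok_{i}" for i in range(1000)]          # stand-in for corpus tokens
inputs = torch.randn(1000, 32)                      # stand-in for their inputs
with torch.no_grad():
    acts = mlp(inputs)[:, neuron_idx]

# Top-k max-activating examples: the raw material for a human- or LLM-written
# natural-language explanation of what the neuron responds to.
top = torch.topk(acts, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{tokens[idx.item()]}: {score.item():.3f}")
```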
Explain attention heads
different heads in one layer vs. heads in different layers -> grammatical vs. semantic features
attention pattern
special heads
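A hedged sketch of the attention-pattern analyses above: score a head against simple templates, e.g. a previous-token head puts its mass on position i-1 and an attention-sink head on the first token. The random pattern below stands in for a head's cached attention probabilities.

```python
import torch

torch.manual_seed(0)
seq_len = 12
# Stand-in for one head's attention probabilities (rows: queries, cols: keys);
# in practice this comes from the model's cached attention patterns.
pattern = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
pattern = torch.tril(pattern)                       # apply the causal mask
pattern = pattern / pattern.sum(-1, keepdim=True)   # renormalize each row

# Previous-token head score: average attention from position i to position i-1.
prev_token_score = pattern.diagonal(-1).mean().item()
# Attention-sink score: average attention paid to the first token.
sink_score = pattern[:, 0].mean().item()
print(f"prev-token: {prev_token_score:.3f}, first-token sink: {sink_score:.3f}")
```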
Explain circuits
understand specific circuits at the subspace level
- (IOI) Interpretability in The Wild: A Circuit For Indirect Object Identification in GPT-2 Small
- What Do the Circuits Mean? A Knowledge Edit View
- (DAS) Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
- LLM Circuit Analyses Are Consistent Across Training and Scale
Explain layernorm
Others
Explain Model Behaviors
Feature representations
- linear representations (a linear-probe sketch follows at the end of this block)
- theory
- multilingual representations
- Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs
- Emerging Cross-lingual Structure in Pretrained Language Models
- Probing the Emergence of Cross-lingual Alignment during LLM Training
- Exploring Alignment in Shared Cross-lingual Spaces
- mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?
- Probing LLMs for Joint Encoding of Linguistic Categories
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure
- multimodal representations
- safety reprs
🔭 resources
- nonlinear representations
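The linear-probe sketch referenced under "linear representations" above: fit a linear classifier on frozen activations and check whether a concept is linearly decodable. The synthetic activations and labels below are constructed so the probe succeeds by design; on real activations you would train on one split and report held-out accuracy.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n = 64, 2000
# Synthetic activations whose label is, by construction, a linear function of them.
concept_dir = torch.randn(d_model)
acts = torch.randn(n, d_model)
labels = (acts @ concept_dir > 0).float()

probe = nn.Linear(d_model, 1)                 # the linear probe
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss_fn(probe(acts).squeeze(-1), labels).backward()
    opt.step()

acc = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean().item()
print(f"probe accuracy: {acc:.2f}")  # ~1.0 here by design; use a held-out split on real data
```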
Model capabilities
training (learning) dynamics
in-context learning
chain of thought (CoT)
how and why step by step?
zero-shot CoT ???
- analyses
- Iteration Head: A Mechanistic Study of Chain-of-Thought
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
- An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs
- Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning
- A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning
- Towards Understanding How Transformer Perform Multi-step Reasoning with Matching Operation
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- unfaithful CoT
- Does the model already know the answer while it is reasoning, or does it really pursue a goal?
- analyses
reasoning
planning
instruction following
- how does reinforcement learning change the inside of a model?
- understand RL at a mechanistic level
- more efficient RLxF
- SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
knowledge
- *Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
- *(Entity Recognition and Hallucinations) On the Biology of a Large Language Model
- *Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
- Dissecting Recall of Factual Associations in Auto-Regressive Language Models
- Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts
- Locating and Editing Factual Associations in GPT
- Characterizing Mechanisms for Factual Recall in Language Models
- Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans
- The Geometry of Concepts: Sparse Autoencoder Feature Structure
memorization & generalization phase transition
learning dynamics
duplication
self-repair
massive activations
Narrow tasks
- counting
- greater-than
- Indirect Object Identification
- gender
Interpretable model structure
also called intrinsic interpretability
Modifying Model Components
Reengineering Model Architecture
(Besides interpretable model architectures, I also list some brand-new architectures here.)
CBM (Concept Bottleneck Models; a minimal sketch follows this list)
Backpack Language Models
others
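The CBM sketch referenced above, as a minimal hedged example: the label head only sees a small vector of predicted concepts, so those concepts can be inspected and intervened on at test time. Layer sizes and names are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Input -> predicted concepts -> label. The label head sees only the small
    concept vector, so concepts can be inspected and edited at test time."""
    def __init__(self, d_in: int = 32, n_concepts: int = 8, n_classes: int = 3):
        super().__init__()
        self.concept_net = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(),
            nn.Linear(64, n_concepts), nn.Sigmoid(),
        )
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        concepts = self.concept_net(x)        # interpretable bottleneck
        return self.label_net(concepts), concepts

model = ConceptBottleneckModel()
logits, concepts = model(torch.randn(4, 32))
print(concepts.shape, logits.shape)
```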
new architectures
- A Path Towards Autonomous Machine Intelligence
- Large Concept Models: Language Modeling in a Sentence Representation Space
- Large Language Diffusion Models
- Fractal Generative Models
- (VLM single transformer) The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
- JEPA
Brain Inspired
- Findings
- Creations
Other Methods & Analysis
decomposing a model
representation
- SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
- (CKA) Similarity of Neural Network Representations Revisited
- Neural Network Representation Metrics, Part 1 (神经网络表征度量(一); Chinese blog post)
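A hedged sketch of linear CKA (from "Similarity of Neural Network Representations Revisited"), which compares two activation matrices after centering; the random matrices below are stand-ins for activations from two layers or two models.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_examples, d)."""
    X = X - X.mean(0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2      # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

torch.manual_seed(0)
A = torch.randn(500, 64)                    # stand-in activations from one layer
Q, _ = torch.linalg.qr(torch.randn(64, 64)) # random orthogonal matrix
print(linear_cka(A, A @ Q))                 # = 1: CKA ignores rotations of the basis
print(linear_cka(A, torch.randn(500, 32)))  # near 0: unrelated representations
```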
Application
AI alignment
3H (helpful, honest, harmless)
Avoid bias and harmful behaviors
concept-based interpretability
representation-based interpretability
red-teaming
perturbations
backdoor detection, red-teaming, capability discovery
anomaly detection
backdoor detection
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks
- Mechanistic anomaly detection and ELK
- A gentle introduction to mechanistic anomaly detection
- Concrete empirical research projects in mechanistic anomaly detection
refuse to request & jailbreak
circuit; SAE; steering vector (anti-refusal)
- measures
- (repr engineering) Improving Alignment and Robustness with Circuit Breakers
- (steering) Refusal in Language Models Is Mediated by a Single Direction
- (steering) Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
- (training) Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
- analyses
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Many-shot jailbreaking
- Jailbroken: How Does LLM Safety Training Fail?
- Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
- (vlm) Visual Adversarial Examples Jailbreak Aligned Large Language Models
- (vlm) Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- (vlm) Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
- evaluation & benchmark
- Jailbreak prompts finding on Twitter
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
power-seeking
social injustice
prejudice and discrimination, e.g. gender bias (doctor vs. nurse)
training dynamics; dataset; gradient descent; SAE circuits
deception
dishonesty
reward hacking
measurement tampering
The AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome.
persona drift
other human values
Agency
Alignment theory
Both alignment and interpretability are related to AI safety, so mech interp tools are widely used in alignment research. I'll put some good alignment resources here.
- qualitative work (findings, analyses, concepts, …) *
- alignment representation *
- instrumental convergence *
- shard theory
Proposed by Alex Turner (TurnTrout) and Quintin Pope
🔭 resources
- AI Alignment: A Comprehensive Survey
- Representation Engineering: A Top-Down Approach to AI Transparency
- Mechanistic Interpretability for AI Safety: A Review
Improved Algorithms
Research Limitations
- Current work mainly focuses on Transformer-based models. Is the Transformer an inevitable architecture for generative language models?
- How can we use post-hoc methods as a guide for training a more interpretable and controllable model?
Other Interpretability Fields
Neural network interpretability
Not necessarily a Transformer-based model; it may be an LSTM or simply a toy model.
- Theories for DL
- Feature Learning
- game
- geometry
- BERTology
Other Surveys
- Open Problems in Mechanistic Interpretability
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
- Mechanistic Interpretability for AI Safety: A Review
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
- Towards Uncovering How Large Language Model Works: An Explainability Perspective