Interpretability for Multimodal Models

💡 This post initially focused on interpretability for multimodal models; many papers from other fields were added later, just for convenience.
Resource
Interpretability for MLLMs
survey:
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
- Sparks of Explainability Recent Advancements in Explaining Large Vision Models
- Awesome LMMs Mechanistic Interpretability
probing:
- Probing Multimodal Large Language Models for Global and Local Semantic Representations
representation:
- Zoom in: An introduction to circuits
- Multimodal Neurons in Artificial Neural Networks
- Interpreting CLIP’s Image Representation via Text-Based Decomposition
- Interpreting the Second-Order Effects of Neurons in CLIP
- Different layers of CLIP
- Multimodal Neurons in Pretrained Text-Only Transformers
circuit:
- **(causal tracing) Understanding Information Storage and Transfer in Multi-modal Large Language Models
- Automatic Discovery of Visual Circuits
- Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
SAE:
- Case study: Interpreting, manipulating, and controlling clip with sparse autoencoders
- Towards multimodal interpretability: Learning sparse interpretable features in vision transformers
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
visualization:
- Visualizer! Simplify your Vision Transformer visualization!
- (DVT) Denoising Vision Transformers
- Token Activation Map to Visually Explain Multimodal LLMs
- LVLM-Intrepret: An Interpretability Tool for Large Vision Language Models
- Transformer Interpretability Beyond Attention Visualization
others:
- **Towards interpreting visual information processing in vision-language models demo
- (dogit lens) Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
information flow:
- **Cross-modal Information Flow in Multimodal Large Language Models
- *From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
- *What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
- The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
- Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
analyses on MLLMs:
- Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
- Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
- Vision Transformers Need Registers
- On the rankability of visual embeddings
Other fields of MLLMs
visual pretraining ...

February 25, 2025 · 5 min · 2272 words · Sirius

Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks

arXiv (old version): https://arxiv.org/pdf/2502.06106

February 5, 2025 · 1 min · 3 words · Sirius

Possible Research Areas in Mechanistic Interpretability

💡 This post is mainly focused on text models. For multi-modal models, please refer to this post.
Why I Wrote This Post
To get started in mech interp research, we need a high-level understanding of the area. So I wrote this post as a summary of the field, to help both you and me choose a research topic.
Circuit Discovery
Methods
basic:
- activation patching (causal mediation/interchange interventions…)
- path patching
scaling techniques:
- attribution patching
- DAS (distributed alignment search)
- directional activation patching?
🔭 resources
inspiration:
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
- (ROME) Locating and Editing Factual Associations in GPT
- Attribution patching: Activation patching at industrial scale
- (ACDC) Towards Automated Circuit Discovery for Mechanistic Interpretability
- Attribution Patching Outperforms Automated Circuit Discovery
- AtP*: An efficient and scalable method for localizing llm behaviour to components
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses
new
Using SAE:
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs
Contextual Decomposition:
- Mechanistic Interpretation through Contextual Decomposition in Transformers
Edge Pruning ?:
- Finding Transformer Circuits with Edge Pruning
- Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning
Evaluation
lack of ground truth ...

September 6, 2024 · 6 min · 2756 words · Sirius

Exploring Emotional Features in GPT2-Small

🎶 Code in this post can be found in the Jupyter notebook in my “saeExploration” repo.
Find features that reflect positive emotions
To find the features related to a specific emotion, I wrote five sentences containing keywords for each emotion. For example, for happy emotions I have:
    prompt_happy = [
        "I'll be on a vacation tomorrow and I'm so happy.",
        "My mom brings home a new puppy and I'm so happy.",
        "I'm so glad I got the job I wanted.",
        "I feel so happy when I'm with my friends.",
        "I'm so happy I got the promotion I wanted.",
    ]
I chose to look for features that reflect happiness and sadness. Apart from that, I also wonder whether the feature that reflects excitement has something to do with the one that reflects happiness (they are similar at the semantic level, at least). ...
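One way to picture the search described above is as a ranking problem: score each feature by how much more it fires on the prompts for one emotion than on the prompts for another. The sketch below is only an illustration under stated assumptions. The activation values are made up, the helper name `rank_features` is hypothetical, and the "mean difference" criterion is an assumption rather than the procedure used in the notebook. In practice each row would hold the SAE feature activations for one prompt.

```python
from statistics import mean


def rank_features(target_acts, baseline_acts, top_k=3):
    """Rank feature indices by (mean activation on target prompts)
    minus (mean activation on baseline prompts), largest first.

    Each argument is a list of rows, one row per prompt, and each row
    holds one activation value per feature. Hypothetical helper.
    """
    n_features = len(target_acts[0])
    scores = []
    for j in range(n_features):
        target_mean = mean(row[j] for row in target_acts)
        baseline_mean = mean(row[j] for row in baseline_acts)
        scores.append((target_mean - baseline_mean, j))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [(j, round(score, 3)) for score, j in scores[:top_k]]


# Made-up activations: 5 prompts x 4 features (NOT real SAE outputs).
happy_acts = [
    [0.0, 2.1, 0.3, 0.0],
    [0.1, 1.8, 0.0, 0.0],
    [0.0, 2.4, 0.2, 0.1],
    [0.0, 1.9, 0.0, 0.0],
    [0.2, 2.2, 0.1, 0.0],
]
sad_acts = [
    [0.0, 0.1, 0.2, 1.7],
    [0.1, 0.0, 0.1, 2.0],
    [0.0, 0.2, 0.3, 1.5],
    [0.0, 0.0, 0.2, 1.9],
    [0.1, 0.1, 0.0, 2.2],
]

# Features that separate "happy" from "sad" prompts, and vice versa.
top_happy = rank_features(happy_acts, sad_acts, top_k=2)
top_sad = rank_features(sad_acts, happy_acts, top_k=1)
print("happy candidates:", top_happy)
print("sad candidates:", top_sad)
```

Contrasting against a second prompt set, rather than taking the raw mean, is a design choice meant to filter out features that fire on almost any sentence; the same helper could be reused to compare excitement prompts with happiness prompts.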

August 29, 2024 · 6 min · 1114 words · Sirius

A Brief Introduction to Mechanistic Interpretability Research

⚠️ Warnings
This post was written when I first delved into this area, and it hasn’t been updated for a long time, so it may contain many errors.
My attitude toward this area has since changed. The area is not well-defined, and most of the research in it is of low quality and does not appeal to me. Besides, I think the study of interpretability should be applied to practical use, though we can also study it for fun.
I’m still interested in interpretability and its applications. I’ll write something new and interesting later ~
💡 This post is accompanied by another post, which contains specific content in this area. ...

August 28, 2024 · 16 min · 3210 words · Sirius