Interpretability (& other areas) for Multimodal Models

💡 This post is initially focused on interpretability for multimodal models; papers from other fields are added later for convenience.

Resources
- Interpretability for MLLMs survey
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
- Sparks of Explainability: Recent Advancements in Explaining Large Vision Models
- Awesome LMMs Mechanistic Interpretability

probing
- Probing Multimodal Large Language Models for Global and Local Semantic Representations

representation
- Zoom In: An Introduction to Circuits
- Multimodal Neurons in Artificial Neural Networks
- Interpreting CLIP’s Image Representation via Text-Based Decomposition
- Interpreting the Second-Order Effects of Neurons in CLIP (different layers of CLIP)
- Multimodal Neurons in Pretrained Text-Only Transformers
- Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

circuit
- **(causal tracing) Understanding Information Storage and Transfer in Multi-modal Large Language Models
- Automatic Discovery of Visual Circuits
- Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

SAE
- Case Study: Interpreting, Manipulating, and Controlling CLIP with Sparse Autoencoders
- Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

visualization
- Visualizer! Simplify your Vision Transformer visualization!
- (DVT) Denoising Vision Transformers
- Token Activation Map to Visually Explain Multimodal LLMs
- LVLM-Interpret: An Interpretability Tool for Large Vision Language Models
- Transformer Interpretability Beyond Attention Visualization

others
- **Towards Interpreting Visual Information Processing in Vision-Language Models
- demo (dogit lens)
- Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
- Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
- Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models

tools
- VLM-Lens

information flow
- **Cross-modal Information Flow in Multimodal Large Language Models
- *From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
- *What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
- The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
- Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference

analyses on MLLMs
- Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
- Lost in Embeddings: Information Loss in Vision–Language Models
- Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
- Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
- Vision Transformers Need Registers
- On the Rankability of Visual Embeddings

Other fields of MLLMs
- visual pretraining ...
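Probing, the first technique in the list above, trains a small linear classifier on frozen model activations to test whether a concept is linearly decodable. A minimal sketch, assuming the Hugging Face `transformers` and `scikit-learn` packages, the `openai/clip-vit-base-patch32` checkpoint, and placeholder image paths and labels (none of this is taken from the papers listed):

```python
# Minimal linear-probing sketch on frozen CLIP image features.
# The dataset below is a placeholder; replace with real (image, label) pairs.
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder dataset: (image path, binary concept label), e.g. "contains a dog".
dataset = [("img_0.jpg", 1), ("img_1.jpg", 0)]  # ... more examples in practice

feats, labels = [], []
with torch.no_grad():
    for path, label in dataset:
        inputs = processor(images=Image.open(path), return_tensors="pt")
        # Pooled image embedding from the frozen vision tower.
        feats.append(model.get_image_features(**inputs).squeeze(0))
        labels.append(label)

# Fit a linear probe; its accuracy indicates how linearly accessible
# the concept is in the frozen representation.
X = torch.stack(feats).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```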

February 25, 2025 · 7 min · 3285 words · Sirius

Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks

arXiv (old version): https://arxiv.org/pdf/2502.06106

February 5, 2025 · 1 min · 3 words · Sirius

Possible Research Areas in Mechanistic Interpretability

💡 This post is mainly focused on text models. For multi-modal models, please refer to this post.

Why I Wrote This Blog
To get started in mech interp research, we need a macro-level understanding of the area, so I wrote this blog as a summary of the field to help you (and me) choose a research topic.

Circuit Discovery Methods
- basic: activation patching (causal mediation / interchange interventions…), path patching
- scaling techniques: attribution patching
- DAS (distributed alignment search)
- directional activation patching?

🔭 resources
inspiration
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

what is circuit discovery?
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
- How to use and interpret activation patching

representative work
activation patching
- Investigating gender bias in language models using causal mediation analysis
- (ROME) Locating and Editing Factual Associations in GPT
- Causal Scrubbing: a method for rigorously testing interpretability hypotheses
- (AtP) Attribution patching: Activation patching at industrial scale
- AtP*: An efficient and scalable method for localizing LLM behaviour to components

path patching
- (ACDC) Towards Automated Circuit Discovery for Mechanistic Interpretability
- (EAP) Attribution Patching Outperforms Automated Circuit Discovery
- (EAP-IG) Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms
- Localizing Model Behavior with Path Patching

distributed alignment search
- (DAS) Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

new ...
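Activation patching, the basic method in the list above, swaps an activation cached on a clean prompt into the forward pass of a corrupted prompt and measures how much of the clean behaviour is restored. A minimal sketch, assuming TransformerLens and GPT-2 small; the IOI-style prompts, answer token, and `resid_pre` hook point are illustrative choices, not taken from the papers listed:

```python
# Minimal activation-patching loop over (layer, position) on the residual stream.
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")

clean = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Mary gave a drink to"
answer = model.to_single_token(" Mary")

clean_tokens = model.to_tokens(clean)
corrupt_tokens = model.to_tokens(corrupt)
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook, pos):
    # Overwrite the corrupted run's residual stream at one position
    # with the activation cached from the clean run.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

n_layers, n_pos = model.cfg.n_layers, clean_tokens.shape[1]
restored = torch.zeros(n_layers, n_pos)
for layer in range(n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)
    for pos in range(n_pos):
        logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(hook_name, lambda r, hook, p=pos: patch_resid(r, hook, p))],
        )
        # Logit of the clean answer after patching this (layer, position):
        # a large value means this activation carries the clean behaviour.
        restored[layer, pos] = logits[0, -1, answer].item()
```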

September 6, 2024 · 7 min · 3038 words · Sirius

Exploring Emotional Features in GPT2-Small

🎶 Code in this post can be found in the Jupyter notebook in my “saeExploration” repo.

Find features that reflect positive emotions
To find the features related to a specific emotion, I write five sentences containing the key words for each emotion. For example, for happy emotions I have:

    prompt_happy = ["I'll be on a vacation tomorrow and I'm so happy.",
                    "My mom brings home a new puppy and I'm so happy.",
                    "I'm so glad I got the job I wanted.",
                    "I feel so happy when I'm with my friends.",
                    "I'm so happy I got the promotion I wanted.",]

I choose to look for features that reflect happiness and sadness. Apart from that, I also wonder whether the feature that reflects excitement has something to do with the one that reflects happiness (they are similar at the semantic level, at least). ...
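One way to turn these prompt sets into feature scores is to cache the residual-stream activations for each set, pass them through a pretrained SAE, and rank features by the gap in their mean activation between the two emotions. A minimal sketch, assuming TransformerLens and SAE Lens (using the `from_pretrained` API as of the 2024 releases, with the `gpt2-small-res-jb` release, a layer-7 hook point, and a made-up sad prompt as illustrative choices); the actual code lives in the notebook mentioned above:

```python
# Rank SAE features by how differently they fire on happy vs. sad prompts.
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

HOOK = "blocks.7.hook_resid_pre"  # layer-7 residual stream (illustrative choice)

model = HookedTransformer.from_pretrained("gpt2")
sae, _, _ = SAE.from_pretrained(release="gpt2-small-res-jb", sae_id=HOOK, device="cpu")

prompt_happy = ["I'm so glad I got the job I wanted."]
prompt_sad = ["I lost my keys today and I feel so sad."]  # made-up example

def mean_feature_acts(prompts):
    """Average SAE feature activations over all tokens of all prompts."""
    acts = []
    with torch.no_grad():
        for p in prompts:
            _, cache = model.run_with_cache(p)
            resid = cache[HOOK]                        # [1, seq, d_model]
            acts.append(sae.encode(resid).squeeze(0))  # [seq, d_sae]
    return torch.cat(acts).mean(dim=0)

happy, sad = mean_feature_acts(prompt_happy), mean_feature_acts(prompt_sad)
# Features with the largest happy-minus-sad gap are candidate "happiness"
# features; their meaning can then be checked against feature dashboards.
print(torch.topk(happy - sad, k=10).indices)
```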

August 29, 2024 · 6 min · 1114 words · Sirius

A Brief Introduction to Mechanistic Interpretability Research

⚠️ Warnings
This post was written when I first delved into this area, and it hasn’t been updated for a long time, so it may contain quite a few errors. I’m still interested in interpretability and its applications, and I’ll write something new and interesting later ~

💡 This post is accompanied by another post, which contains the specific content of this area. ...

August 28, 2024 · 15 min · 3160 words · Sirius