💡 This post was initially focused on interpretability for multimodal models; papers from other fields were added later, purely for convenience.
Resources
Interpretability for MLLMs
- survey
- probing
- representation
- Zoom in: An introduction to circuits
- Multimodal Neurons in Artificial Neural Networks
- Interpreting CLIP’s Image Representation via Text-Based Decomposition
- Interpreting the Second-Order Effects of Neurons in CLIP
- CLIP across different layers
- Multimodal Neurons in Pretrained Text-Only Transformers
- Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
- circuit
- SAE
- visualization
- others
- **Towards interpreting visual information processing in vision-language models
- (logit lens) Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems (see the logit-lens sketch at the end of this list)
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
- Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models
- tools
- information flow
- **Cross-modal Information Flow in Multimodal Large Language Models
- *From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
- *What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
- The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
- Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- analyses on MLLMs
- Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
- Lost in Embeddings: Information Loss in Vision–Language Models
- Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
- Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
- Vision Transformers Need Registers
- On the rankability of visual embeddings
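The "(logit lens)" entry above refers to decoding intermediate hidden states through the model's unembedding matrix. Below is a minimal sketch of the idea on a text-only Hugging Face causal LM; the model name is a placeholder, multimodal backbones need their own input pipeline, and skipping the final layer norm is a known simplification.

```python
# Minimal logit-lens sketch: decode each layer's hidden state of the last token
# through the unembedding matrix. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; swap in the language backbone of your (M)LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):    # embedding output + every block
    logits = model.lm_head(h[:, -1])             # project into the vocabulary
    print(layer, tok.decode(logits.argmax(-1)))  # top-1 token at this depth
```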
Other fields of MLLMs
visual pretraining
spatial
models
analyses
- Understanding How Positional Encodings Work in Transformer Model
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
- (ViT+LLM > ViT) Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
- ! Learning Visual Composition through Improved Semantic Guidance
- (prompt-based) Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
- (prompt-based) Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
- Probing the Role of Positional Information in Vision-Language Models
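Many of the analyses above ("probing", the "(prompt-based)" items) boil down to checking whether spatial information is linearly decodable from frozen embeddings. Here is a minimal linear-probe sketch, assuming features and spatial-relation labels have already been extracted; the file names are hypothetical.

```python
# Linear probe over frozen VLM features; feature/label files are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("features.npy")  # (N, d) frozen embeddings, extracted beforehand
y = np.load("labels.npy")    # (N,) spatial-relation labels, e.g. left/right/above/below

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```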
cognitive linguistics / positional metaphor
REC
attention
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
- MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
- Lost in the middle: How language models use long contexts
- Efficient streaming language models with attention sinks
- (delimiters) Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
- EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
- MagicPIG: LSH Sampling for Efficient LLM Generation
- Label words are anchors: An information flow perspective for understanding in-context learning
- ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
- POS (positional encoding; see the RoPE note after this list)
- "Building a cart behind closed doors": informal thoughts on multimodal approaches (III): positional encoding
- Community contribution | An illustrated guide to RoPE rotary position embedding and its properties
- Base of RoPE Bounds Context Length
- Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
- NoPE
- Transformer Language Models without Positional Encodings Still Learn Positional Information
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings
- The Impact of Positional Encoding on Length Generalization in Transformers
- (NoPE limits) Length Generalization of Causal Transformers without Position Encoding
- Long context
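For the positional-encoding items above, the object of study is usually RoPE: each 2-D slice of the query/key vectors is rotated by an angle proportional to the absolute position, so the attention score depends only on the relative offset. The standard formulation, for head dimension d and base b = 10000, is sketched below.

```latex
\theta_i = b^{-2i/d}, \qquad i = 0, \dots, \tfrac{d}{2} - 1
\qquad
q_m = R_{\Theta, m} W_q x_m, \quad k_n = R_{\Theta, n} W_k x_n
\qquad
q_m^{\top} k_n = (W_q x_m)^{\top} R_{\Theta,\, n-m}\, (W_k x_n)
```

Here $R_{\Theta, m}$ is block-diagonal, built from $2 \times 2$ rotations by $m\theta_i$; the relative-offset identity on the right is the property most of the RoPE analyses above build on.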
hallucination
- survey
- *Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
- Interpreting and editing vision-language representations to mitigate hallucinations
- Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- Debiasing Multimodal Large Language Models
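Several of the training-free mitigation papers above (e.g. "Paying More Attention to Image") start from the same diagnostic: how much attention mass generated tokens put on image tokens versus text tokens. A minimal sketch of that measurement on a cached attention tensor follows; the image-token span and shapes are assumptions for illustration, not any paper's exact recipe.

```python
# Fraction of the last query token's attention that lands on image tokens.
import torch

def image_attention_fraction(attn: torch.Tensor, img_start: int, img_end: int) -> float:
    """attn: (batch, heads, query_len, key_len) weights from one decoder layer."""
    last_q = attn[:, :, -1, :]                        # (batch, heads, key_len)
    img_mass = last_q[..., img_start:img_end].sum(-1)
    return (img_mass / last_q.sum(-1)).mean().item()

# toy example: random weights and a hypothetical image span [5, 581)
attn = torch.softmax(torch.randn(1, 8, 600, 600), dim=-1)
print(image_attention_fraction(attn, img_start=5, img_end=581))
```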
token compression
- survey
- methods
- (CDPruner) Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
- *AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
- (FasterVLM) [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
- (FastV) An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference (see the pruning sketch after this list)
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
- *Inference Optimal VLMs Need Only One Visual Token but Larger Models
- TokenPacker: Efficient Visual Projector for Multimodal LLM
- Matryoshka Multimodal Models
- Matryoshka Query Transformer for Large Vision-Language Models
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
- FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
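Most of the training-free pruning methods above score visual tokens with attention statistics; FastV, for example, ranks image tokens at an early layer by the attention they receive and drops the lowest-ranked fraction. Below is a rough paraphrase of that scoring step on cached attention weights; the shapes and image-token span are assumptions, not the paper's exact implementation.

```python
# Rank image tokens by received attention and keep the top fraction
# (rough paraphrase of FastV-style pruning; shapes/spans are assumptions).
import torch

def keep_top_visual_tokens(attn: torch.Tensor, img_start: int, img_end: int,
                           keep_ratio: float = 0.5) -> torch.Tensor:
    """attn: (heads, seq_len, seq_len) attention weights at the pruning layer."""
    # average attention each image token *receives*, over heads and queries
    received = attn.mean(0)[:, img_start:img_end].mean(0)   # (num_image_tokens,)
    k = max(1, int(keep_ratio * received.numel()))
    kept = received.topk(k).indices + img_start              # back to sequence indices
    return kept.sort().values

attn = torch.softmax(torch.randn(8, 600, 600), dim=-1)       # toy weights
print(keep_top_visual_tokens(attn, img_start=5, img_end=581, keep_ratio=0.25))
```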
active perception (please refer to this post)
Datasets & Benchmarks
general
- VQA: Visual Question Answering
- (VQA v2.0) Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- (GQA) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- [website] https://cs.stanford.edu/people/dorarad/gqa/index.html
- (TextVQA) Towards VQA Models That Can Read
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- MMBench: Is Your Multi-modal Model an All-around Player?
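The VQA, VQA v2.0, and TextVQA entries above report the standard VQA accuracy, which soft-matches a predicted answer against the ten human annotations:

```latex
\mathrm{Acc}(a) = \min\!\left( \frac{\#\{\text{annotators who gave answer } a\}}{3},\; 1 \right)
```

The official evaluation additionally averages this score over every size-9 subset of the ten annotators.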
spatial
- (ARO) When and why vision-language models behave like bags-of-words, and what to do about it?
- (Whatsup) What’s “up” with vision-language models? Investigating their struggle with spatial reasoning
- [repo] https://github.com/amitakamath/whatsup_vlms
- (VSR) Visual Spatial Reasoning
- [repo] https://github.com/cambridgeltl/visual-spatial-reasoning
- SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
- (SpatialMQA) Can Multimodal Large Language Models Understand Spatial Relations?
- SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
- (COMFORT) Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
- (MindCube) Spatial Mental Modeling from Limited Views
- How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
hallucination
- (POPE) Evaluating object hallucination in large vision-language models
- (CHAIR) Object hallucination in image captioning
- (OpenCHAIR) Mitigating Open-Vocabulary Caption Hallucinations
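POPE above frames hallucination evaluation as balanced yes/no polling about object presence, while CHAIR scores generated captions directly against the ground-truth object set:

```latex
\mathrm{CHAIR}_i = \frac{|\{\text{hallucinated object mentions}\}|}{|\{\text{all object mentions}\}|},
\qquad
\mathrm{CHAIR}_s = \frac{|\{\text{captions with at least one hallucinated object}\}|}{|\{\text{all captions}\}|}
```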
Models
llm
vlm
- basic
- (Transformer) Attention is all you need
- (ViT) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- (CLIP) Learning Transferable Visual Models From Natural Language Supervision
- (SigLIP) Sigmoid Loss for Language Image Pre-Training
- (PACL) Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
- (BLIP-2) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- Qwen2.5-VL Technical Report
- (DFN) Data Filtering Networks
- LLaVA series
- (LLaVA) Visual Instruction Tuning
- (LLaVA-1.5) Improved Baselines with Visual Instruction Tuning
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- (InternVL 1.5) How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- (InternVL 2.5) Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
- visual contrastive learning
- (SimCLR) A Simple Framework for Contrastive Learning of Visual Representations
- (MoCo) Momentum Contrast for Unsupervised Visual Representation Learning
- BEiT: BERT Pre-Training of Image Transformers
- (MAE) Masked Autoencoders Are Scalable Vision Learners
- (iBOT) iBOT: Image BERT Pre-Training with Online Tokenizer
- (BYOL) Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
- (DINO) Emerging Properties in Self-Supervised Vision Transformers
- DINOv2: Learning Robust Visual Features without Supervision
- DINOv3
- Self-Supervised Learning: a detailed walkthrough (table of contents)
- A long-form deep dive into the full DINO series: the peak of contrastive learning for visual representations
- A long-form deep dive into DINOv3 (a supplement to the DINO series)
- DINO & DINOv2: revolutionizing self-supervised visual representation learning
- resolution
- new mllm archs
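The contrastive pre-training entries above (CLIP, SigLIP) differ mainly in the loss applied to the batch image-text similarity matrix: CLIP uses a symmetric softmax (InfoNCE) objective, SigLIP an independent sigmoid term per pair. With L2-normalized image/text embeddings $x_i, y_j$, learned logit scale $t$, learned bias $b$, and $z_{ij} = +1$ for matched pairs and $-1$ otherwise:

```latex
\mathcal{L}_{\mathrm{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[
  \log \frac{e^{\,t\, x_i \cdot y_i}}{\sum_{j} e^{\,t\, x_i \cdot y_j}}
+ \log \frac{e^{\,t\, x_i \cdot y_i}}{\sum_{j} e^{\,t\, x_j \cdot y_i}} \right]
\qquad
\mathcal{L}_{\mathrm{SigLIP}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N}
  \log \sigma\!\big( z_{ij}\,( t\, x_i \cdot y_j + b )\big)
```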
generative models
blogs
- A review of the technical evolution of Stable Diffusion
- Understanding Diffusion Models: A Unified Perspective
- A brief discussion of classifier guidance and classifier-free guidance in diffusion models
- The relationships between diffusion models and energy-based models, score matching, SDEs, and ODEs
- A brief analysis of Diffusion Models, Flow Matching, and Rectified Flow
- Sampling in SD3, part 1: Flow Matching
- Sampling in SD3, part 2: Rectified Flow
- An in-depth analysis of Flow Matching
- Flow Matching explained in one article: a diffusion-model variant widely used for denoising in text-to-image and embodied-action models (with a detailed explanation of Rectified Flow)
- Generative models, kept simple | Rectified Flow basics | code
- FLUX.1 overview: the strongest text-to-image model, from the original SD core team
- FLUX.1: principles and source-code walkthrough
- Stable Diffusion 3: paper and source-code overview
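Most of the Flow Matching and Rectified Flow posts above reduce to one objective: interpolate linearly between a data sample and noise and regress the constant velocity of that straight path. With data $x_0$ and noise $x_1 \sim \mathcal{N}(0, I)$ (which endpoint is called noise varies between write-ups):

```latex
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \sim \mathcal{U}[0, 1]
\qquad
\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{x_0, x_1, t}\, \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2
```

Sampling then integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t)$ with an ODE solver, which is where the SD3 and FLUX.1 discussions above pick up.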
basic
- (VAE) Auto-Encoding Variational Bayes
- (VQGAN) Taming transformers for high-resolution image synthesis
- (LlamaGen) Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
- (DDPM) Denoising Diffusion Probabilistic Models
- (DDIM) Denoising Diffusion Implicit Models
- (Classifier-guided) Diffusion Models Beat GANs on Image Synthesis
- (Classifier-free) Classifier-free diffusion guidance
- Generative Modeling by Estimating Gradients of the Data Distribution
- Score-Based Generative Modeling through Stochastic Differential Equations
- (EDM) Elucidating the Design Space of Diffusion-Based Generative Models
- (GLIDE) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- (Imagen) Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- (DALL-E) Zero-Shot Text-to-Image Generation
- (DALL-E-2) Hierarchical Text-Conditional Image Generation with CLIP Latents
- (DALL-E-3) Improving Image Generation with Better Captions
- (LDM) High-resolution image synthesis with latent diffusion models
- (ControlNet) Adding Conditional Control to Text-to-Image Diffusion Models
- (UViT) All are Worth Words: A ViT Backbone for Diffusion Models
- (DiT) Scalable Diffusion Models with Transformers
- Consistency Models
- (LCM) Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- (SDXL Turbo) Adversarial Diffusion Distillation
- Flow Matching for Generative Modeling
- Flow Matching Guide and Code
- (Rectified Flow) Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
- (Stable Diffusion 3, MMDiT) Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- (VAR) Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
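As a compact reference for the diffusion entries above: DDPM trains a noise predictor against a closed-form forward corruption, and classifier-free guidance mixes conditional and unconditional predictions at sampling time (one common parameterization, with guidance scale $w$; $w = 1$ recovers the purely conditional prediction):

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),
\qquad
\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{x_0, \epsilon, t}\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2
\qquad
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \big( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \big)
```

Here $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$ for the noise schedule $\beta_s$.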
modules
- VAE
- (VA-VAE) Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- (REPA) Representation alignment for generation: Training diffusion transformers is easier than you think
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
- Diffusion Transformers with Representation Autoencoders
- attention
- VAE
image
- models
- image editing
- Prompt-to-Prompt Image Editing with Cross Attention Control
- InstructPix2Pix: Learning to Follow Image Editing Instructions
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
- Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry
- Diffusion Model-Based Image Editing: A Survey
video
- survey
- blogs
- models
- VideoGPT: Video Generation using VQ-VAE and Transformers
- ViViT: A Video Vision Transformer
- Video Diffusion Models
- Imagen Video: High Definition Video Generation with Diffusion Models
- Make-A-Video: Text-to-Video Generation without Text-Video Data
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- (Gen-1) Structure and Content-Guided Video Synthesis with Diffusion Models
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling
- Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
- (PixelDance) Make Pixels Dance: High-Dynamic Video Generation
- VideoPoet: A Large Language Model for Zero-Shot Video Generation
- (Video-LDM) Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
- (SVD) Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
- Lumiere: A Space-Time Diffusion Model for Video Generation
- Latte: Latent Diffusion Transformer for Video Generation
- (Sora) Video generation models as world simulators
- Open-Sora: Democratizing Efficient Video Production for All
- Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- LongCat-Video Technical Report
- (Veo3 test) Video models are zero-shot learners and reasoners
- long video
- diffusion forcing
- video editing
- RL
- stylization
audio
unified generation and comprehension
- survey
- models
- unified
- separate
- Janus-Pro
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
- (MetaQuery) Transfer between Modalities with MetaQueries
- (BAGEL) Emerging Properties in Unified Multimodal Pretraining
- Emu: Generative Pretraining in Multimodality
- (Emu2) Generative Multimodal Models are In-Context Learners
- Emu3: Next-Token Prediction is All You Need
- NExT-GPT: Any-to-Any Multimodal LLM
- other relevant papers
- blogs
evaluation
world models
blogs
VLA
JEPA
world simulator
“Embodied AI”
navigation
physical reasoning