Interpretability (& other areas) for Multimodal Models

💡 This post initially focused on interpretability for multimodal models; papers from other fields were added later for convenience.

Methods

Datasets & Benchmarks

general

VQA: Visual Question Answering
- repo
- download
(VQA v2.0) Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
(GQA) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- [website](https://cs.stanford.edu/people/dorarad/gqa/index.html)
(TextVQA) Towards VQA Models That Can Read
- website
- repo
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- repo
MMBench: Is Your Multi-modal Model an All-around Player?
BabyVision: Visual Reasoning Beyond Language

spatial

(ARO) When and why vision-language models behave like bags-of-words, and what to do about it?
(Whatsup) What’s “up” with vision-language models? Investigating their struggle with spatial reasoning
- [repo](https://github.com/amitakamath/whatsup_vlms)
(VSR) Visual Spatial Reasoning
- [repo](https://github.com/cambridgeltl/visual-spatial-reasoning)
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
(SpatialMQA) Can Multimodal Large Language Models Understand Spatial Relations?
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
(COMFORT) Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
(MindCube) Spatial Mental Modeling from Limited Views
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

video

Action100M: A Large-scale Video Action Dataset

hallucination

(POPE) Evaluating Object Hallucination in Large Vision-Language Models
- repo
(CHAIR) Object Hallucination in Image Captioning
- repo
(OpenCHAIR) Mitigating Open-Vocabulary Caption Hallucinations
- website

Models

LLM

foundation models
- survey
  - LLM Architecture Gallery
- general
architecture

MLLM

foundation models
architecture
- ViT
  - repr
    - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
    - Eagle: Exploring the Design Space for Multimodal LLMs with Mixture of Encoders
  - architecture
    - basic
    - others
      - Differentiable Hierarchical Visual Tokenization
      - (Quadtree) Accelerating Vision Transformers with Adaptive Patch Sizes
- VLM
  - arch
  - attention
    - VideoRoPE: What Makes for Good Video Rotary Position Embedding?

self-supervised learning

A Cookbook of Self-Supervised Learning
(SimCLR) A Simple Framework for Contrastive Learning of Visual Representations
(MoCo) Momentum Contrast for Unsupervised Visual Representation Learning
BEiT: BERT Pre-Training of Image Transformers
(MAE) Masked Autoencoders Are Scalable Vision Learners
iBOT: Image BERT Pre-Training with Online Tokenizer
(BYOL) Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
(DINO) Emerging Properties in Self-Supervised Vision Transformers
DINOv2: Learning Robust Visual Features without Supervision
DINOv3
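Self-Supervised Learning: An In-Depth Guide (Table of Contents)
A 10,000-Word Deep Dive into the Full DINO Series: The Peak of Contrastive Learning for Visual Representations
A 10,000-Word Deep Dive into DINO-V3 (Supplement to the Full DINO Series)
DINO & DINO v2: Revolutionizing Self-Supervised Visual Feature Representation Learning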

generative models

The Principles of Diffusion Models
video-generation-survey
Visual Generation: An In-Depth Guide (Table of Contents)
blogs
basic
modules
- VAE
- attention
algorithms
- acceleration
image
video
3D / 4D
- Birth and Death of a Rose
audio
unified generation and comprehension
evaluation
- Common Metrics for Image Generation: IS Score and FID Score

world models

survey
- Awesome World Models
blogs
- A Summary of and Reflections on Recent Developments in 3D/4D World Models (WM)
resources
- stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
  - repo
VLA
JEPA
- blogs
  - JEPA: The "Cake Base" of Autonomous Machine Intelligence: Tracing the Development of JEPA Models (Part 1)
- papers
world simulator
- A Summary
- (LingBot-World) Advancing Open-source World Models
navigation
- Navigation World Models
physical reasoning
- Denoising Hamiltonian Network for Physical Reasoning

Methods

Interpretability for MLLMs

Interpretability for Diffusion Models

Other fields of MLLMs