💡 This post was initially focused on interpretability for multimodal models; papers from other fields were added later, purely for convenience.
Resources
Interpretability for MLLMs
- survey
- probing
- representation
- Zoom in: An introduction to circuits
- Multimodal Neurons in Artificial Neural Networks
- Interpreting CLIP’s Image Representation via Text-Based Decomposition
- Interpreting the Second-Order Effects of Neurons in CLIP
- CLIP across different layers
- Multimodal Neurons in Pretrained Text-Only Transformers
- Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
- circuit
- SAE
- visualization
- others
- **Towards interpreting visual information processing in vision-language models
- (logit lens) Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems (see the logit-lens sketch at the end of this list)
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
- Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models
- tools
- information flow
- **Cross-modal Information Flow in Multimodal Large Language Models
- *From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
- *What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
- The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
- Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- analyses on MLLMs
- Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
- Lost in Embeddings: Information Loss in Vision–Language Models
- Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
- Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
- Vision Transformers Need Registers
- On the rankability of visual embeddings
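The "(logit lens)" entry above refers to decoding intermediate hidden states through the model's unembedding matrix. Below is a minimal sketch of the idea on a text-only Hugging Face causal LM; the model name is a placeholder, multimodal backbones need their own input pipeline, and skipping the final layer norm is a known simplification.

```python
# Minimal logit-lens sketch: decode each layer's hidden state of the last token
# through the unembedding matrix. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; swap in the language backbone of your (M)LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):    # embedding output + every block
    logits = model.lm_head(h[:, -1])             # project into the vocabulary
    print(layer, tok.decode(logits.argmax(-1)))  # top-1 token at this depth
```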
Other fields of MLLMs
visual pretraining
spatial
models
analyses
- Understanding How Positional Encodings Work in Transformer Model
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
- (ViT+LLM > ViT) Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
- ! Learning Visual Composition through Improved Semantic Guidance
- (prompt-based) Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
- (prompt-based) Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
- Probing the Role of Positional Information in Vision-Language Models
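Many of the analyses above ("probing", the "(prompt-based)" items) boil down to checking whether spatial information is linearly decodable from frozen embeddings. Here is a minimal linear-probe sketch, assuming features and spatial-relation labels have already been extracted; the file names are hypothetical.

```python
# Linear probe over frozen VLM features; feature/label files are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("features.npy")  # (N, d) frozen embeddings, extracted beforehand
y = np.load("labels.npy")    # (N,) spatial-relation labels, e.g. left/right/above/below

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```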
cognitive linguistics / positional metaphor
REC
attention
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
- MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
- Lost in the middle: How language models use long contexts
- Efficient streaming language models with attention sinks
- (delimiters) Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
- EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
- MagicPIG: LSH Sampling for Efficient LLM Generation
- Label words are anchors: An information flow perspective for understanding in-context learning
- ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
- POS (positional encoding; see the RoPE note after this list)
- "Building a cart behind closed doors": informal thoughts on multimodal approaches (III): positional encoding
- Community contribution | An illustrated guide to RoPE rotary position embedding and its properties
- Base of RoPE Bounds Context Length
- Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
- NoPE
- Transformer Language Models without Positional Encodings Still Learn Positional Information
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings
- The Impact of Positional Encoding on Length Generalization in Transformers
- (NoPE limits) Length Generalization of Causal Transformers without Position Encoding
- Long context
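For the positional-encoding items above, the object of study is usually RoPE: each 2-D slice of the query/key vectors is rotated by an angle proportional to the absolute position, so the attention score depends only on the relative offset. The standard formulation, for head dimension d and base b = 10000, is sketched below.

```latex
\theta_i = b^{-2i/d}, \qquad i = 0, \dots, \tfrac{d}{2} - 1
\qquad
q_m = R_{\Theta, m} W_q x_m, \quad k_n = R_{\Theta, n} W_k x_n
\qquad
q_m^{\top} k_n = (W_q x_m)^{\top} R_{\Theta,\, n-m}\, (W_k x_n)
```

Here $R_{\Theta, m}$ is block-diagonal, built from $2 \times 2$ rotations by $m\theta_i$; the relative-offset identity on the right is the property most of the RoPE analyses above build on.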
hallucination
- survey
- *Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
- Interpreting and editing vision-language representations to mitigate hallucinations
- Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- Debiasing Multimodal Large Language Models
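Several of the training-free mitigation papers above (e.g. "Paying More Attention to Image") start from the same diagnostic: how much attention mass generated tokens put on image tokens versus text tokens. A minimal sketch of that measurement on a cached attention tensor follows; the image-token span and shapes are assumptions for illustration, not any paper's exact recipe.

```python
# Fraction of the last query token's attention that lands on image tokens.
import torch

def image_attention_fraction(attn: torch.Tensor, img_start: int, img_end: int) -> float:
    """attn: (batch, heads, query_len, key_len) weights from one decoder layer."""
    last_q = attn[:, :, -1, :]                        # (batch, heads, key_len)
    img_mass = last_q[..., img_start:img_end].sum(-1)
    return (img_mass / last_q.sum(-1)).mean().item()

# toy example: random weights and a hypothetical image span [5, 581)
attn = torch.softmax(torch.randn(1, 8, 600, 600), dim=-1)
print(image_attention_fraction(attn, img_start=5, img_end=581))
```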
token compression
- survey
- methods
- (CDPruner) Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
- *AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
- (FasterVLM) [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
- (FastV) An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference (see the pruning sketch after this list)
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
- *Inference Optimal VLMs Need Only One Visual Token but Larger Models
- TokenPacker: Efficient Visual Projector for Multimodal LLM
- Matryoshka Multimodal Models
- Matryoshka Query Transformer for Large Vision-Language Models
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
- FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
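Most of the training-free pruning methods above score visual tokens with attention statistics; FastV, for example, ranks image tokens at an early layer by the attention they receive and drops the lowest-ranked fraction. Below is a rough paraphrase of that scoring step on cached attention weights; the shapes and image-token span are assumptions, not the paper's exact implementation.

```python
# Rank image tokens by received attention and keep the top fraction
# (rough paraphrase of FastV-style pruning; shapes/spans are assumptions).
import torch

def keep_top_visual_tokens(attn: torch.Tensor, img_start: int, img_end: int,
                           keep_ratio: float = 0.5) -> torch.Tensor:
    """attn: (heads, seq_len, seq_len) attention weights at the pruning layer."""
    # average attention each image token *receives*, over heads and queries
    received = attn.mean(0)[:, img_start:img_end].mean(0)   # (num_image_tokens,)
    k = max(1, int(keep_ratio * received.numel()))
    kept = received.topk(k).indices + img_start              # back to sequence indices
    return kept.sort().values

attn = torch.softmax(torch.randn(8, 600, 600), dim=-1)       # toy weights
print(keep_top_visual_tokens(attn, img_start=5, img_end=581, keep_ratio=0.25))
```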
active perception (please refer to this post)
Datasets & Benchmarks
general
- VQA: Visual Question Answering
- (VQA v2.0) Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- (GQA) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- [website] https://cs.stanford.edu/people/dorarad/gqa/index.html
- (TextVQA) Towards VQA Models That Can Read
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- MMBench: Is Your Multi-modal Model an All-around Player?
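The VQA, VQA v2.0, and TextVQA entries above report the standard VQA accuracy, which soft-matches a predicted answer against the ten human annotations:

```latex
\mathrm{Acc}(a) = \min\!\left( \frac{\#\{\text{annotators who gave answer } a\}}{3},\; 1 \right)
```

The official evaluation additionally averages this score over every size-9 subset of the ten annotators.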
spatial
- (ARO) When and why vision-language models behave like bags-of-words, and what to do about it?
- (Whatsup) What’s “up” with vision-language models? Investigating their struggle with spatial reasoning
- [repo] https://github.com/amitakamath/whatsup_vlms
- (VSR) Visual Spatial Reasoning
- [repo] https://github.com/cambridgeltl/visual-spatial-reasoning
- SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
- (SpatialMQA) Can Multimodal Large Language Models Understand Spatial Relations?
- SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
- (COMFORT) Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
- (MindCube) Spatial Mental Modeling from Limited Views
- How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
hallucination
- (POPE) Evaluating object hallucination in large vision-language models
- (CHAIR) Object hallucination in image captioning
- (OpenCHAIR) Mitigating Open-Vocabulary Caption Hallucinations
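POPE above frames hallucination evaluation as balanced yes/no polling about object presence, while CHAIR scores generated captions directly against the ground-truth object set:

```latex
\mathrm{CHAIR}_i = \frac{|\{\text{hallucinated object mentions}\}|}{|\{\text{all object mentions}\}|},
\qquad
\mathrm{CHAIR}_s = \frac{|\{\text{captions with at least one hallucinated object}\}|}{|\{\text{all captions}\}|}
```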
Models
llm
vlm
- basic
- (Transformer) Attention is all you need
- (ViT) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- (CLIP) Learning Transferable Visual Models From Natural Language Supervision
- (SigLIP) Sigmoid Loss for Language Image Pre-Training
- (PACL) Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
- (BLIP-2) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- Qwen2.5-VL Technical Report
- (DFN) Data Filtering Networks
- LLaVA series
- (LLaVA) Visual Instruction Tuning
- (LLaVA-1.5) Improved Baselines with Visual Instruction Tuning
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- (InternVL 1.5) How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- (InternVL 2.5) Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
- visual contrastive learning
- (SimCLR) A Simple Framework for Contrastive Learning of Visual Representations
- (MoCo) Momentum Contrast for Unsupervised Visual Representation Learning
- BEiT: BERT Pre-Training of Image Transformers
- (MAE) Masked Autoencoders Are Scalable Vision Learners
- (iBOT) iBOT: Image BERT Pre-Training with Online Tokenizer
- (BYOL) Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
- (DINO) Emerging Properties in Self-Supervised Vision Transformers
- DINOv2: Learning Robust Visual Features without Supervision
- DINOv3
- Self-Supervised Learning: a detailed walkthrough (table of contents)
- A long-form deep dive into the full DINO series: the peak of contrastive learning for visual representations
- A long-form deep dive into DINOv3 (a supplement to the DINO series)
- DINO & DINOv2: revolutionizing self-supervised visual representation learning
- resolution
- new mllm archs
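The contrastive pre-training entries above (CLIP, SigLIP) differ mainly in the loss applied to the batch image-text similarity matrix: CLIP uses a symmetric softmax (InfoNCE) objective, SigLIP an independent sigmoid term per pair. With L2-normalized image/text embeddings $x_i, y_j$, learned logit scale $t$, learned bias $b$, and $z_{ij} = +1$ for matched pairs and $-1$ otherwise:

```latex
\mathcal{L}_{\mathrm{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[
  \log \frac{e^{\,t\, x_i \cdot y_i}}{\sum_{j} e^{\,t\, x_i \cdot y_j}}
+ \log \frac{e^{\,t\, x_i \cdot y_i}}{\sum_{j} e^{\,t\, x_j \cdot y_i}} \right]
\qquad
\mathcal{L}_{\mathrm{SigLIP}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N}
  \log \sigma\!\big( z_{ij}\,( t\, x_i \cdot y_j + b )\big)
```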
generative models
blogs
- A review of the technical evolution of Stable Diffusion
- Understanding Diffusion Models: A Unified Perspective
- A brief discussion of classifier guidance and classifier-free guidance in diffusion models
- The relationships between diffusion models and energy-based models, score matching, SDEs, and ODEs
- A brief analysis of Diffusion Models, Flow Matching, and Rectified Flow
- Sampling in SD3, part 1: Flow Matching
- Sampling in SD3, part 2: Rectified Flow
- An in-depth analysis of Flow Matching
- Flow Matching explained in one article: a diffusion-model variant widely used for denoising in text-to-image and embodied-action models (with a detailed explanation of Rectified Flow)
- Generative models, kept simple | Rectified Flow basics | code
- FLUX.1 overview: the strongest text-to-image model, from the original SD core team
- FLUX.1: principles and source-code walkthrough
- Stable Diffusion 3: paper and source-code overview
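Most of the Flow Matching and Rectified Flow posts above reduce to one objective: interpolate linearly between a data sample and noise and regress the constant velocity of that straight path. With data $x_0$ and noise $x_1 \sim \mathcal{N}(0, I)$ (which endpoint is called noise varies between write-ups):

```latex
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \sim \mathcal{U}[0, 1]
\qquad
\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{x_0, x_1, t}\, \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2
```

Sampling then integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t)$ with an ODE solver, which is where the SD3 and FLUX.1 discussions above pick up.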
basic
- (VAE) Auto-Encoding Variational Bayes
- (VQGAN) Taming transformers for high-resolution image synthesis
- (LlamaGen) Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
- (DDPM) Denoising Diffusion Probabilistic Models
- (DDIM) Denoising Diffusion Implicit Models
- (Classifier-guided) Diffusion Models Beat GANs on Image Synthesis
- (Classifier-free) Classifier-free diffusion guidance
- Generative Modeling by Estimating Gradients of the Data Distribution
- Score-Based Generative Modeling through Stochastic Differential Equations
- (EDM) Elucidating the Design Space of Diffusion-Based Generative Models
- (GLIDE) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- (Imagen) Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- (DALL-E) Zero-Shot Text-to-Image Generation
- (DALL-E-2) Hierarchical Text-Conditional Image Generation with CLIP Latents
- (DALL-E-3) Improving Image Generation with Better Captions
- (LDM) High-resolution image synthesis with latent diffusion models
- (ControlNet) Adding Conditional Control to Text-to-Image Diffusion Models
- (UViT) All are Worth Words: A ViT Backbone for Diffusion Models
- (DiT) Scalable Diffusion Models with Transformers
- Consistency Models
- (LCM) Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- (SDXL Turbo) Adversarial Diffusion Distillation
- Flow Matching for Generative Modeling
- Flow Matching Guide and Code
- (Rectified Flow) Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
- (Stable Diffusion 3, MMDiT) Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- (VAR) Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
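As a compact reference for the diffusion entries above: DDPM trains a noise predictor against a closed-form forward corruption, and classifier-free guidance mixes conditional and unconditional predictions at sampling time (one common parameterization, with guidance scale $w$; $w = 1$ recovers the purely conditional prediction):

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),
\qquad
\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{x_0, \epsilon, t}\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2
\qquad
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \big( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \big)
```

Here $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$ for the noise schedule $\beta_s$.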
modules
- VAE
- (VA-VAE) Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- (REPA) Representation alignment for generation: Training diffusion transformers is easier than you think
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
- Diffusion Transformers with Representation Autoencoders
- attention
- VAE
image
- models
- image editing
- Prompt-to-Prompt Image Editing with Cross Attention Control
- InstructPix2Pix: Learning to Follow Image Editing Instructions
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
- Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry
- Diffusion Model-Based Image Editing: A Survey
video
- survey
- blogs
- models
- VideoGPT: Video Generation using VQ-VAE and Transformers
- ViViT: A Video Vision Transformer
- Video Diffusion Models
- Imagen Video: High Definition Video Generation with Diffusion Models
- Make-A-Video: Text-to-Video Generation without Text-Video Data
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- (Gen-1) Structure and Content-Guided Video Synthesis with Diffusion Models
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling
- Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
- (PixelDance) Make Pixels Dance: High-Dynamic Video Generation
- VideoPoet: A Large Language Model for Zero-Shot Video Generation
- (Video-LDM) Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
- (SVD) Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
- Lumiere: A Space-Time Diffusion Model for Video Generation
- Latte: Latent Diffusion Transformer for Video Generation
- (Sora) Video generation models as world simulators
- Open-Sora: Democratizing Efficient Video Production for All
- Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
- LongCat-Video Technical Report
- (Veo3 test) Video models are zero-shot learners and reasoners
- long video
- diffusion forcing
- video editing
- RL
- stylization
audio
unified generation and comprehension
- survey
- models
- unified
- separate
- Janus-Pro
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
- (MetaQuery) Transfer between Modalities with MetaQueries
- (BAGEL) Emerging Properties in Unified Multimodal Pretraining
- Emu: Generative Pretraining in Multimodality
- (Emu2) Generative Multimodal Models are In-Context Learners
- Emu3: Next-Token Prediction is All You Need
- NExT-GPT: Any-to-Any Multimodal LLM
- other relevant papers
- blogs
evaluation
world models
blogs
VLA
JEPA
world simulator
“Embodied AI”
navigation
physical reasoning