💡 This post initially focused on interpretability for multimodal models; papers from many other fields were later added as well, simply for convenience.
Resources
Interpretability for MLLMs
- survey
- probing
- representation
- circuit
- SAE
- visualization
- others
- information flow
- **Cross-modal Information Flow in Multimodal Large Language Models
- *From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
- *What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
- The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
- Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- analyses on MLLMs
Other fields of MLLMs
visual pretraining
spatial
good
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
- (ViT+LLM > ViT) Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
- ! Learning Visual Composition through Improved Semantic Guidance
- (prompt-based) Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
- (prompt-based) Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
- Probing the Role of Positional Information in Vision-Language Models
evaluation
REC
attention
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
- MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
- Lost in the middle: How language models use long contexts
- Efficient streaming language models with attention sinks
- (delimiters) Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
- EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
- MagicPIG: LSH Sampling for Efficient LLM Generation
- Label words are anchors: An information flow perspective for understanding in-context learning
- ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
- POS (a minimal RoPE sketch follows this block)
- Multimodal Musings “Behind Closed Doors” (Part 3): Positional Encoding
- Community Contribution | An Illustrated Guide to RoPE Rotary Position Embedding and Its Properties
- Base of RoPE Bounds Context Length
- Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
- NoPE
- Transformer Language Models without Positional Encodings Still Learn Positional Information
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings
- The Impact of Positional Encoding on Length Generalization in Transformers
- (NoPE limits) Length Generalization of Causal Transformers without Position Encoding
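To make the RoPE entries above more concrete, here is a minimal toy sketch of rotary position embedding. It is my own simplification written for this note, not code from any of the linked posts or papers; the function name `rope` and the tensor shapes are assumptions.

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, dim), dim even."""
    dim = x.shape[-1]
    half = dim // 2
    # Per-pair frequencies: theta_i = base^(-2i / dim)
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) * 2.0 / dim)
    angles = positions[:, None].to(x.dtype) * inv_freq[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                               # consecutive 2-D pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                            # rotate each pair by m * theta_i
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Key property: the score <rope(q, m), rope(k, n)> depends only on the offset m - n.
torch.manual_seed(0)
q, k = torch.randn(1, 64), torch.randn(1, 64)
s1 = rope(q, torch.tensor([3])) @ rope(k, torch.tensor([7])).T
s2 = rope(q, torch.tensor([13])) @ rope(k, torch.tensor([17])).T
print(torch.allclose(s1, s2, atol=1e-5))  # True: same relative offset, same score
```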
- Long context
hallucination
- survey
- *Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
- Interpreting and editing vision-language representations to mitigate hallucinations
- Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- Debiasing Multimodal Large Language Models
token compression
- survey
- methods (a token-pruning sketch follows this list)
- (CDPruner) Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
- *AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
- (FasterVLM) [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
- (FastV) An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
- *Inference Optimal VLMs Need Only One Visual Token but Larger Models
- TokenPacker: Efficient Visual Projector for Multimodal LLM
- Matryoshka Multimodal Models
- Matryoshka Query Transformer for Large Vision-Language Models
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
- FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
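As a rough illustration of what attention-based visual token pruning (in the FastV / FasterVLM spirit) does, here is a toy sketch that keeps only the most-attended image tokens. This is a simplification I wrote for this note, not any of the released implementations; the function name, tensor layout, and keep ratio are assumptions.

```python
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_len, keep_ratio=0.5):
    """
    Keep only the visual tokens that receive the most attention.

    hidden:   (seq_len, d)        hidden states after some decoder layer
    attn:     (heads, seq, seq)   attention weights of that layer
    vis_start, vis_len:           slice of the sequence holding image tokens
    keep_ratio:                   fraction of visual tokens to retain
    """
    vis_slice = slice(vis_start, vis_start + vis_len)
    # Average attention each visual token receives from all later (text) tokens.
    received = attn[:, vis_start + vis_len:, vis_slice].mean(dim=(0, 1))   # (vis_len,)
    k = max(1, int(vis_len * keep_ratio))
    keep_local = received.topk(k).indices.sort().values                    # keep original order
    keep = torch.cat([
        torch.arange(vis_start),                                           # tokens before the image
        vis_start + keep_local,                                            # selected visual tokens
        torch.arange(vis_start + vis_len, hidden.size(0)),                 # tokens after the image
    ])
    return hidden[keep], keep

# Toy usage with random tensors.
seq, d, heads = 40, 16, 4
hidden = torch.randn(seq, d)
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
pruned, kept = prune_visual_tokens(hidden, attn, vis_start=5, vis_len=20, keep_ratio=0.25)
print(pruned.shape, kept.shape)  # torch.Size([25, 16]) torch.Size([25])
```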
active perception: please refer to this post.
Dataset
- general
- VQA: Visual Question Answering
- (VQA v2.0) Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
- (GQA) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- [website] https://cs.stanford.edu/people/dorarad/gqa/index.html
- (TextVQA) Towards VQA Models That Can Read
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- MMBench: Is Your Multi-modal Model an All-around Player?
- spatial
- (ARO) When and why vision-language models behave like bags-of-words, and what to do about it?
- (Whatsup) What’s “up” with vision-language models? Investigating their struggle with spatial reasoning
- [repo] https://github.com/amitakamath/whatsup_vlms
- (VSR) Visual Spatial Reasoning
- [repo] https://github.com/cambridgeltl/visual-spatial-reasoning
- SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
- (COMFORT) Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
- hallucination
Models
llm
vlm
- basic (a CLIP/SigLIP loss sketch follows this list)
- (Transformer) Attention is all you need
- (ViT) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- (CLIP) Learning Transferable Visual Models From Natural Language Supervision
- (SigLIP) Sigmoid Loss for Language Image Pre-Training
- (PACL) Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
- (BLIP-2) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- Qwen2.5-VL Technical Report
- (DFN) Data Filtering Networks
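For reference, the core difference between the CLIP and SigLIP objectives listed above fits in a few lines. The sketch below is my own simplification (fixed temperature and bias instead of learned ones, mean over all pairs), not the papers' released code.

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, temperature=0.07):
    """Softmax (InfoNCE) contrastive loss over a batch of paired embeddings."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(img.size(0))                 # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (i, j) pair becomes an independent binary problem.
    t and b are learnable in SigLIP; fixed constants here for illustration."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.T * t + b
    labels = 2 * torch.eye(img.size(0)) - 1            # +1 on the diagonal, -1 elsewhere
    # Averaged over all B*B pairs here for simplicity.
    return -F.logsigmoid(labels * logits).mean()

B, d = 8, 32
img, txt = torch.randn(B, d), torch.randn(B, d)
print(clip_loss(img, txt).item(), siglip_loss(img, txt).item())
```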
- LLaVA series
- (LLaVA) Visual Instruction Tuning
- (LLaVA-1.5) Improved Baselines with Visual Instruction Tuning
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- (InternVL 1.5) How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- (InternVL 2.5) Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
- visual contrastive learning (a SimCLR-style loss sketch follows this list)
- (SimCLR) A Simple Framework for Contrastive Learning of Visual Representations
- (MoCo) Momentum Contrast for Unsupervised Visual Representation Learning
- (BEiT) BEiT: BERT Pre-Training of Image Transformers
- (MAE) Masked Autoencoders Are Scalable Vision Learners
- (iBOT) iBOT: Image BERT Pre-Training with Online Tokenizer
- (BYOL) Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
- (DINO) Emerging Properties in Self-Supervised Vision Transformers
- DINOv2: Learning Robust Visual Features without Supervision
- DINOv3
- Self-Supervised Learning: A Very Detailed Walkthrough (Table of Contents)
- A Long-Form Deep Dive into the Full DINO Series: The Peak of Contrastive Visual Representation Learning
- A Long-Form Deep Dive into DINOv3 (A Supplement to the DINO Series)
- DINO & DINOv2: Disrupting Self-Supervised Visual Feature Representation Learning
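Since several entries in this block (SimCLR, MoCo) revolve around the same instance-discrimination objective, here is a compact NT-Xent sketch in the SimCLR style. It is an illustrative toy I wrote for this note, not the official implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """
    SimCLR-style NT-Xent loss for two augmented views of the same batch.
    z1, z2: (B, d) projections of the two views; positives are pairs (i, i + B).
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)       # (2B, d)
    sim = z @ z.T / temperature                                # (2B, 2B) cosine similarities
    n = z.size(0)
    sim.fill_diagonal_(float("-inf"))                          # exclude self-similarity
    targets = torch.arange(n).roll(n // 2)                     # i's positive is i + B (mod 2B)
    return F.cross_entropy(sim, targets)

B, d = 16, 64
z1, z2 = torch.randn(B, d), torch.randn(B, d)
print(nt_xent(z1, z2).item())
```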
- resolution
- basic
generative models
image
basic (T2I, with a classifier-free guidance sketch after this list)
- (VAE) Auto-Encoding Variational Bayes
- (DDPM) Denoising Diffusion Probabilistic Models
- (DDIM) Denoising Diffusion Implicit Models
- (Classifier-guided) Diffusion Models Beat GANs on Image Synthesis
- (Classifier-free) Classifier-free diffusion guidance
- (VQGAN) Taming transformers for high-resolution image synthesis
- (DiT) Scalable Diffusion Models with Transformers
- (LDM) High-resolution image synthesis with latent diffusion models
- (GLIDE) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- (Imagen) Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- (DALL-E) Zero-Shot Text-to-Image Generation
- (DALL-E-2) Hierarchical Text-Conditional Image Generation with CLIP Latents
- (DALL-E-3) Improving Image Generation with Better Captions
- (ControlNet) Adding Conditional Control to Text-to-Image Diffusion Models
- Consistency Models
- (LCM) Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
- (SDXL Turbo) Adversarial Diffusion Distillation
- (EDM) Elucidating the Design Space of Diffusion-Based Generative Models
- Flow Matching for Generative Modeling
- (Stable Diffusion 3) Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- A Review of the Evolution of the Stable Diffusion Technical Roadmap
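The classifier-free guidance trick referenced in this list ((Classifier-free) Classifier-free diffusion guidance) boils down to one line combining conditional and unconditional noise predictions. Below is a hedged sketch with a dummy model; the `model(x, t, cond)` signature and the `cond=None` convention for the unconditional branch are my own assumptions.

```python
import torch

@torch.no_grad()
def cfg_epsilon(model, x_t, t, cond, guidance_scale=7.5):
    """
    Classifier-free guidance: mix conditional and unconditional noise predictions.
    `model(x, t, cond)` is any epsilon-predicting denoiser; cond=None means unconditional.
    """
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    # Push the prediction along the direction contributed by the condition, scaled by w.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a dummy "model".
def dummy_model(x, t, cond):
    return x * 0.1 if cond is None else x * 0.2

x_t = torch.randn(1, 4, 8, 8)                       # e.g. a latent for an LDM-style model
eps = cfg_epsilon(dummy_model, x_t, t=torch.tensor([500]), cond="a photo of a cat")
print(eps.shape)
```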
image editing
- InstructPix2Pix: Learning to Follow Image Editing Instructions
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
- Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry
- Diffusion Model-Based Image Editing: A Survey
video
survey
generation
- Video Diffusion Models
- Latte: Latent Diffusion Transformer for Video Generation
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
- Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
I2V
video editing
autoregressive image generation
AGI
- unified generation and comprehension
- papers
- Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
- Janus-Pro
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
- (BAGEL) Emerging Properties in Unified Multimodal Pretraining
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
- Emu: Generative Pretraining in Multimodality
- (Emu2) Generative Multimodal Models are In-Context Learners
- Emu3: Next-Token Prediction is All You Need
- NExT-GPT: Any-to-Any Multimodal LLM
- blogs
- papers
- new mllm archs
- world models
blogs
VLA
JEPA
world simulator
“Embodied AI”
navigation
physical reasoning