Resource
dataset
image token compression
(multimodal image token compression)
- *AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
- (FasterVLM) [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
- (FastV) An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
- *Inference Optimal VLMs Need Only One Visual Token but Larger Models
- TokenPacker: Efficient Visual Projector for Multimodal LLM
- Matryoshka Multimodal Models
- Matryoshka Query Transformer for Large Vision-Language Models
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
- FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
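Note: most training-free methods above (FastV, FasterVLM, SparseVLM) share one core move: rank visual tokens by the attention they receive and drop the rest. A minimal sketch of that shared idea, not any single paper's exact method; the function name and tensor layout are hypothetical:

```python
import torch

def prune_visual_tokens(hidden_states, attn_weights, visual_start, visual_len, keep_ratio=0.5):
    """Keep the visual tokens that receive the most attention from later (text) tokens.

    hidden_states: (seq_len, dim) activations at some decoder layer
    attn_weights:  (seq_len, seq_len) attention, averaged over heads
    visual_start, visual_len: span of the image tokens in the sequence
    """
    visual_end = visual_start + visual_len
    # Mean attention each visual token receives from the tokens after the image span.
    text_to_visual = attn_weights[visual_end:, visual_start:visual_end].mean(dim=0)
    k = max(1, int(visual_len * keep_ratio))
    keep = torch.topk(text_to_visual, k).indices.sort().values  # preserve original order
    kept_visual = hidden_states[visual_start:visual_end][keep]
    pruned = torch.cat([hidden_states[:visual_start], kept_visual, hidden_states[visual_end:]], dim=0)
    return pruned, keep

# Toy check: 5 prefix tokens + 20 visual tokens + 5 text tokens.
h = torch.randn(30, 64)
a = torch.softmax(torch.randn(30, 30), dim=-1)
out, kept = prune_visual_tokens(h, a, visual_start=5, visual_len=20)
print(out.shape)  # torch.Size([20, 64]): 10 of 20 visual tokens dropped
```

Per their titles, FastV applies this after an early decoder layer (layer 2), while FasterVLM ranks by the vision encoder's [CLS] attention instead; the top-k selection step has the same shape in both.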
spatial
- good
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
- (ViT+LLM > ViT) Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
- ! Learning Visual Composition through Improved Semantic Guidance
- (prompt-based) Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
- (prompt-based) Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
- Probing the Role of Positional Information in Vision-Language Models
- dataset
- (ARO) When and why vision-language models behave like bags-of-words, and what to do about it?
- (Whatsup) What’s “up” with vision-language models? Investigating their struggle with spatial reasoning
- (VSR) Visual Spatial Reasoning
- (GQA) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- evaluation
- REC (referring expression comprehension)
- good
attention
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
- MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
- Lost in the middle: How language models use long contexts
- Efficient streaming language models with attention sinks
- (delimiters) Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
- EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
- MagicPIG: LSH Sampling for Efficient LLM Generation
- Label words are anchors: An information flow perspective for understanding in-context learning
- positional encoding
- Transformer Language Models without Positional Encodings Still Learn Positional Information
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings
- Length Generalization of Causal Transformers without Position Encoding
- Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
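Several entries above (attention sinks, delimiters, label-words-as-anchors) hinge on measuring where attention mass concentrates. A minimal sketch of that measurement, assuming a head-first attention tensor; the function name is hypothetical:

```python
import torch

def attention_sink_mass(attn_weights, sink_positions=(0,)):
    """Fraction of each query's attention landing on designated 'sink' positions.

    attn_weights: (heads, seq_len, seq_len), each row sums to 1.
    Returns (heads, seq_len); values near 1 mean the query dumps
    almost all of its attention on the sink tokens.
    """
    sinks = torch.tensor(sink_positions)
    return attn_weights[:, :, sinks].sum(dim=-1)

# Toy check with causal softmax attention over random scores.
scores = torch.randn(8, 16, 16)
causal = torch.tril(torch.ones(16, 16)).bool()
attn = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)
print(attention_sink_mass(attn).mean().item())  # average mass on token 0
```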
hallucination
- survey
- *Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
- Interpreting and editing vision-language representations to mitigate hallucinations
- Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- Debiasing Multimodal Large Language Models
image tokens
interp
- survey
- information flow
- *What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
- The Narrow Gate: Localized Image-Text Communication in Vision-Language Models
- Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- **Cross-modal Information Flow in Multimodal Large Language Models
- *From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
- others
- **Towards interpreting visual information processing in vision-language models
- **(causal tracing) Understanding Information Storage and Transfer in Multi-modal Large Language Models
- (logit lens) Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
- CLIP
- heat map
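The logit-lens entry above reads intermediate predictions by projecting hidden states straight into vocabulary space. A minimal sketch with toy shapes; the modules here are hypothetical stand-ins for a real model's final LayerNorm and unembedding matrix:

```python
import torch

def logit_lens(hidden_state, ln_f, unembed, top_k=5):
    """Decode an intermediate residual-stream state as if it were the final one.

    hidden_state: (dim,) activation at some layer
    ln_f: final LayerNorm; unembed: (vocab, dim) output embedding matrix
    """
    logits = unembed @ ln_f(hidden_state)
    return torch.topk(logits, top_k).indices  # current top candidate token ids

dim, vocab = 64, 1000  # toy sizes
ln_f = torch.nn.LayerNorm(dim)
unembed = torch.randn(vocab, dim)
print(logit_lens(torch.randn(dim), ln_f, unembed))
```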
Sources
Papers
vlm
- basic
- (Transformer) Attention is all you need
- (ViT) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- (CLIP) Learning Transferable Visual Models From Natural Language Supervision
- (PACL) Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
- (DINO) Emerging Properties in Self-Supervised Vision Transformers
- DINOv2: Learning Robust Visual Features without Supervision
- (BLIP-2) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- Qwen2.5-VL Technical Report
- (DFN) Data Filtering Networks
- LLaVA series
- (LLaVA) Visual Instruction Tuning
- (LLaVA-1.5) Improved Baselines with Visual Instruction Tuning
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- (InternVL 2.5) Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
- resolution
- basic
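The CLIP entry above trains with a symmetric contrastive (InfoNCE) loss in which matched image-text pairs sit on the diagonal of a batch similarity matrix. A minimal sketch of that loss:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    labels = torch.arange(logits.size(0))            # matched pair i <-> i
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

print(clip_loss(torch.randn(8, 32), torch.randn(8, 32)))  # toy batch of 8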
generative models
image
basic (T2I)
- (DDPM) Denoising Diffusion Probabilistic Models
- (DDIM) Denoising Diffusion Implicit Models
- (Classifier-guided) Diffusion Models Beat GANs on Image Synthesis
- (Classifier-free) Classifier-free diffusion guidance
- (VQGAN) Taming Transformers for High-Resolution Image Synthesis
- (DiT) Scalable Diffusion Models with Transformers
- (LDM) High-resolution image synthesis with latent diffusion models
- (GLIDE) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- (Imagen) Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- (DALL-E) Zero-Shot Text-to-Image Generation
- (DALL-E-2) Hierarchical Text-Conditional Image Generation with CLIP Latents
- (DALL-E-3) Improving Image Generation with Better Captions
- (ControlNet) Adding Conditional Control to Text-to-Image Diffusion Models
- Consistency Models
- (LCM) Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
- (SDXL Turbo) Adversarial Diffusion Distillation
- (EDM) Elucidating the Design Space of Diffusion-Based Generative Models
- Flow Matching for Generative Modeling
- (Stable Diffusion 3) Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
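Classifier-free guidance (listed above) reduces to one line at sampling time: extrapolate from the unconditional noise prediction toward the conditional one. A minimal sketch; the model call signature is a hypothetical stand-in:

```python
import torch

def cfg_noise_prediction(model, x_t, t, cond, uncond, guidance_scale=7.5):
    """model(x, t, c) -> predicted noise; uncond is the empty-prompt embedding.
    guidance_scale=1 recovers plain conditional sampling."""
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy check with a stand-in "model".
toy = lambda x, t, c: 0.1 * x + c.mean()
print(cfg_noise_prediction(toy, torch.randn(4), 10, torch.ones(8), torch.zeros(8)))
```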
image editing
- InstructPix2Pix: Learning to Follow Image Editing Instructions
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
- Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry
- Diffusion Model-Based Image Editing: A Survey
video
survey
generation
- Video Diffusion Models
- Latte: Latent Diffusion Transformer for Video Generation
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
- Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
I2V
video editing
AGI
- world models
navigation
spatial reasoning
physical reasoning
- world models
Analysis
Model capabilities