MM_Interp
Resource dataset (GQA)
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering — https://cs.stanford.edu/people/dorarad/gqa/index.html

Image token compression (multimodal image token compression)
- *AdaFV: Rethinking of Visual-Language Alignment for VLM Acceleration
- (FasterVLM) [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
- (FastV) An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
- *Inference Optimal VLMs Need Only One Visual Token but Larger Models
- TokenPacker: Efficient Visual Projector for Multimodal LLM
- Matryoshka Multimodal Models
- Matryoshka Query Transformer for Large Vision-Language Models
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
- FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

spatial ...
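Several of the papers above (e.g. FasterVLM, FastV) share one core idea: score each visual token by the attention it receives (from the [CLS] token or from text tokens) and keep only the top-scoring ones at inference time. A minimal sketch of that top-k pruning step, with illustrative function and parameter names (`prune_visual_tokens`, `keep_ratio`) that are not from any specific paper:

```python
import numpy as np

def prune_visual_tokens(visual_tokens, attn_scores, keep_ratio=0.5):
    """Keep the visual tokens with the highest attention scores.

    visual_tokens: (n_tokens, dim) array of visual token features
    attn_scores:   (n_tokens,) attention received by each visual token
    keep_ratio:    fraction of tokens to keep (illustrative default)
    """
    n = visual_tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(attn_scores)[-k:]  # indices of the top-k tokens
    keep = np.sort(keep)                 # preserve original spatial order
    return visual_tokens[keep], keep

# toy example: 8 visual tokens with 4-dim features
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
attn = rng.random(8)
pruned, kept = prune_visual_tokens(tokens, attn, keep_ratio=0.5)
print(pruned.shape)  # half the tokens survive: (4, 4)
```

The papers differ mainly in where the scores come from (which layer, [CLS] vs. text attention) and whether pruning is one-shot or progressive; this sketch only shows the shared selection mechanism.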