
LV-LLMs

A survey on MM-LLMs for long video understanding.

Related materials for the survey paper "From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding".

Long video understanding MM-LLMs

| Model | Year | Visual Encoder | LLMs | Image-level Connector | Video-level Connector | Long-video-level Connector | Frame | Token | Hardware | PreT | IT | Long |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InstructBLIP | 23.05 | EVA-CLIP-ViT-G/14 | FlanT5, Vicuna-7B/13B | Q-Former | -- | -- | 4 | 32/128 | 16 A100-40G | Y-N-N | Y-N-N | No |
| VideoChat | 23.05 | EVA-CLIP-ViT-G/14 | StableVicuna-13B | Q-Former | Global multi-head relation aggregator | -- | 8 | /32 | 1 A10 | Y-Y-N | Y-Y-N | No |
| Video-LLaMA | 23.06 | EVA-CLIP-ViT-G/14 | LLaMA, Vicuna | Q-Former | Q-Former | -- | 8 | /32 | -- | Y-Y-N | Y-Y-N | No |
| Video-ChatGPT | 23.06 | CLIP-ViT-L/14 | Vicuna1.1-7B | Spatial pooling | Temporal pooling | -- | 100 | /356 | 8 A100-40G | N-N-N | N-Y-N | No |
| Valley | 23.06 | CLIP-ViT-L/14 | StableVicuna-7B/13B | -- | Transformer and mean pooling | -- | 0.5 fps | /256+T | 8 A100-80G | Y-Y-N | Y-Y-N | No |
| MovieChat | 23.07 | EVA-CLIP-ViT-G/14 | LLaMA-7B | Q-Former | Frame merging, Q-Former | Merging adjacent frames | 2048 | 32/32 | -- | E2E | E2E | Yes |
| Qwen-VL | 23.08 | OpenCLIP-ViT-bigG | Qwen-7B | Cross-attention | -- | -- | 4 | /256 | -- | Y-N-N | Y-N-N | No |
| Chat-UniVi | 23.11 | CLIP-ViT-L/14 | Vicuna1.5-7B | Token merging | -- | -- | 64 | /112 | -- | Y-N-N | Y-Y-N | No |
| Video-LLaVA | 23.11 | LanguageBind-ViT-L/14 | Vicuna1.5-7B | -- | -- | -- | 8 | 256/2048 | 4 A100-80G | Y-Y-N | Y-Y-N | No |
| LLaMA-VID | 23.11 | CLIP-ViT-L/14 | Vicuna-7B/13B | Context attention and pooling | -- | -- | 1 fps | 2/ | 8 A100 | Y-Y-N | Y-Y-Y | Yes |
| VTimeLLM | 23.11 | CLIP-ViT-L/14 | Vicuna1.5-7B/13B | Frame feature | -- | -- | 100 | 1/100 | 1 RTX-4090 | Y-Y-N | N-Y-N | Yes |
| VideoChat2 | 23.11 | EVA-CLIP-ViT-G/14 | Vicuna0-7B | -- | Q-Former | -- | 16 | /96 | -- | Y-Y-N | Y-Y-N | No |
| Vista-LLaMA | 23.12 | EVA-CLIP-ViT-G/14 | LLaVA-Vicuna-7B | Q-Former | Temporal Q-Former | -- | 16 | 32/512 | 8 A100-80G | E2E | E2E | No |
| TimeChat | 23.12 | EVA-CLIP-ViT-G/14 | LLaMA2-7B | Q-Former | Sliding-window Q-Former | Time-aware encoding | 96 | /96 | 8 V100-32G | Y-Y-N | N-N-Y | Yes |
| VaQuitA | 23.12 | CLIP-ViT-L/14 | LLaVA1.5-LLaMA-7B | -- | Video Perceiver, VQ-Former | -- | 100 | /356 | 8 A100-80G | E2E | E2E | No |
| Dolphins | 23.12 | CLIP-ViT-L/14 | OpenFlamingo | Perceiver resampler | Gated cross-attention | Time embedding | -- | -- | 4 A100 | N-Y-N | Y-Y-N | No |
| Momentor | 24.02 | CLIP-ViT-L/14 | LLaMA-7B | Frame feature | Temporal Perception Module | Grounded Event-Sequence Modeling | 300 | 1/300 | 8 A100 | Y-Y-N | N-Y-N | Yes |
| MovieLLM | 24.03 | CLIP-ViT-L/14 | Vicuna-7B/13B | Context attention and pooling | -- | -- | 1 fps | 2/ | 4 A100 | Y-Y-N | Y-Y-Y | Yes |
| MA-LMM | 24.04 | EVA-CLIP-ViT-G/14 | Vicuna-7B | Q-Former | Memory bank compression | Merging adjacent frames | 100 | /32 | 4 A100 | E2E | E2E | Yes |
| PLLaVA | 24.04 | CLIP-ViT-L/14 | LLaVA-Next-LLM | Adaptive pooling | -- | -- | 64 | 2304 | -- | Y-N-N | Y-Y-N | Yes |
| LongVLM | 24.04 | CLIP-ViT-L/14 | Vicuna1.1-7B | Hierarchical token merging | -- | -- | 100 | /305 | 4 A100-80G | Y-N-N | Y-Y-N | Yes |
| MiniGPT4-Video | 24.04 | EVA-CLIP-ViT-G/14 | LLaMA2-7B, Mistral-7B | Merging adjacent tokens | -- | -- | 90 | 64/5760 | -- | Y-Y-N | N-Y-N | No |
| RED-VILLM | 24.04 | OpenCLIP-ViT-bigG | Qwen-7B | Spatial pooling | Temporal pooling | -- | 100 | /1124 | -- | Y-N-N | Y-Y-N | No |
| ST-LLM | 24.04 | BLIP-2 | InstructBLIP-Vicuna1.1-7B | Q-Former | Masked video modeling | Global-local input | 16 | /512 | 8 A100 | E2E | E2E | No |
| LLaVA-NeXT-Video | 24.04 | CLIP-ViT-L/14 | Vicuna1.5-7B/13B, Nous-Hermes-2-Yi-34B | Merging adjacent tokens | -- | -- | 32 | 4608 | -- | Y-Y-N | Y-Y-N | No |
| Mantis-Idefics2 | 24.05 | SigLIP-SO400M | Mistral0.1-7B | Perceiver resampler | -- | -- | 8 | 64/512 | 16 A100-40G | Y-N-N | N-Y-N | No |
| VideoLLaMA 2 | 24.06 | CLIP-ViT-L/14 | Mistral-7B-Instruct | Spatial-temporal convolution | -- | -- | 8 | /576 | -- | Y-Y-N | Y-Y-N | No |
| LongVA | 24.06 | CLIP-ViT-L/14 | Qwen2-7B-224K | Merging adjacent tokens | Expanding tokens | -- | 384 | 55,296 | 8 A100-80G | -- | Y-N-N | Yes |
| Artemis | 24.06 | CLIP-ViT-L/14 | Vicuna1.5-7B | Average pooling | -- | -- | 5 | /356 | 8 A800 | Y-Y-N | N-Y-N | No |
| VideoGPT+ | 24.06 | CLIP-ViT-L/14, InternVideo-v2 | Phi3-Mini-3.8B | Adaptive pooling | Adaptive pooling | -- | 16 | /2560 | 8 A100-40G | Y-Y-N | N-Y-N | No |
| IXC-2.5 | 24.07 | CLIP-ViT-L/14-490 | InternLM2-7B | Merging adjacent tokens | Expanding tokens | Frame index | 64 | 400/25600 | -- | Y-Y-N | Y-Y-N | No |
| EVLM | 24.07 | EVA2-CLIP-E-Plus | Qwen-14B-Chat 1.0 | Gated cross-attention | -- | -- | -- | /16 | -- | Y-Y-N | Y-Y-N | No |
| SlowFast-LLaVA | 24.07 | CLIP-ViT-L/14 | Vicuna1.5-7B | Merging adjacent tokens | Slow and fast pathway | -- | 50 | 3680 | A100-80G | -- | -- | Yes |
| LLaVA-Interleave | 24.07 | SigLIP-SO400M | Qwen1.5-0.5B/7B/14B | -- | -- | -- | 16 | 729/11664 | -- | Y-N-N | Y-Y-N | No |
| Kangaroo | 24.08 | EVA-CLIP-ViT-G/14 | LLaMA3-8B | 3D depthwise convolution | -- | -- | -- | -- | -- | Y-Y-N | Y-Y-Y | Yes |
| VITA | 24.08 | InternViT-300M-448px | Mixtral 8x7B | MLP | -- | -- | 16 | 256/4096 | -- | Y-Y-N | Y-Y-N | No |
| LLaVA-OneVision | 24.08 | SigLIP-SO400M | Qwen2-7B | Merging adjacent tokens | -- | -- | 1 fps | 729/ | -- | Y-N-N | Y-Y-N | No |
| LongVILA | 24.08 | SigLIP-SO400M | Qwen2-1.5B/7B | -- | -- | Multi-Modal Sequence Parallelism | 1024 | 256/ | 256 A100-80G | Y-Y-N | Y-Y-Y | Yes |
| LongLLaVA | 24.09 | CLIP-ViT-B/32 | LLaVA1.6-13B | Merging adjacent tokens | Mamba layers | Hybrid architecture | 256 | 144/ | 24 A800-80G | Y-N-N | Y-Y-N | Yes |
| Qwen2-VL | 24.09 | CLIP-ViT-L/14 | Qwen2-1.5B/7B/72B | Merging adjacent tokens | 3D convolutions | -- | 2 fps | 66/ | -- | Y-N-N | Y-Y-N | No |
| Video-XL | 24.09 | CLIP-ViT-L/14 | Qwen2-7B | Merging adjacent tokens | Visual summarization token and dynamic compression | -- | 128 | -- | 8 A800-80G | Y-N-N | Y-Y-N | Yes |
| Oryx-1.5 | 24.10 | OryxViT | Qwen2.5-7B/32B | Variable-length self-attention | Dynamic compressor | -- | 64 | 256/ | 64 A800-80G | Y-Y-N | Y-Y-Y | No |
| TimeMarker | 24.11 | LLaVA-Encoder | LLaVA-LLM | Adaptive token merge | Temporal separator tokens integration | -- | 128 | -- | -- | Y-Y-N | Y-Y-Y | Yes |
| NVILA | 24.12 | SigLIP-SO400M | Qwen2-7B/14B | Spatial-to-channel reshaping | Temporal averaging | -- | 256 | /8192 | 128 H100-80G | Y-Y-N | Y-Y-Y | Yes |
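
Most of the connectors above compress per-frame visual tokens before they reach the LLM, since the visual token budget (the Frame and Token columns) is what limits how many seconds or hours a model can ingest. The snippet below is a minimal sketch of two recurring patterns, spatial pooling within a frame and merging adjacent frames; the shapes, function names, and reduction factors are illustrative assumptions, not the implementation of any model in the table.

```python
# Minimal sketch of two recurring connector patterns (hypothetical shapes/factors):
# (1) spatial pooling of per-frame patch tokens, (2) merging adjacent frames.
import torch
import torch.nn.functional as F

def spatial_pool(frame_tokens: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """(T, N, D) patch tokens per frame -> (T, N/factor^2, D) via average pooling."""
    t, n, d = frame_tokens.shape
    h = w = int(n ** 0.5)                                      # assume a square patch grid
    x = frame_tokens.reshape(t, h, w, d).permute(0, 3, 1, 2)   # (T, D, H, W)
    x = F.avg_pool2d(x, kernel_size=factor)                    # pool spatially
    return x.flatten(2).permute(0, 2, 1)                       # (T, N', D)

def merge_adjacent_frames(frame_tokens: torch.Tensor, group: int = 2) -> torch.Tensor:
    """Average every `group` consecutive frames: (T, N, D) -> (T/group, N, D)."""
    t, n, d = frame_tokens.shape
    t_trim = (t // group) * group                              # drop any leftover frames
    x = frame_tokens[:t_trim].reshape(t_trim // group, group, n, d)
    return x.mean(dim=1)

# Example: 32 frames, 24x24 = 576 patch tokens per frame, 1024-dim features.
tokens = torch.randn(32, 576, 1024)
tokens = spatial_pool(tokens, factor=2)            # (32, 144, 1024)
tokens = merge_adjacent_frames(tokens, group=2)    # (16, 144, 1024)
visual_sequence = tokens.flatten(0, 1)             # (2304, 1024), fed to the projector/LLM
```

Under these assumed settings, the 32 sampled frames shrink from 18,432 visual tokens to 2,304 before projection into the LLM, which is the kind of reduction the Frame and Token columns summarize.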

Long video understanding benchmarks

  1. Video-MME: A popular video understanding benchmark covering short, medium, and long videos, with 900 videos and 2,700 annotations; the average duration is 17.0 minutes. Project, GitHub, Dataset, Paper

  2. HourVideo: An hour-level video understanding benchmark with 500 long videos and 12,976 annotations; the average duration is 45.7 minutes. Project, GitHub, Dataset, Paper

  3. HLV-1K: An hour-level video understanding benchmark with 1,009 long videos and 14,847 annotations; the average duration is 55.0 minutes. Project, GitHub, Dataset, Paper

  4. LVBench: An hour-level video understanding benchmark with 103 long videos and 1,549 annotations; the average duration is 68.4 minutes. Project, GitHub, Dataset, Paper
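
Evaluation on these benchmarks generally reduces to sampling a bounded number of frames from each (often hour-long) video and packing them with the question into a multiple-choice prompt. The sketch below illustrates that flow; the `decord` reader, the 64-frame budget, the prompt layout, and the example file and question are assumptions for illustration, not the official protocol of any of these benchmarks.

```python
# Hedged sketch: uniform frame sampling + a multiple-choice prompt, roughly how
# long-video QA benchmarks are consumed. Frame budget, reader, and prompt layout
# are assumptions, not any benchmark's official protocol.
from decord import VideoReader, cpu  # assumed video decoder; any frame reader works

def sample_frames(video_path: str, num_frames: int = 64):
    """Uniformly sample `num_frames` RGB frames from a (possibly hour-long) video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    step = (len(vr) - 1) / (num_frames - 1)
    indices = [round(i * step) for i in range(num_frames)]
    return vr.get_batch(indices).asnumpy()        # (num_frames, H, W, 3)

def build_mcq_prompt(question: str, options: list) -> str:
    """Pack a question and its candidate answers into one text prompt."""
    lines = [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    return ("Watch the video and answer with the option letter only.\n"
            f"Question: {question}\n" + "\n".join(lines))

# Hypothetical example; real questions and options come from the benchmark annotations.
frames = sample_frames("example_long_video.mp4", num_frames=64)
prompt = build_mcq_prompt(
    "What does the person do after entering the kitchen?",
    ["Washes dishes", "Opens the fridge", "Turns on the stove", "Leaves the room"],
)
# `frames` and `prompt` are then passed to the model's generate/chat interface, and
# accuracy is the fraction of questions where the predicted letter matches the answer.
```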

Performance on long video benchmarks

(Figure: performance comparison on long video benchmarks.)

Performance on common video benchmarks

(Figure: performance comparison on common video benchmarks.)

Citation

If you find our survey useful in your research, please consider citing:

@article{zou2024seconds,
  title={From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding},
  author={Zou, Heqing and Luo, Tianze and Xie, Guiyang and Lv, Fengmao and Wang, Guangcong and Chen, Juanyang and Wang, Zhuochen and Zhang, Hansheng and Zhang, Huaijian and others},
  journal={arXiv preprint arXiv:2409.18938},
  year={2024}
}
