A survey on MM-LLMs for long video understanding.
Related materials for the paper "From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding".
Model | Year | Visual Encoder | LLM Backbone | Image-level Connector | Video-level Connector | Long-video-level Connector | Frames | Tokens (per frame/total) | Training Hardware | PreT | IT | Long
---|---|---|---|---|---|---|---|---|---|---|---|---
InstructBLIP | 23.05 | EVA-CLIP-ViT-G/14 | FlanT5, Vicuna-7B/13B | Q-Former | -- | -- | 4 | 32/128 | 16 A100-40G | Y-N-N | Y-N-N | No |
VideoChat | 23.05 | EVA-CLIP-ViT-G/14 | StableVicuna-13B | Q-Former | Global multi-head relation aggregator | -- | 8 | /32 | 1 A10 | Y-Y-N | Y-Y-N | No |
Video-LLaMA | 23.06 | EVA-CLIP-ViT-G/14 | LLaMA, Vicuna | Q-Former | Q-Former | -- | 8 | /32 | - | Y-Y-N | Y-Y-N | No |
Video-ChatGPT | 23.06 | CLIP-ViT-L/14 | Vicuna1.1-7B | Spatial-pooling | Temporal-pooling | -- | 100 | /356 | 8 A100-40G | N-N-N | N-Y-N | No |
Valley | 23.06 | CLIP-ViT-L/14 | StableVicuna-7B/13B | -- | Transformer and Mean pooling | -- | 0.5 fps | /256+T | 8 A100 80G | Y-Y-N | Y-Y-N | No |
MovieChat | 23.07 | EVA-CLIP-ViT-G/14 | LLaMA-7B | Q-Former | Frame merging, Q-Former | Merging adjacent frames | 2048 | 32/32 | - | E2E | E2E | Yes
Qwen-VL | 23.08 | OpenCLIP-ViT-bigG | Qwen-7B | Cross-attention | -- | -- | 4 | /256 | - | Y-N-N | Y-N-N | No
Chat-UniVi | 23.11 | CLIP-ViT-L/14 | Vicuna1.5-7B | Token merging | -- | -- | 64 | /112 | - | Y-N-N | Y-Y-N | No |
Video-LLaVA | 23.11 | LanguageBind-ViT-L/14 | Vicuna1.5-7B | -- | -- | -- | 8 | 256/2048 | 4 A100-80G | Y-Y-N | Y-Y-N | No |
LLaMA-VID | 23.11 | CLIP-ViT-L/14 | Vicuna-7B/13B | Context attention and pooling | -- | -- | 1 fps | 2/ | 8 A100 | Y-Y-N | Y-Y-Y | Yes
VTimeLLM | 23.11 | CLIP-ViT-L/14 | Vicuna1.5-7B/13B | Frame feature | -- | -- | 100 | 1/100 | 1 RTX-4090 | Y-Y-N | N-Y-N | Yes |
VideoChat2 | 23.11 | EVA-CLIP-ViT-G/14 | Vicuna0-7B | -- | Q-Former | -- | 16 | /96 | - | Y-Y-N | Y-Y-N | No |
Vista-LLaMA | 23.12 | EVA-CLIP-ViT-G/14 | LLaVa-Vicuna-7B | Q-Former | Temporal Q-Former | -- | 16 | 32/512 | 8 A100-80GB | E2E | E2E | No |
TimeChat | 23.12 | EVA-CLIP-ViT-G/14 | LLaMA2-7B | Q-Former | Sliding window Q-Former | Time-aware encoding | 96 | /96 | 8 V100-32G | Y-Y-N | N-N-Y | Yes |
VaQuitA | 23.12 | CLIP-ViT-L/14 | LLaVA1.5-LLaMA-7B | -- | Video Perceiver, VQ-Former | -- | 100 | /356 | 8 A100-80GB | E2E | E2E | No |
Dolphins | 23.12 | CLIP-ViT-L/14 | OpenFlamingo | Perceiver Resampler, Gated cross-attention | Time embedding | -- | -- | -- | 4 A100 | N-Y-N | Y-Y-N | No
Momentor | 24.02 | CLIP-ViT-L/14 | LLaMA-7B | Frame feature | Temporal Perception Module | Grounded Event-Sequence Modeling | 300 | 1/300 | 8 A100 | Y-Y-N | N-Y-N | Yes
MovieLLM | 24.03 | CLIP-ViT-L/14 | Vicuna-7B/13B | Context attention and pooling | -- | -- | 1 fps | 2/ | 4 A100 | Y-Y-N | Y-Y-Y | Yes
MA-LMM | 24.04 | EVA-CLIP-ViT-G/14 | Vicuna-7B | Q-Former | Memory Bank Compression | Merging adjacent frames | 100 | /32 | 4 A100 | E2E | E2E | Yes |
PLLaVA | 24.04 | CLIP-ViT-L/14 | LLaVA-Next-LLM | Adaptive Pooling | -- | -- | 64 | 2304 | - | Y-N-N | Y-Y-N | Yes
LongVLM | 24.04 | CLIP-ViT-L/14 | Vicuna1.1-7B | Hierarchical token merging | -- | -- | 100 | /305 | 4 A100 80G | Y-N-N | Y-Y-N | Yes
MiniGPT4-Video | 24.04 | EVA-CLIP-ViT-G/14 | LLaMA2-7B, Mistral-7B | Merging adjacent tokens | -- | -- | 90 | 64/5760 | - | Y-Y-N | N-Y-N | No |
RED-VILLM | 24.04 | OpenCLIP-ViT-bigG | Qwen-7B | Spatial pooling | Temporal pooling | -- | 100 | /1124 | - | Y-N-N | Y-Y-N | No
ST-LLM | 24.04 | BLIP-2 | InstructBLIP-Vicuna1.1-7B | Q-Former | Masked video modeling | Global-Local input | 16 | /512 | 8 A100 | E2E | E2E | No |
LLaVA-NeXT-Video | 24.04 | CLIP-ViT-L/14 | Vicuna1.5-7B/13B, Nous-Hermes-2-Yi-34B | Merging adjacent tokens | -- | -- | 32 | 4608 | - | Y-Y-N | Y-Y-N | No |
Mantis-Idefics2 | 24.05 | SigLIP-SO400M | Mistral0.1-7B | Perceiver resampler | -- | -- | 8 | 64/512 | 16 A100-40G | Y-N-N | N-Y-N | No |
VideoLLaMA 2 | 24.06 | CLIP-ViT-L/14 | Mistral-7B-Instruct | Spatial-Temporal Convolution | -- | -- | 8 | /576 | - | Y-Y-N | Y-Y-N | No
LongVA | 24.06 | CLIP-ViT-L/14 | Qwen2-7B-224K | Merging adjacent tokens | Expanding tokens | -- | 384 | 55,296 | 8x A100-80G | - | Y-N-N | Yes |
Artemis | 24.06 | CLIP-ViT-L/14 | Vicuna1.5-7B | Average pooling | -- | -- | 5 | /356 | 8 x A800 | Y-Y-N | N-Y-N | No
VideoGPT+ | 24.06 | CLIP-ViT-L/14, InternVideo-v2 | Phi3-Mini-3.8B | Adaptive pooling | Adaptive pooling | -- | 16 | /2560 | 8 x A100 40G | Y-Y-N | N-Y-N | No |
IXC-2.5 | 24.07 | CLIP-ViT-L/14-490 | InternLM2-7B | Merging adjacent tokens | Expanding tokens | Frame index | 64 | 400/25600 | - | Y-Y-N | Y-Y-N | No |
EVLM | 24.07 | EVA2-CLIP-E-Plus | Qwen-14B-Chat 1.0 | Gated cross attention | -- | -- | -- | /16 | - | Y-Y-N | Y-Y-N | No |
SlowFast-LLaVA | 24.07 | CLIP-ViT-L/14 | Vicuna1.5-7B | Merging adjacent tokens | Slow and fast pathway | -- | 50 | 3680 | A100-80G | - | - | Yes
LLaVA-Interleave | 24.07 | SigLIP-SO400M | Qwen1.5-0.5B/7B/14B | -- | -- | -- | 16 | 729/11664 | - | Y-N-N | Y-Y-N | No |
Kangaroo | 24.08 | EVA-CLIP-ViT-G/14 | LLaMA3-8B | 3D Depthwise convolution | -- | -- | -- | -- | - | Y-Y-N | Y-Y-Y | Yes
VITA | 24.08 | InternViT-300M-448px | Mixtral 8x7B | MLP | -- | -- | 16 | 256/4096 | - | Y-Y-N | Y-Y-N | No |
LLaVA-OneVision | 24.08 | SigLIP-SO400M | Qwen2-7B | Merging adjacent tokens | -- | -- | 1 fps | 729/ | - | Y-N-N | Y-Y-N | No |
LONGVILA | 24.08 | SigLIP-SO400M | Qwen2-1.5B/7B | Multi-Modal Sequence Parallelism | -- | -- | 1024 | 256/ | 256 A100 80G | Y-Y-N | Y-Y-Y | Yes
LongLLaVA | 24.09 | CLIP-ViT-B/32 | LLaVA1.6-13B | Merging adjacent tokens | Mamba Layers | Hybrid architecture | 256 | 144/ | 24 A800 80G | Y-N-N | Y-Y-N | Yes |
Qwen2-VL | 24.09 | CLIP-ViT-L/14 | Qwen2-1.5B/7B/72B | Merging adjacent tokens | 3D convolutions | -- | 2 fps | 66/ | - | Y-N-N | Y-Y-N | No |
Video-XL | 24.09 | CLIP-ViT-L/14 | Qwen-2-7B | Merging adjacent tokens | Visual Summarization Token and Dynamic Compression | -- | 128 | -- | 8 A800-80G | Y-N-N | Y-Y-N | Yes
Oryx-1.5 | 24.10 | OryxViT | Qwen-2.5-7B/32B | Variable-Length Self-Attention | Dynamic Compressor | -- | 64 | 256/ | 64 A800-80G | Y-Y-N | Y-Y-Y | No
TimeMarker | 24.11 | LLaVA-Encoder | LLaVA-LLM | Adaptive Token Merge and Temporal Separator Tokens Integration | -- | -- | 128 | -- | - | Y-Y-N | Y-Y-Y | Yes
NVILA | 24.12 | SigLIP-SO400M | Qwen2-7B/14B | Spatial-to-Channel Reshaping | Temporal Averaging | -- | 256 | /8192 | 128 H100-80G | Y-Y-N | Y-Y-Y | Yes
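
To make the Connector, Frames, and Tokens columns above concrete, here is a minimal sketch of the generic pipeline most listed models share: a frozen visual encoder yields patch features for each sampled frame, an image-level connector (plain spatial pooling in this toy example) compresses them to a few tokens per frame, and the concatenated visual tokens are projected into the LLM embedding space. This is not the implementation of any specific model in the table; the module name, dimensions, pooling size, and 8-frame budget are illustrative assumptions.

```python
# Illustrative sketch only: frozen visual encoder -> image-level connector
# (spatial pooling) -> concatenate across frames -> project into LLM space.
import torch
import torch.nn as nn


class SpatialPoolConnector(nn.Module):
    """Toy image-level connector: pool each frame's patch grid down to a few tokens."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, pooled_side: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled_side)  # e.g. 24x24 patches -> 4x4 = 16 tokens
        self.proj = nn.Linear(vis_dim, llm_dim)        # map into the LLM's embedding space

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_frames, num_patches, vis_dim) from a frozen ViT
        t, p, c = patch_feats.shape
        side = int(p ** 0.5)
        x = patch_feats.transpose(1, 2).reshape(t, c, side, side)
        x = self.pool(x).flatten(2).transpose(1, 2)    # (num_frames, pooled_side**2, vis_dim)
        x = self.proj(x)                               # (num_frames, tokens_per_frame, llm_dim)
        return x.reshape(1, -1, x.shape[-1])           # concatenate frames along the token axis


if __name__ == "__main__":
    frames, patches, vis_dim = 8, 24 * 24, 1024        # 8 sampled frames, ViT-L-style features (assumed)
    feats = torch.randn(frames, patches, vis_dim)
    video_tokens = SpatialPoolConnector()(feats)
    # 8 frames x 16 tokens/frame = 128 visual tokens handed to the LLM,
    # the kind of budget the Frames / Tokens columns report.
    print(video_tokens.shape)                          # torch.Size([1, 128, 4096])
```
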
- Video-MME: Popular video understanding evaluation benchmark covering short, medium, and long videos, with 900 videos and 2,700 annotations. The average duration is 17.0 minutes. Project, GitHub, Dataset, Paper
- HourVideo: Hour-level video understanding evaluation benchmark with 500 long videos and 12,976 annotations. The average duration is 45.7 minutes. Project, GitHub, Dataset, Paper
- HLV-1K: Hour-level video understanding evaluation benchmark with 1,009 long videos and 14,847 annotations. The average duration is 55.0 minutes. Project, GitHub, Dataset, Paper
- LVBench: Hour-level video understanding evaluation benchmark with 103 long videos and 1,549 annotations. The average duration is 68.4 minutes. Project, GitHub, Dataset, Paper
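
The hour-scale durations of these benchmarks are what make the frame and token budgets in the model table binding at evaluation time. The sketch below is not any benchmark's official toolkit; it only shows the two pieces of arithmetic involved: uniformly sampling a fixed frame budget across a long video, and checking whether the resulting visual tokens fit an LLM context window. The 30 fps rate, 144 tokens per frame, and context lengths are illustrative assumptions.

```python
# Illustrative budget arithmetic for hour-level video evaluation (assumed values).
def uniform_frame_indices(total_frames: int, budget: int) -> list[int]:
    """Pick `budget` frame indices spread evenly across the whole video."""
    if total_frames <= budget:
        return list(range(total_frames))
    step = total_frames / budget
    return [int(i * step) for i in range(budget)]


def fits_context(frame_budget: int, tokens_per_frame: int, context_len: int,
                 text_tokens: int = 512) -> bool:
    """Rough check: visual tokens plus prompt/answer tokens must fit the context window."""
    return frame_budget * tokens_per_frame + text_tokens <= context_len


if __name__ == "__main__":
    # An LVBench-scale video: roughly 68 minutes at an assumed 30 fps.
    total = 68 * 60 * 30
    idx = uniform_frame_indices(total, budget=128)
    print(len(idx), idx[:3], idx[-1])       # 128 evenly spread frame indices
    # 128 frames x 144 tokens/frame overflows a 16K context but fits a 32K one.
    print(fits_context(128, 144, 16_384))   # False
    print(fits_context(128, 144, 32_768))   # True
```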


If you find our survey and related materials useful in your research, please consider citing:
```bibtex
@article{zou2024seconds,
  title={From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding},
  author={Zou, Heqing and Luo, Tianze and Xie, Guiyang and Lv, Fengmao and Wang, Guangcong and Chen, Juanyang and Wang, Zhuochen and Zhang, Hansheng and Zhang, Huaijian and others},
  journal={arXiv preprint arXiv:2409.18938},
  year={2024}
}
```