
LV-LLMs

A survey on MM-LLMs for long video understanding.

Related materials for the survey paper "From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding".

Long video understanding MM-LLMs

| Model | Year | Visual Encoder | LLMs | Image-level Connector | Video-level Connector | Long-video-level Connector | Frame | Token | Hardware | PreT | IT | Long |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InstructBLIP | 23.05 | EVA-CLIP-ViT-G/14 | FlanT5, Vicuna-7B/13B | Q-Former | -- | -- | 4 | 32/128 | 16 A100-40G | Y-N-N | Y-N-N | No |
| VideoChat | 23.05 | EVA-CLIP-ViT-G/14 | StableVicuna-13B | Q-Former | Global multi-head relation aggregator | -- | 8 | /32 | 1 A10 | Y-Y-N | Y-Y-N | No |
| Video-LLaMA | 23.06 | EVA-CLIP-ViT-G/14 | LLaMA, Vicuna | Q-Former | Q-Former | -- | 8 | /32 | -- | Y-Y-N | Y-Y-N | No |
| Video-ChatGPT | 23.06 | CLIP-ViT-L/14 | Vicuna1.1-7B | Spatial pooling | Temporal pooling | -- | 100 | /356 | 8 A100-40G | N-N-N | N-Y-N | No |
| Valley | 23.06 | CLIP-ViT-L/14 | StableVicuna-7B/13B | -- | Transformer and mean pooling | -- | 0.5 fps | /256+T | 8 A100-80G | Y-Y-N | Y-Y-N | No |
| MovieChat | 23.07 | EVA-CLIP-ViT-G/14 | LLaMA-7B | Q-Former | Frame merging, Q-Former | Merging adjacent frames | 2048 | 32/32 | -- | E2E | E2E | Yes |
| Qwen-VL | 23.08 | OpenCLIP-ViT-bigG | Qwen-7B | Cross-attention | -- | -- | 4 | /256 | -- | Y-N-N | Y-N-N | No |
| Chat-UniVi | 23.11 | CLIP-ViT-L/14 | Vicuna1.5-7B | Token merging | -- | -- | 64 | /112 | -- | Y-N-N | Y-Y-N | No |
| Video-LLaVA | 23.11 | LanguageBind-ViT-L/14 | Vicuna1.5-7B | -- | -- | -- | 8 | 256/2048 | 4 A100-80G | Y-Y-N | Y-Y-N | No |
| LLaMA-VID | 23.11 | CLIP-ViT-L/14 | Vicuna-7B/13B | Context attention and pooling | -- | -- | 1 fps | 2/ | 8 A100 | Y-Y-N | Y-Y-Y | Yes |
| VTimeLLM | 23.11 | CLIP-ViT-L/14 | Vicuna1.5-7B/13B | Frame feature | -- | -- | 100 | 1/100 | 1 RTX-4090 | Y-Y-N | N-Y-N | Yes |
| VideoChat2 | 23.11 | EVA-CLIP-ViT-G/14 | Vicuna0-7B | -- | Q-Former | -- | 16 | /96 | -- | Y-Y-N | Y-Y-N | No |
| Vista-LLaMA | 23.12 | EVA-CLIP-ViT-G/14 | LLaVA-Vicuna-7B | Q-Former | Temporal Q-Former | -- | 16 | 32/512 | 8 A100-80G | E2E | E2E | No |
| TimeChat | 23.12 | EVA-CLIP-ViT-G/14 | LLaMA2-7B | Q-Former | Sliding-window Q-Former | Time-aware encoding | 96 | /96 | 8 V100-32G | Y-Y-N | N-N-Y | Yes |
| VaQuitA | 23.12 | CLIP-ViT-L/14 | LLaVA1.5-LLaMA-7B | -- | Video Perceiver, VQ-Former | -- | 100 | /356 | 8 A100-80G | E2E | E2E | No |
| Dolphins | 23.12 | CLIP-ViT-L/14 | OpenFlamingo | Perceiver resampler | Gated cross-attention | Time embedding | -- | -- | 4 A100 | N-Y-N | Y-Y-N | No |
| Momentor | 24.02 | CLIP-ViT-L/14 | LLaMA-7B | Frame feature | Temporal Perception Module | Grounded Event-Sequence Modeling | 300 | 1/300 | 8 A100 | Y-Y-N | N-Y-N | Yes |
| MovieLLM | 24.03 | CLIP-ViT-L/14 | Vicuna-7B/13B | Context attention and pooling | -- | -- | 1 fps | 2/ | 4 A100 | Y-Y-N | Y-Y-Y | Yes |
| MA-LMM | 24.04 | EVA-CLIP-ViT-G/14 | Vicuna-7B | Q-Former | Memory bank compression | Merging adjacent frames | 100 | /32 | 4 A100 | E2E | E2E | Yes |
| PLLaVA | 24.04 | CLIP-ViT-L/14 | LLaVA-Next-LLM | Adaptive pooling | -- | -- | 64 | 2304 | -- | Y-N-N | Y-Y-N | Yes |
| LongVLM | 24.04 | CLIP-ViT-L/14 | Vicuna1.1-7B | Hierarchical token merging | -- | -- | 100 | /305 | 4 A100-80G | Y-N-N | Y-Y-N | Yes |
| MiniGPT4-Video | 24.04 | EVA-CLIP-ViT-G/14 | LLaMA2-7B, Mistral-7B | Merging adjacent tokens | -- | -- | 90 | 64/5760 | -- | Y-Y-N | N-Y-N | No |
| RED-VILLM | 24.04 | OpenCLIP-ViT-bigG | Qwen-7B | Spatial pooling | Temporal pooling | -- | 100 | /1124 | -- | Y-N-N | Y-Y-N | No |
| ST-LLM | 24.04 | BLIP-2 | InstructBLIP-Vicuna1.1-7B | Q-Former | Masked video modeling | Global-local input | 16 | /512 | 8 A100 | E2E | E2E | No |
| LLaVA-NeXT-Video | 24.04 | CLIP-ViT-L/14 | Vicuna1.5-7B/13B, Nous-Hermes-2-Yi-34B | Merging adjacent tokens | -- | -- | 32 | 4608 | -- | Y-Y-N | Y-Y-N | No |
| Mantis-Idefics2 | 24.05 | SigLIP-SO400M | Mistral0.1-7B | Perceiver resampler | -- | -- | 8 | 64/512 | 16 A100-40G | Y-N-N | N-Y-N | No |
| VideoLLaMA 2 | 24.06 | CLIP-ViT-L/14 | Mistral-7B-Instruct | Spatial-temporal convolution | -- | -- | 8 | /576 | -- | Y-Y-N | Y-Y-N | No |
| LongVA | 24.06 | CLIP-ViT-L/14 | Qwen2-7B-224K | Merging adjacent tokens | Expanding tokens | -- | 384 | 55,296 | 8 A100-80G | -- | Y-N-N | Yes |
| Artemis | 24.06 | CLIP-ViT-L/14 | Vicuna1.5-7B | Average pooling | -- | -- | 5 | /356 | 8 A800 | Y-Y-N | N-Y-N | No |
| VideoGPT+ | 24.06 | CLIP-ViT-L/14, InternVideo-v2 | Phi3-Mini-3.8B | Adaptive pooling | Adaptive pooling | -- | 16 | /2560 | 8 A100-40G | Y-Y-N | N-Y-N | No |
| IXC-2.5 | 24.07 | CLIP-ViT-L/14-490 | InternLM2-7B | Merging adjacent tokens | Expanding tokens | Frame index | 64 | 400/25600 | -- | Y-Y-N | Y-Y-N | No |
| EVLM | 24.07 | EVA2-CLIP-E-Plus | Qwen-14B-Chat 1.0 | Gated cross-attention | -- | -- | -- | /16 | -- | Y-Y-N | Y-Y-N | No |
| SlowFast-LLaVA | 24.07 | CLIP-ViT-L/14 | Vicuna1.5-7B | Merging adjacent tokens | Slow and fast pathway | -- | 50 | 3680 | A100-80G | -- | -- | Yes |
| LLaVA-Interleave | 24.07 | SigLIP-SO400M | Qwen1.5-0.5B/7B/14B | -- | -- | -- | 16 | 729/11664 | -- | Y-N-N | Y-Y-N | No |
| Kangaroo | 24.08 | EVA-CLIP-ViT-G/14 | LLaMA3-8B | 3D depthwise convolution | -- | -- | -- | -- | -- | Y-Y-N | Y-Y-Y | Yes |
| VITA | 24.08 | InternViT-300M-448px | Mixtral 8x7B | MLP | -- | -- | 16 | 256/4096 | -- | Y-Y-N | Y-Y-N | No |
| LLaVA-OneVision | 24.08 | SigLIP-SO400M | Qwen2-7B | Merging adjacent tokens | -- | -- | 1 fps | 729/ | -- | Y-N-N | Y-Y-N | No |
| LongVILA | 24.08 | SigLIP-SO400M | Qwen2-1.5B/7B | -- | -- | Multi-Modal Sequence Parallelism | 1024 | 256/ | 256 A100-80G | Y-Y-N | Y-Y-Y | Yes |
| LongLLaVA | 24.09 | CLIP-ViT-B/32 | LLaVA1.6-13B | Merging adjacent tokens | Mamba layers | Hybrid architecture | 256 | 144/ | 24 A800-80G | Y-N-N | Y-Y-N | Yes |
| Qwen2-VL | 24.09 | CLIP-ViT-L/14 | Qwen2-1.5B/7B/72B | Merging adjacent tokens | 3D convolutions | -- | 2 fps | 66/ | -- | Y-N-N | Y-Y-N | No |
| Video-XL | 24.09 | CLIP-ViT-L/14 | Qwen2-7B | Merging adjacent tokens | Visual summarization token and dynamic compression | -- | 128 | -- | 8 A800-80G | Y-N-N | Y-Y-N | Yes |
| Oryx-1.5 | 24.10 | OryxViT | Qwen2.5-7B/32B | Variable-length self-attention | Dynamic compressor | -- | 64 | 256/ | 64 A800-80G | Y-Y-N | Y-Y-Y | No |
| TimeMarker | 24.11 | LLaVA-Encoder | LLaVA-LLM | Adaptive token merge | Temporal separator tokens integration | -- | 128 | -- | -- | Y-Y-N | Y-Y-Y | Yes |
| NVILA | 24.12 | SigLIP-SO400M | Qwen2-7B/14B | Spatial-to-channel reshaping | Temporal averaging | -- | 256 | /8192 | 128 H100-80G | Y-Y-N | Y-Y-Y | Yes |
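
Most of the connectors above compress per-frame visual tokens before they reach the LLM, since the visual token budget (the Frame and Token columns) is what limits how many seconds or hours a model can ingest. The snippet below is a minimal sketch of two recurring patterns, spatial pooling within a frame and merging adjacent frames; the shapes, function names, and reduction factors are illustrative assumptions, not the implementation of any model in the table.

```python
# Minimal sketch of two recurring connector patterns (hypothetical shapes/factors):
# (1) spatial pooling of per-frame patch tokens, (2) merging adjacent frames.
import torch
import torch.nn.functional as F

def spatial_pool(frame_tokens: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """(T, N, D) patch tokens per frame -> (T, N/factor^2, D) via average pooling."""
    t, n, d = frame_tokens.shape
    h = w = int(n ** 0.5)                                      # assume a square patch grid
    x = frame_tokens.reshape(t, h, w, d).permute(0, 3, 1, 2)   # (T, D, H, W)
    x = F.avg_pool2d(x, kernel_size=factor)                    # pool spatially
    return x.flatten(2).permute(0, 2, 1)                       # (T, N', D)

def merge_adjacent_frames(frame_tokens: torch.Tensor, group: int = 2) -> torch.Tensor:
    """Average every `group` consecutive frames: (T, N, D) -> (T/group, N, D)."""
    t, n, d = frame_tokens.shape
    t_trim = (t // group) * group                              # drop any leftover frames
    x = frame_tokens[:t_trim].reshape(t_trim // group, group, n, d)
    return x.mean(dim=1)

# Example: 32 frames, 24x24 = 576 patch tokens per frame, 1024-dim features.
tokens = torch.randn(32, 576, 1024)
tokens = spatial_pool(tokens, factor=2)            # (32, 144, 1024)
tokens = merge_adjacent_frames(tokens, group=2)    # (16, 144, 1024)
visual_sequence = tokens.flatten(0, 1)             # (2304, 1024), fed to the projector/LLM
```

Under these assumed settings, the 32 sampled frames shrink from 18,432 visual tokens to 2,304 before projection into the LLM, which is the kind of reduction the Frame and Token columns summarize.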

Long video understanding benchmarks

  1. Video-MME: A popular video understanding benchmark covering short, medium, and long videos, with 900 videos and 2,700 annotations; the average duration is 17.0 minutes. Project, GitHub, Dataset, Paper

  2. HourVideo: An hour-level video understanding benchmark with 500 long videos and 12,976 annotations; the average duration is 45.7 minutes. Project, GitHub, Dataset, Paper

  3. HLV-1K: An hour-level video understanding benchmark with 1,009 long videos and 14,847 annotations; the average duration is 55.0 minutes. Project, GitHub, Dataset, Paper

  4. LVBench: An hour-level video understanding benchmark with 103 long videos and 1,549 annotations; the average duration is 68.4 minutes. Project, GitHub, Dataset, Paper
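
Evaluation on these benchmarks generally reduces to sampling a bounded number of frames from each (often hour-long) video and packing them with the question into a multiple-choice prompt. The sketch below illustrates that flow; the `decord` reader, the 64-frame budget, the prompt layout, and the example file and question are assumptions for illustration, not the official protocol of any of these benchmarks.

```python
# Hedged sketch: uniform frame sampling + a multiple-choice prompt, roughly how
# long-video QA benchmarks are consumed. Frame budget, reader, and prompt layout
# are assumptions, not any benchmark's official protocol.
from decord import VideoReader, cpu  # assumed video decoder; any frame reader works

def sample_frames(video_path: str, num_frames: int = 64):
    """Uniformly sample `num_frames` RGB frames from a (possibly hour-long) video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    step = (len(vr) - 1) / (num_frames - 1)
    indices = [round(i * step) for i in range(num_frames)]
    return vr.get_batch(indices).asnumpy()        # (num_frames, H, W, 3)

def build_mcq_prompt(question: str, options: list) -> str:
    """Pack a question and its candidate answers into one text prompt."""
    lines = [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    return ("Watch the video and answer with the option letter only.\n"
            f"Question: {question}\n" + "\n".join(lines))

# Hypothetical example; real questions and options come from the benchmark annotations.
frames = sample_frames("example_long_video.mp4", num_frames=64)
prompt = build_mcq_prompt(
    "What does the person do after entering the kitchen?",
    ["Washes dishes", "Opens the fridge", "Turns on the stove", "Leaves the room"],
)
# `frames` and `prompt` are then passed to the model's generate/chat interface, and
# accuracy is the fraction of questions where the predicted letter matches the answer.
```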

Performance on long video benchmarks

(Figure: performance comparison on long video benchmarks.)

Performance on common video benchmarks

(Figure: performance comparison on common video benchmarks.)

Citation

If you find our survey useful in your research, please consider citing:

@article{zou2024seconds,
  title={From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding},
  author={Zou, Heqing and Luo, Tianze and Xie, Guiyang and Lv, Fengmao and Wang, Guangcong and Chen, Juanyang and Wang, Zhuochen and Zhang, Hansheng and Zhang, Huaijian and others},
  journal={arXiv preprint arXiv:2409.18938},
  year={2024}
}
