
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

Hongyu Li, Jinyu Chen*, Ziyu Wei*, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu


This repository will provide the details and code for the model, dataset, and benchmark of LLaVA-ST, a multimodal large language model designed for fine-grained spatial-temporal understanding.


📰 News

  • [2025.01.15] 📄 Our paper is now available on arXiv.

📝 Abstract

Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to handle temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. Our innovations include the Language-Aligned Positional Embedding and the Spatial-Temporal Packer. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With the ST-Align dataset, we present a progressive training pipeline that aligns visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce the ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks. Our method achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding.
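
To make the coordinate-alignment challenge above concrete, the sketch below shows one generic way spatial and temporal coordinates are often exposed to a language model: normalize every coordinate to [0, 1] and discretize it into a small fixed vocabulary of position tokens, so the model only needs to align a bounded set of symbols with text. All names and the bin count here are illustrative assumptions; this is not the Language-Aligned Positional Embedding or the tokenization actually used by LLaVA-ST, which are described in the paper.

# Minimal, hypothetical sketch of the coordinate-alignment problem described
# in the abstract: spatial boxes and temporal spans are normalized to [0, 1]
# and discretized into a small fixed vocabulary of position tokens, so the
# language model sees a bounded set of symbols instead of arbitrary numbers.
# This is a generic illustration, NOT the actual LLaVA-ST implementation.

NUM_BINS = 100  # assumed bin count; the real model may differ


def coord_to_token(value: float, prefix: str) -> str:
    """Map a normalized coordinate in [0, 1] to a discrete position token."""
    bin_idx = min(int(value * NUM_BINS), NUM_BINS - 1)
    return f"<{prefix}_{bin_idx}>"


def encode_spatial_temporal(box, span, width, height, duration):
    """Encode a bounding box (pixels) and a temporal span (seconds) as tokens."""
    x1, y1, x2, y2 = box
    t1, t2 = span
    spatial = [
        coord_to_token(x1 / width, "x"), coord_to_token(y1 / height, "y"),
        coord_to_token(x2 / width, "x"), coord_to_token(y2 / height, "y"),
    ]
    temporal = [coord_to_token(t1 / duration, "t"), coord_to_token(t2 / duration, "t")]
    return spatial + temporal


if __name__ == "__main__":
    # A 1280x720 video of 30 s; object box in pixels, event span in seconds.
    tokens = encode_spatial_temporal(
        box=(128, 72, 640, 360), span=(3.0, 12.5),
        width=1280, height=720, duration=30.0,
    )
    print(" ".join(tokens))
    # e.g. <x_10> <y_10> <x_50> <y_50> <t_10> <t_41>

Running the example prints a short token sequence such as <x_10> <y_10> <x_50> <y_50> <t_10> <t_41>, illustrating how a box and a time span collapse into a handful of reusable symbols rather than an open-ended set of coordinate combinations.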

😲 First MLLM with Spatial-Temporal Fine-Grained Understanding Capacity

LLaVA-ST demonstrates strong performance across a wide range of fine-grained multimodal understanding tasks and is the first MLLM capable of handling spatial and temporal fine-grained understanding tasks simultaneously.

📝 Citation

@misc{li2025llavastmultimodallargelanguage,
      title={LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding}, 
      author={Hongyu Li and Jinyu Chen and Ziyu Wei and Shaofei Huang and Tianrui Hui and Jialin Gao and Xiaoming Wei and Si Liu},
      year={2025},
      eprint={2501.08282},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.08282}, 
}
