
OVO-Bench: How Far is Your Video-LLMs from Real-World Online VideO Understanding?

Introduction

🌟 Three distinct problem-solving modes

  • Backward Tracing: trace back to past events to answer the question.
  • Real-Time Visual Perception: understand and respond to events as they unfold at the current timestamp.
  • Forward Active Responding: delay the response until sufficient future information becomes available to answer the question accurately.

💫 Chain-of-Time Thinking Process

OVO-Bench evaluates Video-LLMs' ability to find temporal visual clues from ongoing input, allowing models to wait for sufficient evidence before responding. We term this approach the Video Chain-of-Time thinking process, analogous to Chain-of-Thought reasoning in LLMs.

Figure: Distribution of questions and videos in OVO-Bench.
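As a rough illustration of this chain-of-time behavior, the sketch below shows an online evaluation loop in which a model either answers at the current timestamp or defers until more frames arrive. The OnlineModel class, its answer_or_wait method, and the streaming loop are hypothetical names for illustration only, not part of the OVO-Bench codebase.

```python
# Illustrative sketch of the video chain-of-time process (hypothetical API,
# NOT the OVO-Bench implementation): frames arrive in temporal order and the
# model may defer its answer until it has seen sufficient evidence.
from typing import Iterable, Optional

class OnlineModel:
    """Hypothetical wrapper around an online Video-LLM."""

    def answer_or_wait(self, frames: list, question: str) -> Optional[str]:
        """Return an answer string, or None to wait for more frames."""
        raise NotImplementedError

def chain_of_time_eval(model: OnlineModel,
                       frame_stream: Iterable,
                       question: str) -> str:
    seen = []
    for frame in frame_stream:           # frames arrive as the video unfolds
        seen.append(frame)
        answer = model.answer_or_wait(seen, question)
        if answer is not None:           # model judged the evidence sufficient
            return answer
    # Stream ended without a confident answer.
    return model.answer_or_wait(seen, question) or "unanswerable"
```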

Dataset Statistics

  • 644 videos
  • 3,100 queries
  • 263.42 s average query timestamp

Figure: Distribution of average query timestamps and video duration (in seconds) in OVO-Bench.

Dataset Examples

Figure: Example questions and videos from OVO-Bench.

Evaluation Pipeline

Requirements

The following modules are required for the inference and scoring pipeline.

moviepy==1.0.3
numpy
pillow
tqdm

Or run pip install -r requirements.txt to install all required modules.

Data Preparation

Download the videos and annotations from our huggingface-repo, unzip all files, and place them under the ./data directory.
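If you prefer a scripted download, a minimal sketch using huggingface_hub is shown below. The repo_id is a placeholder, so substitute the repository linked above.

```python
# Minimal download sketch using huggingface_hub.
# The repo_id below is a PLACEHOLDER -- use the repository linked above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<ovo-bench-repo>",  # placeholder: see the link above
    repo_type="dataset",
    local_dir="./data",                # annotations and zipped videos land here
)
# Remember to unzip the downloaded video archives inside ./data afterwards.
```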

Inference and Score

We divide our evaluation pipeline into two parts: inference and scoring. For our released models, run the provided scripts under the ./scripts directory. For example, for Gemini, run:

bash scripts/inference_Gemini.sh

All inference results will be saved under ./results/[MODEL_NAME]. Then run our scoring scripts:

bash scripts/score_Gemini.sh

Scores will be printed to the CLI:

Offline Model: Gemini
Evaluate Backward Tracing...
Task: HLD, Acc: 52.69
Task: ASI, Acc: 75.68
Task: EPM, Acc: 58.59
Backward Avg.: 62.32

Evaluate Real-time Visual Perception...
Task: STU, Acc: 54.49
Task: OJR, Acc: 67.39
Task: ATR, Acc: 80.17
Task: FPD, Acc: 68.32
Task: ACR, Acc: 66.97
Task: OCR, Acc: 87.25
Realtime Avg.: 70.77

Evaluate Forward Active Responding...
Task: REC, Acc: 35.53
Task: SSR, Acc: 74.24
Task: CRR, Acc: 61.67
Forward Avg.: 57.15

Total Avg.: 65.25

To evaluate your own models, inherit the OVOBenchOffline or OVOBenchOnline class in ./utils/OVOBench.py and implement your own inference pipeline. Refer to our provided models under ./models for further details.
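A minimal sketch of such a wrapper is shown below. The constructor signature and the inference hook are assumptions about the base-class interface; check ./utils/OVOBench.py for the actual abstract methods before adapting it.

```python
# Sketch of a custom model wrapper (ASSUMED interface -- verify the actual
# abstract methods in ./utils/OVOBench.py before use).
from utils.OVOBench import OVOBenchOffline

class MyVideoLLM(OVOBenchOffline):
    def __init__(self, args):
        super().__init__(args)
        # Load your model weights and processor here.

    def inference(self, video_file, prompt):
        # Assumed hook: sample frames from video_file, feed them to your
        # Video-LLM together with the prompt, and return the text response.
        return "model response here"  # replace with your model call
```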

License

OVO-Bench is released under the CC BY-NC-SA 4.0 license. By downloading our dataset from our website or other sources, the user agrees to adhere to the terms of CC BY-NC-SA 4.0 and the licenses of the source datasets.

🫥 Experimental Results

Figure: Experimental results of Video-LLMs on OVO-Bench.

📍 Citing OVO-Bench

@misc{li2025ovobenchfarvideollmsrealworld,
      title={OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?}, 
      author={Yifei Li and Junbo Niu and Ziyang Miao and Chunjiang Ge and Yuanhang Zhou and Qihao He and Xiaoyi Dong and Haodong Duan and Shuangrui Ding and Rui Qian and Pan Zhang and Yuhang Zang and Yuhang Cao and Conghui He and Jiaqi Wang},
      year={2025},
      eprint={2501.05510},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.05510}, 
}