B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens


Installation

git clone https://github.com/zhuqiangLu/B-VLLM.git
cd B-VLLM
conda create -n bvllm python=3.10
conda activate bvllm
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
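
As a quick sanity check (assuming the bvllm environment is activated), you can verify that PyTorch and FlashAttention import correctly:

python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__)"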

Data Preparation

We use the Video-LLaVA dataset to train our model and follow the training recipe provided by VideoLLaMA2. Once the dataset is downloaded, organize it as follows:

B-VLLM
├── datasets
│   ├── videollava_pt
│   │   ├── llava_image/
│   │   ├── valley/
│   │   └── valley_llavaimage.json # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view, including 703K video-text and 558K image-text pairs
│   ├── videollava_sft
│   │   ├── llava_image_tune/
│   │   ├── videochatgpt_tune/
│   │   └── videochatgpt_llavaimage_tune.json # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view, including 100K video-centric, 625K image-centric and 40K text-only conversations
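
Below is a minimal shell sketch for creating this layout; the folder names are taken from the tree above, and the two annotation JSON files plus the image/video folders come from the downloaded Video-LLaVA dataset:

mkdir -p datasets/videollava_pt/llava_image datasets/videollava_pt/valley
mkdir -p datasets/videollava_sft/llava_image_tune datasets/videollava_sft/videochatgpt_tune
# Move the downloaded annotation files into place, e.g.:
#   datasets/videollava_pt/valley_llavaimage.json
#   datasets/videollava_sft/videochatgpt_llavaimage_tune.json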

Training

Make sure you update ARG_NPROC_PER_NODE to match the number of available GPUs before running the scripts below.

bash scripts/vllava/pretrain.sh
bash scripts/vllava/finetune.sh
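
To check how many GPUs are visible before editing ARG_NPROC_PER_NODE in scripts/vllava/pretrain.sh and scripts/vllava/finetune.sh, you can run the following (a sketch; where exactly the variable is set depends on the scripts):

nvidia-smi --list-gpus | wc -l   # set ARG_NPROC_PER_NODE to this value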

Evaluation

Coming Soon

Acknowledgement

This repo is built upon VideoLLaMA2 and LLaMA-VID.

Cite

@article{lu2024b,
  title={B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens},
  author={Lu, Zhuqiang and Yin, Zhenfei and He, Mengwei and Wang, Zhihui and Liu, Zicheng and Wang, Zhiyong and Hu, Kun},
  journal={arXiv preprint arXiv:2412.09919},
  year={2024}
}
