git clone https://github.com/zhuqiangLu/B-VLLM.git
cd B-VLLM
conda create -n bvllm python==3.10
conda activate bvllm
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
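Optionally, you can verify the environment before moving on. The snippet below is a minimal sketch (the file name `sanity_check.py` is just an illustration, and it assumes the `bvllm` environment is active) that checks that PyTorch sees a CUDA device and that flash-attn imports cleanly:

```python
# sanity_check.py -- hypothetical helper, not part of the repo
import torch       # installed via requirements.txt
import flash_attn  # installed via the flash-attn wheel above

# Training requires a CUDA device, so this should print True on a usable machine.
print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```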
Here, we use the Video-LLaVA dataset to train our model, following the training recipe provided by VideoLLaMA2. Once the dataset is downloaded, organize it as follows:
B-VLLM
├── datasets
│   ├── videollava_pt
│   │   ├── llava_image/
│   │   ├── valley/
│   │   └── valley_llavaimage.json  # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view, including 703K video-text and 558K image-text pairs
│   ├── videollava_sft
│   │   ├── llava_image_tune/
│   │   ├── videochatgpt_tune/
│   │   └── videochatgpt_llavaimage_tune.json  # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view, including 100K video-centric, 625K image-centric and 40K text-only conversations
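To confirm that your local copy matches this layout before launching training, a quick check can help. The sketch below is an assumption built directly from the tree above (the file name `check_datasets.py` and the hard-coded paths are illustrative, not part of the repo):

```python
# check_datasets.py -- hypothetical helper; paths mirror the tree above
from pathlib import Path

ROOT = Path("datasets")
expected = [
    ROOT / "videollava_pt" / "llava_image",
    ROOT / "videollava_pt" / "valley",
    ROOT / "videollava_pt" / "valley_llavaimage.json",
    ROOT / "videollava_sft" / "llava_image_tune",
    ROOT / "videollava_sft" / "videochatgpt_tune",
    ROOT / "videollava_sft" / "videochatgpt_llavaimage_tune.json",
]

# Report anything missing so broken downloads are caught before training starts.
missing = [str(p) for p in expected if not p.exists()]
if missing:
    print("Missing entries:\n  " + "\n  ".join(missing))
else:
    print("Dataset layout looks complete.")
```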
Make sure you update ARG_NPROC_PER_NODE in the scripts below to match the number of available GPUs before running them.
bash scripts/vllava/pretrain.sh
bash scripts/vllava/finetune.sh
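If you are unsure which value to use for ARG_NPROC_PER_NODE, the snippet below prints the number of CUDA devices visible on the current node, which is the per-node GPU count the note above refers to (using it this way is an assumption based on that note):

```python
# Print the number of visible CUDA devices; use this value for ARG_NPROC_PER_NODE.
import torch

print(torch.cuda.device_count())
```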
Coming Soon
This repo is built upon VideoLLaMA2 and LLaMA-VID.
@article{lu2024b,
title={B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens},
author={Lu, Zhuqiang and Yin, Zhenfei and He, Mengwei and Wang, Zhihui and Liu, Zicheng and Wang, Zhiyong and Hu, Kun},
journal={arXiv preprint arXiv:2412.09919},
year={2024}
}