git clone https://github.com/zhuqiangLu/B-VLLM.git
cd B-VLLM
conda create -n bvllm python==3.10
conda activate bvllm
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
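Optionally, you can verify the environment before moving on. The snippet below is a minimal sketch (the file name `sanity_check.py` is just an illustration, and it assumes the `bvllm` environment is active) that checks that PyTorch sees a CUDA device and that flash-attn imports cleanly:

```python
# sanity_check.py -- hypothetical helper, not part of the repo
import torch       # installed via requirements.txt
import flash_attn  # installed via the flash-attn wheel above

# Training requires a CUDA device, so this should print True on a usable machine.
print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```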
Here, we use the Video-LLaVA dataset to train our model, following the training recipe provided by VideoLLaMA2. Once the dataset is downloaded, organize it as follows:
B-VLLM
├── datasets
│   ├── videollava_pt
│   │   ├── llava_image/
│   │   ├── valley/
│   │   └── valley_llavaimage.json  # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view, including 703K video-text and 558K image-text pairs
│   ├── videollava_sft
│   │   ├── llava_image_tune/
│   │   ├── videochatgpt_tune/
│   │   └── videochatgpt_llavaimage_tune.json  # Available at: https://drive.google.com/file/d/1zGRyVSUMoczGq6cjQFmT0prH67bu2wXD/view, including 100K video-centric, 625K image-centric and 40K text-only conversations
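To confirm that your local copy matches this layout before launching training, a quick check can help. The sketch below is an assumption built directly from the tree above (the file name `check_datasets.py` and the hard-coded paths are illustrative, not part of the repo):

```python
# check_datasets.py -- hypothetical helper; paths mirror the tree above
from pathlib import Path

ROOT = Path("datasets")
expected = [
    ROOT / "videollava_pt" / "llava_image",
    ROOT / "videollava_pt" / "valley",
    ROOT / "videollava_pt" / "valley_llavaimage.json",
    ROOT / "videollava_sft" / "llava_image_tune",
    ROOT / "videollava_sft" / "videochatgpt_tune",
    ROOT / "videollava_sft" / "videochatgpt_llavaimage_tune.json",
]

# Report anything missing so broken downloads are caught before training starts.
missing = [str(p) for p in expected if not p.exists()]
if missing:
    print("Missing entries:\n  " + "\n  ".join(missing))
else:
    print("Dataset layout looks complete.")
```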
Make sure you update ARG_NPROC_PER_NODE in the scripts below to match the number of available GPUs before running them.
bash scripts/vllava/pretrain.sh
bash scripts/vllava/finetune.sh
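If you are unsure which value to use for ARG_NPROC_PER_NODE, the snippet below prints the number of CUDA devices visible on the current node, which is the per-node GPU count the note above refers to (using it this way is an assumption based on that note):

```python
# Print the number of visible CUDA devices; use this value for ARG_NPROC_PER_NODE.
import torch

print(torch.cuda.device_count())
```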
Coming Soon
This repo is built upon VideoLLaMA2 and LLaMA-VID.
@article{lu2024b,
title={B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens},
author={Lu, Zhuqiang and Yin, Zhenfei and He, Mengwei and Wang, Zhihui and Liu, Zicheng and Wang, Zhiyong and Hu, Kun},
journal={arXiv preprint arXiv:2412.09919},
year={2024}
}