Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
Project page (https://groma-mllm.github.io)
Clone the repository
git clone https://github.com/FoundationVision/Groma.git
cd Groma
Create the conda environment and install dependencies
conda create -n groma python=3.9 -y
conda activate groma
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install --upgrade pip # enable PEP 660 support
pip install -e .
cd mmcv
MMCV_WITH_OPS=1 pip install -e .
cd ..
Install flash-attention for training
pip install ninja
pip install flash-attn --no-build-isolation
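After installation, a quick check can confirm that the expected packages are visible to the interpreter in the `groma` environment. The following is a minimal sketch, not part of the Groma repository. The import names (`torch`, `torchvision`, `torchaudio`, `mmcv`, `flash_attn`) are assumptions based on the packages installed above, and the helper `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above.
REQUIRED = ("torch", "torchvision", "torchaudio", "mmcv")
# flash-attn is only needed for training.
TRAINING_ONLY = ("flash_attn",)


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing (required):", missing_modules(REQUIRED))
    print("missing (training only):", missing_modules(TRAINING_ONLY))
```

An empty list for both groups suggests the environment is set up. This only checks that the modules can be located, not that CUDA ops were built correctly.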
To try Groma, download the model weights from Hugging Face.
We additionally provide pretrained checkpoints from intermediate training stages. You can start from any point to customize training.
Training stage | Required checkpoints |
---|---|
Detection pretraining | DINOv2-L |
Alignment pretraining | Vicuna-7b-v1.5, Groma-det-pretrain |
Instruction finetuning | Groma-7b-pretrain |
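If you start from an intermediate stage, it can help to check that the checkpoints listed in the table are in place before launching a run. The sketch below copies the stage-to-checkpoint mapping from the table. The directory layout (one folder per checkpoint, named after it, under a common root) and the helper `missing_checkpoints` are hypothetical and not something the repository prescribes.

```python
from pathlib import Path

# Stage -> required checkpoints, copied from the table above.
REQUIRED_CHECKPOINTS = {
    "Detection pretraining": ["DINOv2-L"],
    "Alignment pretraining": ["Vicuna-7b-v1.5", "Groma-det-pretrain"],
    "Instruction finetuning": ["Groma-7b-pretrain"],
}


def missing_checkpoints(stage, ckpt_root):
    """List checkpoints required by `stage` that have no folder under `ckpt_root`.

    Assumption: each checkpoint is stored in a folder named after it.
    """
    root = Path(ckpt_root)
    return [name for name in REQUIRED_CHECKPOINTS[stage]
            if not (root / name).is_dir()]
```

For example, `missing_checkpoints("Alignment pretraining", "checkpoints")` returns the names that still need to be downloaded, where `checkpoints` is a placeholder path.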
We provide instructions to download datasets used at different training stages of Groma, including Groma Instruct, a 30k visually grounded conversation dataset constructed with GPT-4V. You don't have to download all of them unless you want to train Groma from scratch. Please follow instructions in DATA.md to prepare datasets.
Training stage | Data types | Datasets |
---|---|---|
Detection pretraining | Detection | COCO, Objects365, OpenImages, V3Det, SA1B |
Alignment pretraining | Image caption | ShareGPT-4V-PT |
Alignment pretraining | Grounded caption | Flickr30k Entities |
Alignment pretraining | Region caption | Visual Genome, RefCOCOg |
Alignment pretraining | REC | COCO, RefCOCO/g/+, Grit-20m |
Instruction finetuning | Grounded caption | Flickr30k Entities |
Instruction finetuning | Region caption | Visual Genome, RefCOCOg |
Instruction finetuning | REC | COCO, RefCOCO/g/+ |
Instruction finetuning | Instruction following | Groma Instruct, LLaVA Instruct, ShareGPT-4V |
For detection pretraining, please run
bash scripts/det_pretrain.sh {path_to_dinov2_ckpt} {output_dir}
For alignment pretraining, please run
bash scripts/vl_pretrain.sh {path_to_vicuna_ckpt} {path_to_groma_det_pretrain_ckpt} {output_dir}
For instruction finetuning, please run
bash scripts/vl_finetune.sh {path_to_groma_7b_pretrain_ckpt} {output_dir}
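The three scripts above form a chain: the alignment stage needs the detection-pretrained checkpoint, and the finetuning stage needs the alignment-pretrained one. The sketch below only builds the three command lines in order so the dependency is explicit. It assumes that each stage's `{output_dir}` can be passed directly as the checkpoint path of the next stage, which the README does not confirm. The function `training_commands` and the sub-directory names are hypothetical.

```python
import shlex


def training_commands(dinov2_ckpt, vicuna_ckpt, work_dir):
    """Build the argument lists for the three training stages, in order.

    Assumption: the output directory of one stage is a valid checkpoint
    path for the next stage.
    """
    det_out = f"{work_dir}/det_pretrain"        # hypothetical folder name
    pretrain_out = f"{work_dir}/vl_pretrain"    # hypothetical folder name
    finetune_out = f"{work_dir}/vl_finetune"    # hypothetical folder name
    return [
        ["bash", "scripts/det_pretrain.sh", dinov2_ckpt, det_out],
        ["bash", "scripts/vl_pretrain.sh", vicuna_ckpt, det_out, pretrain_out],
        ["bash", "scripts/vl_finetune.sh", pretrain_out, finetune_out],
    ]


if __name__ == "__main__":
    # Placeholder paths; print the commands instead of running them.
    for cmd in training_commands("ckpt/dinov2", "ckpt/vicuna", "output"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to inspect the paths before committing to a long run. To execute them, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root.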
To test on a single image, you can run
python -m llava.eval.run_groma \
--model-name {path_to_groma_7b_finetune} \
--image-file {path_to_img} \
--query {user_query}
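When calling the command above from a script, passing the arguments as a list avoids shell-quoting problems with queries that contain spaces or quotes. The following is a minimal sketch that uses only the three flags shown above. The helper `groma_inference_command` and the example paths are hypothetical.

```python
import sys


def groma_inference_command(model_path, image_path, query):
    """Build the argument list for single-image inference.

    Uses only the flags shown in the command above; the query is passed
    as one argument, so no shell quoting is needed.
    """
    return [
        sys.executable, "-m", "llava.eval.run_groma",
        "--model-name", model_path,
        "--image-file", image_path,
        "--query", query,
    ]


if __name__ == "__main__":
    # Placeholder paths and query; printed rather than executed.
    cmd = groma_inference_command(
        "checkpoints/groma-7b-finetune",
        "examples/image.jpg",
        "Describe the image and ground the objects you mention.",
    )
    print(cmd)
```

To run inference, pass the list to `subprocess.run(cmd, check=True)` from the repository root with the `groma` environment active.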
For evaluation, please refer to EVAL.md for more details.
Groma is built upon the awesome works LLaVA and GPT4ROI.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If our work assists your research, feel free to give us a star ⭐ or cite us using:
@misc{Groma,
title={Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models},
author={Chuofan Ma and Yi Jiang and Jiannan Wu and Zehuan Yuan and Xiaojuan Qi},
year={2024},
eprint={2404.13013},
archivePrefix={arXiv},
primaryClass={cs.CV}
}