GitHub - iSEE-Laboratory/LLMDet: Official repository of paper "LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models"

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

This is the official PyTorch implementation of LLMDet.

1 Introduction

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model, by generating image-level detailed captions for each image, can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and shows superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits.

2 Model Zoo

Model	AP^mini	AP_r^mini	AP_c^mini	AP_f^mini	AP^val	AP_r^val	AP_c^val	AP_f^val
LLMDet Swin-T only p5	44.5	38.6	39.3	50.3	34.6	25.5	29.9	43.8
LLMDet Swin-T	44.7	37.3	39.5	50.7	34.9	26.0	30.1	44.3
LLMDet Swin-B	48.3	40.8	43.1	54.3	38.5	28.2	34.3	47.8
LLMDet Swin-L	51.1	45.1	46.1	56.6	42.0	31.6	38.8	50.2
LLMDet Swin-L (chunk size 80)	52.4	44.3	48.8	57.1	43.2	32.8	40.5	50.8

NOTE:

- AP^mini: evaluated on LVIS minival.
- AP^val: evaluated on LVIS val 1.0.
- AP is fixed AP.
- All checkpoints and logs can be found on huggingface and modelscope.
- Other benchmarks are tested using LLMDet Swin-T only p5.

3 Our Experiment Environment

Note: other environments may also work.

pytorch==2.2.1+cu121
transformers==4.37.2
numpy==1.22.2 (numpy should be lower than 1.24; we recommend numpy==1.23 or 1.22)
mmcv==2.2.0, mmengine==0.10.5
timm, deepspeed, pycocotools, lvis, jsonlines, fairscale, nltk, peft, wandb
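
To compare a local environment against the versions listed above, a small check along the following lines may help. It is only a sketch: the distribution names (for example `torch` for PyTorch) and the helper names `version_tuple` and `check_environment` are assumptions for illustration and are not part of this repository. Since other environments may also work, a reported mismatch is a hint, not an error.

```python
import re
from importlib import metadata

# Versions copied from the list above. The distribution names are assumptions.
PINNED = {
    "torch": "2.2.1",
    "transformers": "4.37.2",
    "mmcv": "2.2.0",
    "mmengine": "0.10.5",
}
# The list above asks for numpy lower than 1.24.
NUMPY_UPPER_BOUND = (1, 24)


def version_tuple(version):
    """Turn '2.2.1+cu121' into (2, 2, 1), ignoring local build tags."""
    core = version.split("+")[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment(installed=None):
    """Return a list of mismatches; an empty list means everything matches.

    `installed` is an optional {name: version} mapping (hypothetical hook
    for testing); by default the versions are read via importlib.metadata.
    """
    def lookup(name):
        if installed is not None:
            return installed.get(name)
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            return None

    problems = []
    for name, wanted in PINNED.items():
        found = lookup(name)
        if found is None:
            problems.append(f"{name}: not installed (listed version: {wanted})")
        elif version_tuple(found) != version_tuple(wanted):
            problems.append(f"{name}: found {found}, listed version: {wanted}")

    numpy_version = lookup("numpy")
    if numpy_version is None:
        problems.append("numpy: not installed")
    elif version_tuple(numpy_version)[:2] >= NUMPY_UPPER_BOUND:
        problems.append(f"numpy: found {numpy_version}, should be lower than 1.24")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the listed versions"]:
        print(line)
```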

4 Data Preparation

|--huggingface
|  |--bert-base-uncased
|  |--siglip-so400m-patch14-384
|  |--my_llava-onevision-qwen2-0.5b-ov-2
|  |--mm_grounding_dino
|  |  |--grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
|  |  |--grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
|  |  |--grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
|--grounding_data
|  |--coco
|  |  |--annotations
|  |  |  |--instances_train2017_vg_merged6.jsonl
|  |  |  |--instances_val2017.json
|  |  |  |--lvis_v1_minival_inserted_image_name.json
|  |  |  |--lvis_od_val.json
|  |  |--train2017
|  |  |--val2017
|  |--flickr30k_entities
|  |  |--flickr_train_vg7.jsonl
|  |  |--flickr30k_images
|  |--gqa
|  |  |--gqa_train_vg7.jsonl
|  |  |--images
|  |--llava_cap
|  |  |--LLaVA-ReCap-558K_tag_box_vg7.jsonl
|  |  |--images
|  |--v3det
|  |  |--annotations
|  |  |  |--v3det_2023_v1_train_vg7.jsonl
|  |  |--images
|--LLMDet (code)

Pretrained models
- bert-base-uncased and siglip-so400m-patch14-384 are downloaded directly from huggingface.
- To fully reproduce our results, please download my_llava-onevision-qwen2-0.5b-ov-2 from huggingface or modelscope; we slightly fine-tuned it during early exploration. The original llava-onevision-qwen2-0.5b-ov can also reproduce our results, but users then need to pretrain the projector.
- Since LLMDet is fine-tuned from mm_grounding_dino, please download their swin-t, swin-b, and swin-l checkpoints for training.

Grounding data (GroundingCap-1M)
- coco: You can download it from the COCO official website or from opendatalab.
- lvis: LVIS shares the same images with COCO. You can download the minival annotation file from here, and the val 1.0 annotation file from here.
- flickr30k_entities: Flickr30k images.
- gqa: GQA images.
- llava_cap: images.
- v3det: The V3Det dataset can be downloaded from opendatalab.
- Our generated jsonls can be found on huggingface or modelscope.
- For other evaluation datasets, please refer to MM-GDINO.
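
Before launching training, it can be useful to confirm that the downloaded files match the directory layout shown above. The sketch below lists only a subset of that tree and reports which entries are missing. The helper name `missing_paths` is hypothetical, and the default root `..` assumes the script is run from inside the `LLMDet` code directory, next to `huggingface` and `grounding_data`.

```python
import sys
from pathlib import Path

# Relative paths taken from the directory tree above (checkpoint files omitted).
EXPECTED = [
    "huggingface/bert-base-uncased",
    "huggingface/siglip-so400m-patch14-384",
    "huggingface/my_llava-onevision-qwen2-0.5b-ov-2",
    "huggingface/mm_grounding_dino",
    "grounding_data/coco/annotations/instances_train2017_vg_merged6.jsonl",
    "grounding_data/coco/annotations/instances_val2017.json",
    "grounding_data/coco/annotations/lvis_v1_minival_inserted_image_name.json",
    "grounding_data/coco/annotations/lvis_od_val.json",
    "grounding_data/coco/train2017",
    "grounding_data/coco/val2017",
    "grounding_data/flickr30k_entities/flickr_train_vg7.jsonl",
    "grounding_data/flickr30k_entities/flickr30k_images",
    "grounding_data/gqa/gqa_train_vg7.jsonl",
    "grounding_data/gqa/images",
    "grounding_data/llava_cap/LLaVA-ReCap-558K_tag_box_vg7.jsonl",
    "grounding_data/llava_cap/images",
    "grounding_data/v3det/annotations/v3det_2023_v1_train_vg7.jsonl",
    "grounding_data/v3det/images",
]


def missing_paths(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumption: the data root is the parent of the code directory.
    data_root = sys.argv[1] if len(sys.argv) > 1 else ".."
    missing = missing_paths(data_root)
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All listed entries are present.")
```

The check only tests for existence; it does not validate file contents or checksums.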

5 Usage

5.1 Training

bash dist_train.sh configs/grounding_dino_swin_t.py 8 --amp

5.2 Evaluation

bash dist_test.sh configs/grounding_dino_swin_t.py tiny.pth 8
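
For scripted runs, the training and evaluation commands above can be assembled programmatically. The following is a minimal sketch, assuming the numeric argument (`8`) is the number of GPUs, as in MMDetection's `dist_train.sh` and `dist_test.sh` convention; this repository's scripts may differ, so please check them before relying on it. The helper names `train_command` and `eval_command` are hypothetical.

```python
import shlex


def train_command(config, num_gpus, amp=True):
    """Build the dist_train.sh invocation: config, GPU count (assumed), optional --amp."""
    cmd = ["bash", "dist_train.sh", config, str(num_gpus)]
    if amp:
        cmd.append("--amp")
    return cmd


def eval_command(config, checkpoint, num_gpus):
    """Build the dist_test.sh invocation: config, checkpoint, GPU count (assumed)."""
    return ["bash", "dist_test.sh", config, checkpoint, str(num_gpus)]


if __name__ == "__main__":
    # Print the commands instead of running them; pass the list to
    # subprocess.run(cmd, check=True) to actually launch a job.
    print(shlex.join(train_command("configs/grounding_dino_swin_t.py", 8)))
    print(shlex.join(eval_command("configs/grounding_dino_swin_t.py", "tiny.pth", 8)))
```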

6 License

LLMDet is released under the Apache 2.0 license. Please see the LICENSE file for more information.

7 Bibtex

If you find our work helpful for your research, please consider citing our paper.

@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}

8 Acknowledgement

Our LLMDet is heavily inspired by many outstanding prior works, including

We thank the authors of the above projects for open-sourcing their assets!
