
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

This is the official PyTorch implementation of LLMDet.

1 Introduction

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model that generates detailed image-level captions can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, in which each image is paired with its grounding labels and a detailed image-level caption. With this dataset, we fine-tune an open-vocabulary detector using training objectives that include a standard grounding loss and a caption generation loss. We leverage a large language model to generate both short region-level captions for each region of interest and long image-level captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and enjoys superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits.

2 Model Zoo

Model                         | APmini | APr  | APc  | APf  | APval | APr  | APc  | APf
LLMDet Swin-T (only p5)       | 44.5   | 38.6 | 39.3 | 50.3 | 34.6  | 25.5 | 29.9 | 43.8
LLMDet Swin-T                 | 44.7   | 37.3 | 39.5 | 50.7 | 34.9  | 26.0 | 30.1 | 44.3
LLMDet Swin-B                 | 48.3   | 40.8 | 43.1 | 54.3 | 38.5  | 28.2 | 34.3 | 47.8
LLMDet Swin-L                 | 51.1   | 45.1 | 46.1 | 56.6 | 42.0  | 31.6 | 38.8 | 50.2
LLMDet Swin-L (chunk size 80) | 52.4   | 44.3 | 48.8 | 57.1 | 43.2  | 32.8 | 40.5 | 50.8

NOTE:

  1. APmini: evaluated on LVIS minival.
  2. APval: evaluated on LVIS val 1.0.
  3. AP is fixed AP.
  4. All the checkpoints and logs can be found on huggingface and modelscope.
  5. Other benchmarks are tested using LLMDet Swin-T (only p5).
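
As a convenience, checkpoints hosted on the Hugging Face Hub can typically be pulled with the huggingface-cli tool. This is only a sketch: the repository id below is a placeholder, so substitute the actual one behind the huggingface link above.

# sketch: replace <llmdet-checkpoint-repo> with the repository id from the huggingface link above
pip install -U "huggingface_hub[cli]"
huggingface-cli download <llmdet-checkpoint-repo> --local-dir ./checkpoints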

3 Our Experiment Environment

Note: other environments may also work.

  • pytorch==2.2.1+cu121
  • transformers==4.37.2
  • numpy==1.22.2 (numpy must be lower than 1.24; 1.22 or 1.23 is recommended)
  • mmcv==2.2.0, mmengine==0.10.5
  • timm, deepspeed, pycocotools, lvis, jsonlines, fairscale, nltk, peft, wandb
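
For reference, one possible way to install the pinned packages above with pip is shown below. This is a sketch rather than the authors' official recipe; it assumes the CUDA 12.1 wheels for PyTorch and uses openmim to fetch a matching mmcv binary.

# PyTorch 2.2.1 built against CUDA 12.1 (torchvision 0.17.1 is the matching release)
pip install torch==2.2.1 torchvision==0.17.1 --index-url https://download.pytorch.org/whl/cu121
# core dependencies pinned to the versions listed above (numpy kept below 1.24)
pip install transformers==4.37.2 numpy==1.23.5 mmengine==0.10.5
# openmim resolves a prebuilt mmcv wheel for this torch/cuda combination
pip install -U openmim
mim install mmcv==2.2.0
# remaining utilities
pip install timm deepspeed pycocotools lvis jsonlines fairscale nltk peft wandb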

4 Data Preparation

|--huggingface
|  |--bert-base-uncased
|  |--siglip-so400m-patch14-384
|  |--my_llava-onevision-qwen2-0.5b-ov-2
|  |--mm_grounding_dino
|  |  |--grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
|  |  |--grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
|  |  |--grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
|--grounding_data
|  |--coco
|  |  |--annotations
|  |  |  |--instances_train2017_vg_merged6.jsonl
|  |  |  |--instances_val2017.json
|  |  |  |--lvis_v1_minival_inserted_image_name.json
|  |  |  |--lvis_od_val.json
|  |  |--train2017
|  |  |--val2017
|  |--flickr30k_entities
|  |  |--flickr_train_vg7.jsonl
|  |  |--flickr30k_images
|  |--gqa
|  |  |--gqa_train_vg7.jsonl
|  |  |--images
|  |--llava_cap
|  |  |--LLaVA-ReCap-558K_tag_box_vg7.jsonl
|  |  |--images
|  |--v3det
|  |  |--annotations
|  |  |  |--v3det_2023_v1_train_vg7.jsonl
|  |  |--images
|--LLMDet (code)
  • pretrained models
    • bert-base-uncased and siglip-so400m-patch14-384 can be downloaded directly from huggingface.
    • To fully reproduce our results, please download my_llava-onevision-qwen2-0.5b-ov-2 from huggingface or modelscope, which we slightly fine-tuned during early exploration. The original llava-onevision-qwen2-0.5b-ov can also reproduce our results, but users then need to pretrain the projector themselves.
    • Since LLMDet is fine-tuned from mm_grounding_dino, please download their checkpoints swin-t, swin-b, and swin-l for training.
  • grounding data (GroundingCap-1M)
    • coco: You can download it from the COCO official website or from opendatalab.
    • lvis: LVIS shares the same images with COCO. You can download the minival annotation file from here, and the val 1.0 annotation file from here.
    • flickr30k_entities: Flickr30k images.
    • gqa: GQA images.
    • llava_cap: the corresponding images.
    • v3det: The V3Det dataset can be downloaded from opendatalab.
    • Our generated jsonls can be found on huggingface or modelscope.
    • For other evaluation datasets, please refer to MM-GDINO.
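
A quick way to sanity-check the data preparation is to list the files the directory tree above expects. This is only a sketch; run it from the directory that contains huggingface/ and grounding_data/.

# confirm the pretrained detector weights are in place
ls huggingface/mm_grounding_dino/*.pth
# confirm the GroundingCap-1M annotation jsonls are in place
ls grounding_data/coco/annotations/instances_train2017_vg_merged6.jsonl
ls grounding_data/flickr30k_entities/flickr_train_vg7.jsonl
ls grounding_data/gqa/gqa_train_vg7.jsonl
ls grounding_data/llava_cap/LLaVA-ReCap-558K_tag_box_vg7.jsonl
ls grounding_data/v3det/annotations/v3det_2023_v1_train_vg7.jsonl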

5 Usage

5.1 Training

bash dist_train.sh configs/grounding_dino_swin_t.py 8 --amp
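
The launcher appears to follow the usual MMDetection convention of bash dist_train.sh CONFIG NUM_GPUS [extra args], with extra arguments forwarded to the training script. Assuming that convention holds, common variations look like the sketch below.

# train on 4 GPUs with mixed precision and an explicit work directory (assumes MMDetection-style arguments)
bash dist_train.sh configs/grounding_dino_swin_t.py 4 --amp --work-dir work_dirs/llmdet_swin_t
# resume training from the latest checkpoint in the work directory
bash dist_train.sh configs/grounding_dino_swin_t.py 4 --amp --resume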

5.2 Evaluation

bash dist_test.sh configs/grounding_dino_swin_t.py tiny.pth 8
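
Other checkpoints from the model zoo are evaluated the same way by pairing the matching config with the downloaded weights. The Swin-B config and checkpoint names below are illustrative placeholders, not the exact file names.

# evaluate a Swin-B checkpoint on 8 GPUs (illustrative config/checkpoint names)
bash dist_test.sh configs/grounding_dino_swin_b.py base.pth 8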

6 License

LLMDet is released under the Apache 2.0 license. Please see the LICENSE file for more information.

7 Bibtex

If you find our work helpful for your research, please consider citing our paper.

@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}

8 Acknowledgement

Our LLMDet is heavily inspired by many outstanding prior works, including MM-Grounding-DINO and LLaVA-OneVision.

Thank the authors of the above projects for open-sourcing their assets!
