Hongxi Li,
Yuyang Chen,
Yayun Qi,
Xinxiao Wu,
Beijing Institute of Technology
arXiv 2024
🌎Website |
📚Dataset |
📄arXiv (Coming soon) |
🏆 Leaderboard
├── models: checkpoint files
├── datasets
│   ├── images
│   └── annotation.xlsx
├── database
├── results
│   ├── image_caption
│   ├── question_answer
│   ├── image_identification
│   └── image_explanation
├── baseline
│   ├── LLaVA: LLaVA official project
│   ├── mPLUG-Owl: mPLUG-Owl official project
│   ├── mPLUG-Owl2: mPLUG-Owl2 official project
│   ├── Otter: Otter official project
│   ├── openflamingo: OpenFlamingo official project
│   ├── MIC: MMICL official project
│   ├── llama: LLaMA official project
│   ├── FastChat: Vicuna official project
│   ├── demo
│   ├── infer_image_caption.py
│   ├── infer_question_answer.py
│   ├── infer_image_identification.py
│   ├── infer_image_explanation.py
│   └── pipeline.py
├── ours
│   ├── GLIP: GLIP official project
│   ├── database_construct.py
│   ├── retrieval_augment_generation.py
│   └── object_detection.py
├── tools
│   ├── generate_vqa.py
│   ├── preprocess.py
│   └── download.py
├── evaluate
│   ├── eval_image_caption.py
│   ├── eval_question_answer.py
│   ├── eval_identification.py
│   ├── eval_pipeline.py
│   └── eval_image_explanation.py
├── result
│   ├── image_caption: table-1
│   ├── question_answer: table-1
│   ├── pipeline_identification: table-2
│   ├── pipeline_explanation: table-2
│   ├── image_identification: table-3
│   └── image_explanation: table-3
└── readme.md
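Once the dataset has been downloaded (see below), the annotations can be inspected directly with pandas. This is only a minimal sketch; the column names used here (`image`, `label`, `knowledge_background`) are hypothetical placeholders, not the actual schema of `annotation.xlsx`.

```python
# Minimal sketch for browsing the annotations. The column names below are
# hypothetical placeholders; inspect the printed schema before relying on them.
import os
import pandas as pd

ann = pd.read_excel("datasets/annotation.xlsx")
print(ann.columns.tolist())  # check the real column names first

for _, row in ann.head(5).iterrows():
    image_path = os.path.join("datasets/images", str(row.get("image")))
    print(image_path, row.get("label"), row.get("knowledge_background"))
```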
export HF_ENDPOINT=https://hf-mirror.com
cd main
python download.py
python preprocess.py -task caption
python preprocess.py -task explanation
python preprocess.py -task vqa
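If you prefer to fetch a single checkpoint manually rather than through `download.py`, the same mirror endpoint can be set from Python. The sketch below uses `huggingface_hub`, and the repo id shown is only an example.

```python
# Optional: pull one checkpoint through the HF mirror by hand.
# The repo id is only an example; download.py handles the full model set.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Salesforce/blip-image-captioning-base",
    local_dir="./models/blip-image-captioning-base",
)
```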
For the BLIP series, mPLUG-Owl series, LLaVA series, etc., follow the environment setup of the corresponding official GitHub project (see `baseline/`), then activate the matching conda environment:
conda activate blip
conda activate mplug_owl
conda activate mplug_owl2
conda activate llava
conda activate llama
conda activate vicuna
conda activate cfr
[Note] The test inputs differ slightly across the four tasks (a sample-selection sketch follows this list):
- Image Captioning and Visual Question Answering are tested on all images, both positive and negative samples, under the zero-shot setting only.
- Image Identification is tested on all images. In the few-shot setting, the model reads, in addition to the test image, 2 random samples (positive or negative) drawn from the same knowledge background as the test image; in the CoCoT setting, it additionally reads the 1 corresponding opposite sample.
- Image Explanation is tested on negative samples only. The few-shot and CoCoT settings provide the same extra inputs as for Image Identification: 2 random same-background samples, or the 1 corresponding opposite (i.e., positive) sample.
- GPT-4V is an exception: its Image Identification results are obtained by post-processing its Image Explanation outputs, so its Image Identification is also evaluated on negative samples only.
- In the pipeline method, the two few-shot samples are drawn from the entire dataset and may be positive, negative, or from other knowledge backgrounds.
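The sketch below only illustrates the sample-selection rules listed above; the record fields `background`, `label`, and `pair_id` are hypothetical placeholders, not the actual annotation schema.

```python
# Illustration of the in-context sample-selection rules described above.
# The record fields (background, label, pair_id) are hypothetical placeholders.
import random

def few_shot_samples(test_item, dataset, k=2):
    """Pick k random samples sharing the test image's knowledge background
    (they may be positive or negative), excluding the test image itself."""
    pool = [x for x in dataset
            if x["background"] == test_item["background"] and x is not test_item]
    return random.sample(pool, k)

def cocot_sample(test_item, dataset):
    """Pick the single opposite-label counterpart of the test image
    (for Image Explanation this is the corresponding positive sample)."""
    return next(x for x in dataset
                if x["pair_id"] == test_item["pair_id"]
                and x["label"] != test_item["label"])
```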
- image caption inference
python infer_image_caption.py -model BLIP-Base
- VQA inference
python infer_question_answer.py -model BLIP-Base
- image identification inference
python infer_image_identification.py -model BLIP2-XL -setting z
- image explanation inference
python infer_image_explanation.py -model BLIP2-XL -setting z
- pipeline method inference
python pipeline.py -model LLaMA-2-7B -setting z -withCoT n
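For reference, the following is a minimal zero-shot sketch of what a captioning run does, assuming the Hugging Face `transformers` BLIP classes and the BLIP-Base checkpoint from the table below; the image file name is a placeholder, and `infer_image_caption.py` presumably wraps a loop of this kind over the whole dataset and writes the outputs under `results/`.

```python
# Minimal zero-shot captioning sketch with the BLIP-Base checkpoint listed below.
# The image file name is a placeholder.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "./models/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt).to("cuda").eval()

image = Image.open("datasets/images/example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```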
baseline models for image caption:
model | checkpoint file |
---|---|
BLIP-Base | ./models/blip-image-captioning-base |
BLIP2-XL | ./models/blip2-flan-t5-xl |
BLIP2-XXL | ./models/blip2-flan-t5-xxl |
InstructBLIP-XL | ./models/instructblip-flan-t5-xl |
InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl |
mPLUG-owl-7B | ./models/mplug-owl-llama-7b |
mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b |
LLaVA-1.5-7B | ./models/llava-v1.5-7b |
LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b |
baseline models for VQA:
model | checkpoint file |
---|---|
BLIP-Base | ./models/blip-vqa-base |
BLIP2-XL | ./models/blip2-flan-t5-xl |
BLIP2-XXL | ./models/blip2-flan-t5-xxl |
InstructBLIP-XL | ./models/instructblip-flan-t5-xl |
InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl |
mPLUG-owl-7B | ./models/mplug-owl-llama-7b |
mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b |
LLaVA-1.5-7B | ./models/llava-v1.5-7b |
LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b |
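Analogously, here is a minimal zero-shot VQA sketch with the `blip-vqa-base` checkpoint from the table above; the image file name and the question are only examples.

```python
# Minimal zero-shot VQA sketch with the blip-vqa-base checkpoint listed above.
# The image file name and the question are placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

ckpt = "./models/blip-vqa-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForQuestionAnswering.from_pretrained(ckpt).to("cuda").eval()

image = Image.open("datasets/images/example.jpg").convert("RGB")
inputs = processor(images=image, text="What is unusual in this image?",
                   return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```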
baseline models for image identification and explanation:
model | checkpoint file | setting |
---|---|---|
BLIP2-XL | ./models/blip2-flan-t5-xl | zero-shot |
BLIP2-XXL | ./models/blip2-flan-t5-xxl | zero-shot |
InstructBLIP-XL | ./models/instructblip-flan-t5-xl | zero-shot |
InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl | zero-shot |
mPLUG-owl-7B | ./models/mplug-owl-llama-7b | zero-shot |
mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b | zero-shot |
LLaVA-1.5-7B | ./models/llava-v1.5-7b | zero-shot |
LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b | zero-shot |
MMICL | ./models/MMICL-Instructblip-T5-xl | few-shot, CoCoT |
OpenFlamingo | ./models/OpenFlamingo-3B-vitl-mpt1b | few-shot, CoCoT |
Otter-7B | ./models/OTTER-Image-LLaMA7B-LA-InContext | few-shot, CoCoT |
GEMINI | coming soon... | few-shot, CoCoT |
GPT-4V | sk-XXXXXXXXXXXXXXXXXXXXX | few-shot, CoCoT |
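For the API-based models, a CoCoT request pairs the test image with its opposite-label counterpart in a single call. The sketch below uses the OpenAI Python client; the prompt wording, file names, and model name are placeholders, not the prompts actually used in the experiments.

```python
# Hedged sketch of a CoCoT-style GPT-4V call: the test image plus its
# opposite-label counterpart in one request. Prompt, file names, and model
# name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(api_key="sk-XXXXXXXXXXXXXXXXXXXXX")

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Compare the two images and decide whether the first one "
                     "conflicts with the relevant knowledge."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64('test.jpg')}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64('opposite.jpg')}"}},
        ],
    }],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```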
LLMs for the pipeline method:
model | checkpoint file |
---|---|
llama-2-7b | ./models/Llama-2-7b-hf |
llama-2-13b | ./models/Llama-2-13b-hf |
vicuna-1.5-7b | ./models/vicuna-7b-v1.5 |
vicuna-1.5-13b | ./models/vicuna-13b-v1.5 |
GPT-3.5 | sk-XXXXXXXXXXXXXXXXXXXXX |
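One plausible reading of the pipeline baseline is a caption-then-reason wiring: the image is first converted to text, and the text-only LLM then makes the judgment (optionally with a CoT-style prompt, cf. the `-withCoT` flag). The sketch below is a hypothetical illustration under that assumption; `caption_image` and the prompt are placeholders, and the actual `pipeline.py` may differ.

```python
# Hypothetical caption-then-reason pipeline sketch with a text-only LLM.
# caption_image() and the prompt are placeholders; pipeline.py may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

def caption_image(image_path):
    # Placeholder: in practice this would call one of the captioning models above.
    return "a dummy caption used only for illustration"

ckpt = "./models/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto").eval()

caption = caption_image("datasets/images/example.jpg")
prompt = (f"Image description: {caption}\n"
          "Question: Does the described scene conflict with the relevant knowledge? "
          "Let's think step by step.")  # CoT-style prompt; wording is a placeholder

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```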
coming soon...
- image caption evaluation
python evaluate/eval_image_caption.py
- VQA evaluation
python evaluate/eval_question_answer.py
- image identification evaluation
python evaluate/eval_identification.py
- image explanation evaluation
python evaluate/eval_image_explanation.py
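For reference, identification results can be scored as plain accuracy over the saved predictions. The snippet below is only a sketch; the result-file name and its `prediction`/`label` columns are hypothetical placeholders, not necessarily the format the inference scripts produce or `eval_identification.py` expects.

```python
# Minimal accuracy sketch for the identification task. The file name and the
# prediction/label columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("results/image_identification/BLIP2-XL_zero-shot.csv")
pred = df["prediction"].str.strip().str.lower().str.startswith("yes")
gold = df["label"].str.strip().str.lower().eq("yes")
print(f"identification accuracy: {(pred == gold).mean():.4f}")
```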