
Challenging and Enhancing the Reasoning Capacity of Multimodal LLMs in Context-violating Images

Hongxi Li, Yuyang Chen, Yayun Qi, Xinxiao Wu
 Beijing Institute of Technology
arXiv 2024
🌎 Website | 📚 Dataset | 📄 arXiv (coming soon) | 🏆 Leaderboard

(Figure: dataset description)

1. Project Structure

```
├── models: checkpoint files
├── datasets
│   ├── images
│   └── annotation.xlsx
├── database
├── results
│   ├── image_caption
│   ├── question_answer
│   ├── image_identification
│   └── image_explanation
├── baseline
│   ├── LLaVA: LLaVA official project
│   ├── mPLUG-Owl: mPLUG-Owl official project
│   ├── mPLUG-Owl2: mPLUG-Owl2 official project
│   ├── Otter: Otter official project
│   ├── openflamingo: OpenFlamingo official project
│   ├── MIC: MMICL official project
│   ├── llama: LLaMA official project
│   ├── FastChat: Vicuna official project
│   ├── demo
│   ├── infer_image_caption.py
│   ├── infer_question_answer.py
│   ├── infer_image_identification.py
│   ├── infer_image_explanation.py
│   └── pipeline.py
├── ours
│   ├── GLIP: GLIP official project
│   ├── database_construct.py
│   ├── retrieval_augment_generation.py
│   └── object_detection.py
├── tools
│   ├── generate_vqa.py
│   ├── preprocess.py
│   └── download.py
├── evaluate
│   ├── eval_image_caption.py
│   ├── eval_question_answer.py
│   ├── eval_identification.py
│   ├── eval_pipeline.py
│   └── eval_image_explanation.py
├── result
│   ├── image_caption: Table 1
│   ├── question_answer: Table 1
│   ├── pipeline_identification: Table 2
│   ├── pipeline_explanation: Table 2
│   ├── image_identification: Table 3
│   └── image_explanation: Table 3
└── readme.md
```

2. Run

(1) checkpoint download

```bash
export HF_ENDPOINT=https://hf-mirror.com
cd main
python download.py
```
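The internals of download.py are not shown in this README; a minimal sketch of the mirrored-download workflow it likely wraps, using huggingface_hub, is below. The repo IDs are assumptions chosen to match the checkpoint tables later in this README.

```python
import os

# HF_ENDPOINT must be set before huggingface_hub is imported,
# since the library reads it at import time.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from huggingface_hub import snapshot_download

# Illustrative subset of the checkpoints listed in the tables below.
CHECKPOINTS = {
    "Salesforce/blip-image-captioning-base": "./models/blip-image-captioning-base",
    "Salesforce/blip2-flan-t5-xl": "./models/blip2-flan-t5-xl",
    "llava-hf/llava-1.5-7b-hf": "./models/llava-v1.5-7b",
}

for repo_id, local_dir in CHECKPOINTS.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    print(f"downloaded {repo_id} -> {local_dir}")
```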

(2) data preprocessing

```bash
python preprocess.py -task caption
python preprocess.py -task explanation
python preprocess.py -task vqa
```
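The internals of preprocess.py are not documented here; a hypothetical sketch of its `-task` dispatch follows (the function bodies are placeholders, not the actual preprocessing logic):

```python
import argparse

def preprocess_caption():
    # Placeholder: build image-caption pairs from datasets/annotation.xlsx.
    print("preprocessing caption data")

def preprocess_explanation():
    # Placeholder: build explanation prompts for negative samples.
    print("preprocessing explanation data")

def preprocess_vqa():
    # Placeholder: generate VQA pairs (see tools/generate_vqa.py).
    print("preprocessing vqa data")

TASKS = {
    "caption": preprocess_caption,
    "explanation": preprocess_explanation,
    "vqa": preprocess_vqa,
}

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-task", choices=sorted(TASKS), required=True)
    args = parser.parse_args()
    TASKS[args.task]()
```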

(3) environment setup

For the BLIP series, mPLUG-Owl series, LLaVA series, and the other baselines, follow the environment configuration of each official GitHub project, then activate the conda environment that matches the model under test (one at a time), e.g.:

```bash
conda activate blip
conda activate mplug_owl
conda activate mplug_owl2
conda activate llava
conda activate llama
conda activate vicuna
conda activate cfr
```

(4) baseline inference

Note: the test inputs for the four tasks differ slightly.

- Image Captioning and Visual Question Answering are tested on all images (both positive and negative samples), in the zero-shot setting only.
- Image Identification is tested on all images. In the few-shot setting, each test image is accompanied by 2 random samples (positive or negative) from the same knowledge background as the test image; in the CoCoT setting, it is accompanied by the 1 corresponding opposite sample.
- Image Explanation is tested on negative samples only. Its few-shot and CoCoT inputs are built the same way as for Image Identification; in CoCoT, the opposite sample is the corresponding positive sample.
- GPT-4V is an exception: its Image Identification results are obtained by post-processing its Image Explanation outputs, so its Image Identification is also tested on negative samples only.
- In the pipeline method, the two few-shot samples are drawn from the entire dataset (they may be positive, negative, or from other knowledge backgrounds); a sketch of this sample selection follows the list below.
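The sample selection described above might look like the following sketch. The field names (`background`, `pair_id`, `is_positive`) are assumptions for illustration; the actual schema lives in `datasets/annotation.xlsx`.

```python
import random

def few_shot_samples(test_item, dataset, k=2):
    # 2 random samples (positive or negative) that share the test
    # image's knowledge background, excluding the test image itself.
    pool = [x for x in dataset
            if x["background"] == test_item["background"] and x is not test_item]
    return random.sample(pool, k)

def cocot_sample(test_item, dataset):
    # The single opposite sample: same image pair, opposite polarity.
    # For Image Explanation this returns the corresponding positive sample.
    for x in dataset:
        if (x["pair_id"] == test_item["pair_id"]
                and x["is_positive"] != test_item["is_positive"]):
            return x
    return None

def pipeline_few_shot_samples(test_item, dataset, k=2):
    # Pipeline method: draw from the entire dataset, any background.
    pool = [x for x in dataset if x is not test_item]
    return random.sample(pool, k)
```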
1. image caption inference

```bash
python infer_image_caption.py -model BLIP-Base
```

2. VQA inference

```bash
python infer_question_answer.py -model BLIP-Base
```

3. image identification inference

```bash
python infer_image_identification.py -model BLIP2-XL -setting z
```

4. image explanation inference

```bash
python infer_image_explanation.py -model BLIP2-XL -setting z
```

5. pipeline method inference

```bash
python pipeline.py -model LLaMA-2-7B -setting z -withCoT n
```
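To run several baselines in sequence, a small driver can shell out to the commands above. The model list below is an illustrative subset, and the reading of `-setting z` as zero-shot follows the examples above rather than documented behavior:

```python
import subprocess

MODELS = ["BLIP2-XL", "BLIP2-XXL", "InstructBLIP-XL", "LLaVA-1.5-7B"]

for model in MODELS:
    # "z" mirrors the zero-shot examples above; other settings would
    # correspond to few-shot / CoCoT runs for the in-context models.
    subprocess.run(
        ["python", "infer_image_identification.py",
         "-model", model, "-setting", "z"],
        check=True,
    )
```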

baseline models for image captioning:

| Model | Checkpoint file |
| --- | --- |
| BLIP-Base | ./models/blip-image-captioning-base |
| BLIP2-XL | ./models/blip2-flan-t5-xl |
| BLIP2-XXL | ./models/blip2-flan-t5-xxl |
| InstructBLIP-XL | ./models/instructblip-flan-t5-xl |
| InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl |
| mPLUG-owl-7B | ./models/mplug-owl-llama-7b |
| mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b |
| LLaVA-1.5-7B | ./models/llava-v1.5-7b |
| LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b |
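Assuming the paths in the table hold standard Hugging Face checkpoints, a single captioning run looks roughly like this (the image path is a placeholder):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

path = "./models/blip-image-captioning-base"  # BLIP-Base row above
processor = BlipProcessor.from_pretrained(path)
model = BlipForConditionalGeneration.from_pretrained(path)

image = Image.open("datasets/images/example.jpg").convert("RGB")  # placeholder
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```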

baseline models for VQA:

| Model | Checkpoint file |
| --- | --- |
| BLIP-Base | ./models/blip-vqa-base |
| BLIP2-XL | ./models/blip2-flan-t5-xl |
| BLIP2-XXL | ./models/blip2-flan-t5-xxl |
| InstructBLIP-XL | ./models/instructblip-flan-t5-xl |
| InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl |
| mPLUG-owl-7B | ./models/mplug-owl-llama-7b |
| mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b |
| LLaVA-1.5-7B | ./models/llava-v1.5-7b |
| LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b |
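The VQA checkpoints load the same way, with the question passed alongside the image. Again, this assumes a standard Hugging Face checkpoint at the tabled path, and the image path and question are placeholders:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

path = "./models/blip-vqa-base"  # BLIP-Base row above
processor = BlipProcessor.from_pretrained(path)
model = BlipForQuestionAnswering.from_pretrained(path)

image = Image.open("datasets/images/example.jpg").convert("RGB")  # placeholder
inputs = processor(images=image, text="What is unusual in this image?",
                   return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```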

baseline models for image identification and explanation:

| Model | Checkpoint file | Setting |
| --- | --- | --- |
| BLIP2-XL | ./models/blip2-flan-t5-xl | zero-shot |
| BLIP2-XXL | ./models/blip2-flan-t5-xxl | zero-shot |
| InstructBLIP-XL | ./models/instructblip-flan-t5-xl | zero-shot |
| InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl | zero-shot |
| mPLUG-owl-7B | ./models/mplug-owl-llama-7b | zero-shot |
| mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b | zero-shot |
| LLaVA-1.5-7B | ./models/llava-v1.5-7b | zero-shot |
| LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b | zero-shot |
| MMICL | ./models/MMICL-Instructblip-T5-xl | few-shot, CoCoT |
| OpenFlamingo | ./models/OpenFlamingo-3B-vitl-mpt1b | few-shot, CoCoT |
| Otter-7B | ./models/OTTER-Image-LLaMA7B-LA-InContext | few-shot, CoCoT |
| GEMINI | coming soon... | few-shot, CoCoT |
| GPT-4V | sk-XXXXXXXXXXXXXXXXXXXXX (API key) | few-shot, CoCoT |
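For the few-shot/CoCoT rows, the in-context input interleaves extra images with the test image. One plausible CoCoT message layout is sketched below; the prompt wording is purely an assumption, not the prompts used in the experiments:

```python
def build_cocot_input(test_image_path, opposite_image_path):
    # Interleave the opposite sample before the test image, as described
    # in the note above; the exact prompt text is an assumption.
    return [
        {"image": opposite_image_path,
         "text": "This image is consistent with its commonsense context."},
        {"image": test_image_path,
         "text": "Compared with the previous image, does this image violate "
                 "its context? Answer yes or no, then explain why."},
    ]
```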

LLMs for the pipeline method:

| Model | Checkpoint file |
| --- | --- |
| llama-2-7b | ./models/Llama-2-7b-hf |
| llama-2-13b | ./models/Llama-2-13b-hf |
| vicuna-1.5-7b | ./models/vicuna-7b-v1.5 |
| vicuna-1.5-13b | ./models/vicuna-13b-v1.5 |
| GPT-3.5 | sk-XXXXXXXXXXXXXXXXXXXXX (API key) |
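The pipeline method pairs a vision model's caption with one of these text-only LLMs. A hypothetical sketch of the prompt construction, including the CoT cue toggled by `-withCoT` above, follows; the wording is an assumption:

```python
def build_pipeline_prompt(caption: str, with_cot: bool) -> str:
    # Hand a vision model's caption to a text-only LLM and ask for a
    # context-violation judgment; optionally append a CoT cue.
    prompt = (
        f"Image description: {caption}\n"
        "Question: Does the described scene violate commonsense context? "
        "Answer yes or no."
    )
    if with_cot:
        prompt += "\nLet's think step by step."
    return prompt

print(build_pipeline_prompt("a penguin walking in a desert", with_cot=True))
```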

(5) our method inference

coming soon...

3. Performance Evaluation

1. image caption evaluation

```bash
python evaluate/eval_image_caption.py
```

2. VQA evaluation

```bash
python evaluate/eval_question_answer.py
```

3. image identification evaluation

```bash
python evaluate/eval_identification.py
```

4. image explanation evaluation

```bash
python evaluate/eval_image_explanation.py
```
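The evaluation scripts' internals aren't described in this README. For the identification task, a plausible sketch is plain yes/no accuracy over the saved predictions; the result-file path and record format are assumptions:

```python
import json

def identification_accuracy(result_file: str) -> float:
    # Assumed format: [{"pred": "yes", "label": "no"}, ...]
    with open(result_file) as f:
        records = json.load(f)
    correct = sum(r["pred"].strip().lower() == r["label"] for r in records)
    return correct / len(records)

# e.g. identification_accuracy("results/image_identification/BLIP2-XL.json")
```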
