Hongxi Li,
Yuyang Chen,
Yayun Qi,
Xinxiao Wu,
Beijing Institute of Technology
arXiv 2024
🌎Website |
📚Dataset |
📄arXiv (Coming soon) |
🏆 Leaderboard
├── models: checkpoint files
├── datasets
│   ├── images
│   └── annotation.xlsx
├── database
├── results
│   ├── image_caption
│   ├── question_answer
│   ├── image_identification
│   └── image_explanation
├── baseline
│   ├── LLaVA: LLaVA official project
│   ├── mPLUG-Owl: mPLUG-Owl official project
│   ├── mPLUG-Owl2: mPLUG-Owl2 official project
│   ├── Otter: Otter official project
│   ├── openflamingo: OpenFlamingo official project
│   ├── MIC: MMICL official project
│   ├── llama: LLaMA official project
│   ├── FastChat: Vicuna official project
│   ├── demo
│   ├── infer_image_caption.py
│   ├── infer_question_answer.py
│   ├── infer_image_identification.py
│   ├── infer_image_explanation.py
│   └── pipeline.py
├── ours
│   ├── GLIP: GLIP official project
│   ├── database_construct.py
│   ├── retrieval_augment_generation.py
│   └── object_detection.py
├── tools
│   ├── generate_vqa.py
│   ├── preprocess.py
│   └── download.py
├── evaluate
│   ├── eval_image_caption.py
│   ├── eval_question_answer.py
│   ├── eval_identification.py
│   ├── eval_pipeline.py
│   └── eval_image_explanation.py
├── result
│   ├── image_caption: table-1
│   ├── question_answer: table-1
│   ├── pipeline_identification: table-2
│   ├── pipeline_explanation: table-2
│   ├── image_identification: table-3
│   └── image_explanation: table-3
└── readme.md
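Once the dataset has been downloaded (see below), the annotations can be inspected directly with pandas. This is only a minimal sketch; the column names used here (`image`, `label`, `knowledge_background`) are hypothetical placeholders, not the actual schema of `annotation.xlsx`.

```python
# Minimal sketch for browsing the annotations. The column names below are
# hypothetical placeholders; inspect the printed schema before relying on them.
import os
import pandas as pd

ann = pd.read_excel("datasets/annotation.xlsx")
print(ann.columns.tolist())  # check the real column names first

for _, row in ann.head(5).iterrows():
    image_path = os.path.join("datasets/images", str(row.get("image")))
    print(image_path, row.get("label"), row.get("knowledge_background"))
```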
export HF_ENDPOINT=https://hf-mirror.com
cd main
python download.py
python preprocess.py -task caption
python preprocess.py -task explanation
python preprocess.py -task vqa
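If you prefer to fetch a single checkpoint manually rather than through `download.py`, the same mirror endpoint can be set from Python. The sketch below uses `huggingface_hub`, and the repo id shown is only an example.

```python
# Optional: pull one checkpoint through the HF mirror by hand.
# The repo id is only an example; download.py handles the full model set.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Salesforce/blip-image-captioning-base",
    local_dir="./models/blip-image-captioning-base",
)
```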
For the BLIP series, mPLUG-Owl series, LLaVA series, etc., follow the environment setup of the corresponding official GitHub project (see `baseline/`), then activate the matching conda environment:
conda activate blip
conda activate mplug_owl
conda activate mplug_owl2
conda activate llava
conda activate llama
conda activate vicuna
conda activate cfr
[Note] The test inputs differ slightly across the four tasks (a sample-selection sketch follows this list):
- Image Captioning and Visual Question Answering are tested on all images, both positive and negative samples, under the zero-shot setting only.
- Image Identification is tested on all images. In the few-shot setting, the model reads, in addition to the test image, 2 random samples (positive or negative) drawn from the same knowledge background as the test image; in the CoCoT setting, it additionally reads the 1 corresponding opposite sample.
- Image Explanation is tested on negative samples only. The few-shot and CoCoT settings provide the same extra inputs as for Image Identification: 2 random same-background samples, or the 1 corresponding opposite (i.e., positive) sample.
- GPT-4V is an exception: its Image Identification results are obtained by post-processing its Image Explanation outputs, so its Image Identification is also evaluated on negative samples only.
- In the pipeline method, the two few-shot samples are drawn from the entire dataset and may be positive, negative, or from other knowledge backgrounds.
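The sketch below only illustrates the sample-selection rules listed above; the record fields `background`, `label`, and `pair_id` are hypothetical placeholders, not the actual annotation schema.

```python
# Illustration of the in-context sample-selection rules described above.
# The record fields (background, label, pair_id) are hypothetical placeholders.
import random

def few_shot_samples(test_item, dataset, k=2):
    """Pick k random samples sharing the test image's knowledge background
    (they may be positive or negative), excluding the test image itself."""
    pool = [x for x in dataset
            if x["background"] == test_item["background"] and x is not test_item]
    return random.sample(pool, k)

def cocot_sample(test_item, dataset):
    """Pick the single opposite-label counterpart of the test image
    (for Image Explanation this is the corresponding positive sample)."""
    return next(x for x in dataset
                if x["pair_id"] == test_item["pair_id"]
                and x["label"] != test_item["label"])
```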
- image caption inference
python infer_image_caption.py -model BLIP-Base
- VQA inference
python infer_question_answer.py -model BLIP-Base
- image identification inference
python infer_image_identification.py -model BLIP2-XL -setting z
- image explanation inference
python infer_image_explanation.py -model BLIP2-XL -setting z
- pipeline method inference
python pipeline.py -model LLaMA-2-7B -setting z -withCoT n
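For reference, the following is a minimal zero-shot sketch of what a captioning run does, assuming the Hugging Face `transformers` BLIP classes and the BLIP-Base checkpoint from the table below; the image file name is a placeholder, and `infer_image_caption.py` presumably wraps a loop of this kind over the whole dataset and writes the outputs under `results/`.

```python
# Minimal zero-shot captioning sketch with the BLIP-Base checkpoint listed below.
# The image file name is a placeholder.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "./models/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt).to("cuda").eval()

image = Image.open("datasets/images/example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```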
baseline models for image caption:
model | checkpoint file |
---|---|
BLIP-Base | ./models/blip-image-captioning-base |
BLIP2-XL | ./models/blip2-flan-t5-xl |
BLIP2-XXL | ./models/blip2-flan-t5-xxl |
InstructBLIP-XL | ./models/instructblip-flan-t5-xl |
InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl |
mPLUG-owl-7B | ./models/mplug-owl-llama-7b |
mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b |
LLaVA-1.5-7B | ./models/llava-v1.5-7b |
LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b |
baseline models for VQA:
model | checkpoint file |
---|---|
BLIP-Base | ./models/blip-vqa-base |
BLIP2-XL | ./models/blip2-flan-t5-xl |
BLIP2-XXL | ./models/blip2-flan-t5-xxl |
InstructBLIP-XL | ./models/instructblip-flan-t5-xl |
InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl |
mPLUG-owl-7B | ./models/mplug-owl-llama-7b |
mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b |
LLaVA-1.5-7B | ./models/llava-v1.5-7b |
LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b |
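Analogously, here is a minimal zero-shot VQA sketch with the `blip-vqa-base` checkpoint from the table above; the image file name and the question are only examples.

```python
# Minimal zero-shot VQA sketch with the blip-vqa-base checkpoint listed above.
# The image file name and the question are placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

ckpt = "./models/blip-vqa-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForQuestionAnswering.from_pretrained(ckpt).to("cuda").eval()

image = Image.open("datasets/images/example.jpg").convert("RGB")
inputs = processor(images=image, text="What is unusual in this image?",
                   return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```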
baseline models for image identification and explanation:
model | checkpoint file | setting |
---|---|---|
BLIP2-XL | ./models/blip2-flan-t5-xl | zero-shot |
BLIP2-XXL | ./models/blip2-flan-t5-xxl | zero-shot |
InstructBLIP-XL | ./models/instructblip-flan-t5-xl | zero-shot |
InstructBLIP-XXL | ./models/instructblip-flan-t5-xxl | zero-shot |
mPLUG-owl-7B | ./models/mplug-owl-llama-7b | zero-shot |
mPLUG-owl2-7B | ./models/mplug-owl2-llama-7b | zero-shot |
LLaVA-1.5-7B | ./models/llava-v1.5-7b | zero-shot |
LLaVA-1.6-7B | ./models/llava-v1.6-vicuna-7b | zero-shot |
MMICL | ./models/MMICL-Instructblip-T5-xl | few-shot, CoCoT |
OpenFlamingo | ./models/OpenFlamingo-3B-vitl-mpt1b | few-shot, CoCoT |
Otter-7B | ./models/OTTER-Image-LLaMA7B-LA-InContext | few-shot, CoCoT |
GEMINI | coming soon... | few-shot, CoCoT |
GPT-4V | sk-XXXXXXXXXXXXXXXXXXXXX | few-shot, CoCoT |
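For the API-based models, a CoCoT request pairs the test image with its opposite-label counterpart in a single call. The sketch below uses the OpenAI Python client; the prompt wording, file names, and model name are placeholders, not the prompts actually used in the experiments.

```python
# Hedged sketch of a CoCoT-style GPT-4V call: the test image plus its
# opposite-label counterpart in one request. Prompt, file names, and model
# name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(api_key="sk-XXXXXXXXXXXXXXXXXXXXX")

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Compare the two images and decide whether the first one "
                     "conflicts with the relevant knowledge."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64('test.jpg')}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64('opposite.jpg')}"}},
        ],
    }],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```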
LLMs for the pipeline method:
model | checkpoint file |
---|---|
llama-2-7b | ./models/Llama-2-7b-hf |
llama-2-13b | ./models/Llama-2-13b-hf |
vicuna-1.5-7b | ./models/vicuna-7b-v1.5 |
vicuna-1.5-13b | ./models/vicuna-13b-v1.5 |
GPT-3.5 | sk-XXXXXXXXXXXXXXXXXXXXX |
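One plausible reading of the pipeline baseline is a caption-then-reason wiring: the image is first converted to text, and the text-only LLM then makes the judgment (optionally with a CoT-style prompt, cf. the `-withCoT` flag). The sketch below is a hypothetical illustration under that assumption; `caption_image` and the prompt are placeholders, and the actual `pipeline.py` may differ.

```python
# Hypothetical caption-then-reason pipeline sketch with a text-only LLM.
# caption_image() and the prompt are placeholders; pipeline.py may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

def caption_image(image_path):
    # Placeholder: in practice this would call one of the captioning models above.
    return "a dummy caption used only for illustration"

ckpt = "./models/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto").eval()

caption = caption_image("datasets/images/example.jpg")
prompt = (f"Image description: {caption}\n"
          "Question: Does the described scene conflict with the relevant knowledge? "
          "Let's think step by step.")  # CoT-style prompt; wording is a placeholder

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```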
coming soon...
- image caption evaluation
python evaluate/eval_image_caption.py
- VQA evaluation
python evaluate/eval_question_answer.py
- image identification evaluation
python evaluate/eval_identification.py
- image explanation evaluation
python evaluate/eval_image_explanation.py
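For reference, identification results can be scored as plain accuracy over the saved predictions. The snippet below is only a sketch; the result-file name and its `prediction`/`label` columns are hypothetical placeholders, not necessarily the format the inference scripts produce or `eval_identification.py` expects.

```python
# Minimal accuracy sketch for the identification task. The file name and the
# prediction/label columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("results/image_identification/BLIP2-XL_zero-shot.csv")
pred = df["prediction"].str.strip().str.lower().str.startswith("yes")
gold = df["label"].str.strip().str.lower().eq("yes")
print(f"identification accuracy: {(pred == gold).mean():.4f}")
```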