Original Visual Spatial Reasoning repo
Note: currently this is true zero-shot (no fine-tuning). I benchmark the following CLIP models:
- OpenClip laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
- OpenClip laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- OpenAI Clip openai/clip-vit-large-patch14-336
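All three checkpoints can be loaded through the Hugging Face `transformers` CLIP classes. Below is a minimal zero-shot scoring sketch, not the repo's actual code: the `pick_best` helper and the image path are illustrative only, and the repo may use a different loading path (e.g. open_clip).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any of the three checkpoints listed above can be substituted here.
model_id = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

def pick_best(image, candidates):
    """Return the index of the caption CLIP scores highest for this image."""
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(candidates))
    return logits.argmax(dim=-1).item()

image = Image.open("data/images/example.jpg")  # hypothetical filename
```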
Findings:
- Using the (True) / (False) modifiers proposed in the paper gives results no better than random.
- After experimenting with many strategies for modifying the prompts, I was able to reach about 55% accuracy (slightly better than chance).
Open questions:
- Will fine-tuning the model give the same or better results as the model types in the VSR paper?
- How do the different relationships score (does CLIP natively understand any relationships reasonably well)?
python src\train.py --base_model ViT-L/14@336px --mini_batch_size 20 --batch_size 500 --learning_rate 2e-5
test_accuracy: 65.07%; trained model: model_run-113-65-07.pt
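The `--mini_batch_size` / `--batch_size` pair suggests gradient accumulation: forward passes of 20 examples, one optimizer step per 500 examples. The sketch below only illustrates that accumulation pattern; the loss, the scoring head, and the data are stand-ins, not the actual `train.py` logic.

```python
import torch

# Assumption: --mini_batch_size is the per-forward batch and --batch_size is the
# effective batch per optimizer step.
mini_batch_size, batch_size, lr = 20, 500, 2e-5
accum_steps = batch_size // mini_batch_size   # 25 forward passes per update

model = torch.nn.Linear(512, 1)               # stand-in for a CLIP-based scoring head
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
loss_fn = torch.nn.BCEWithLogitsLoss()

optimizer.zero_grad()
for step in range(accum_steps):
    feats = torch.randn(mini_batch_size, 512)                     # stand-in for CLIP features
    labels = torch.randint(0, 2, (mini_batch_size, 1)).float()    # stand-in True/False labels
    loss = loss_fn(model(feats), labels)
    (loss / accum_steps).backward()           # scale so gradients average over the window
optimizer.step()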
Uses the modified prompts, i.e. a pair like the following (see the scoring sketch after the examples):
- The horse is left of
- The horse is left of the person.
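A hedged guess at how such a pair might be turned into a True/False prediction, reusing the hypothetical `pick_best` helper and `image` from the sketch above; the actual eval scripts may score the pair differently.

```python
# Predict True when CLIP prefers the full caption over the truncated variant.
prompts = ["The horse is left of",               # truncated variant
           "The horse is left of the person."]   # full caption
predicted_label = pick_best(image, prompts) == 1
```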
python src\eval002.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Score: 55.23%
python src\eval002.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K
Score: 55.44%
python src\eval002.py --model_url openai/clip-vit-large-patch14-336
Score: 54.39%
python src\eval001.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Score: 55.23%
python src\eval001.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K
Score: 53.83%
python src\eval001.py --model_url openai/clip-vit-large-patch14-336
Score: 53.86%
Uses the prompts from the VSR paper (but without retraining), i.e. a pair like the following (see the sketch after the examples):
- The horse is left of the person. (False)
- The horse is left of the person. (True)
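For comparison, the paper-style prompts pair the same caption with the two suffixes; with the same hypothetical `pick_best` helper the prediction would be:

```python
# Predict True when the "(True)"-suffixed caption scores higher.
caption = "The horse is left of the person."
candidates = [f"{caption} (True)", f"{caption} (False)"]
predicted_label = pick_best(image, candidates) == 0
```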
python src\eval000.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
Score: 49.24%
python src\eval000.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K
Score: 49.51%
python src\eval000.py --model_url openai/clip-vit-large-patch14-336
Score: 48.85%
conda env create
conda activate clip-vsr
python src\eval.py
See the README in the data/ folder. Images should be saved under data/images/.
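As an illustration only, annotations could be paired with images like this; the jsonl filename and the `image` / `caption` / `label` field names are assumptions about the VSR annotation format, not guaranteed by this repo.

```python
import json
from pathlib import Path

# Assumed layout: jsonl annotation files in data/ and images in data/images/.
data_dir = Path("data")
examples = []
with open(data_dir / "test.jsonl") as f:          # hypothetical filename
    for line in f:
        record = json.loads(line)
        examples.append({
            "image_path": data_dir / "images" / record["image"],  # assumed field name
            "caption": record["caption"],                         # assumed field name
            "label": record["label"],                             # assumed: 1 = True, 0 = False
        })
```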
If you use the VSR dataset please cite the original authors:
@article{Liu2022VisualSR,
title={Visual Spatial Reasoning},
author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
journal={ArXiv},
year={2022},
volume={abs/2205.00363}
}
This project is licensed under the Apache-2.0 License.