Ron Campos*, Ashmal Vayani*, Parth Parag Kulkarni*, Rohit Gupta, Mubarak Shah
- Mar-20-25: Technical report of GAEA is released on arXiv! 🔥🔥
- Mar-20-25: GAEA-1.6M, the GAEA-Bench dataset, and code are released. GAEA-Bench comprises 4,000 diverse conversational QA pairs equipped with geolocalization capabilities. GAEA-1.6M contains over 1.6M QA pairs for enhancing the conversational capabilities of our geolocalizable LMM, GAEA. 🔥🔥
Figure: Data Collection and Annotation Pipeline. (Left): GAEA-1.6M includes geographically diverse visual samples from various data sources, such as MP-16, GLD-v2, and CityGuessr68k. (Middle): We also incorporate OpenStreetMap (OSM) metadata and auxiliary context for each image, ranging from climate zones to geographical clues about the country. (Right): Using open-source LLMs and GPT-4o, we generate four diverse question-answer pairs across geolocation, reasoning, and conversational subsets.
Abstract: Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot use the model to learn anything beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed: beyond general tasks, LMMs struggle with more specialized downstream tasks, one of which is geolocalization. In this work, we propose to solve this problem by introducing a conversational model, GAEA, that can provide information regarding the location of an image, as required by a user. No large-scale dataset enabling the training of such a model exists. Thus, we propose a comprehensive dataset, GAEA-1.6M, with 800K images and around 1.6M question-answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 4K image-text pairs to evaluate conversational capabilities with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%. We will publicly release our dataset and codes.
GAEA is the first open-source conversational model equipped with global-scale geolocalization capabilities.
Main contributions:
- GAEA-1.6M: A Diverse Training Dataset. We propose GAEA-1.6M, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.
- GAEA-Bench: Evaluating Conversational Geolocalization. To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.
- GAEA: An Interactive Geolocalization Chatbot. We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.
- Benchmarking Against State-of-the-Art LMMs. We quantitatively compare our model's performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.
Figure: Overview of `GAEA-Bench`. `GAEA-Bench` is designed to evaluate the conversational abilities of various LMMs across different question types, including MCQs, T/F, and both short and long VQAs. We have carefully selected a subset of 4k samples from MP-16 and generated corresponding OSM metadata to generate QA pairs using GPT-4o. `GAEA-Bench` aims to fill the gap in conversational benchmarks by incorporating geolocalization capabilities.
Figure: The evaluation pipeline highlights the various question types we introduce in GAEA-Bench. We use GPT-4o as a judge to score the responses on different criteria.
Figure: Our classification accuracy pipeline evaluates city and country predictions by comparing them against ground truth annotations derived from GPS coordinates, with GPT-4o serving as the evaluator.
Statistic | Value |
---|---|
Total images | 822,951 |
Total cities / countries | 41,481 / 234 |
Total questions | 1,580,531 |
Total geo-localization questions | 822,951 |
Total explanatory captions | 384,947 |
Total open-ended questions | 267,668 |
Total multiple-choice questions | 48,673 |
Total true/false questions | 56,292 |
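As a quick consistency check, the five per-type counts in the table add up to the reported total number of questions. The short Python snippet below only re-adds the numbers copied from the table and derives each type's share; the dictionary keys are illustrative labels, not field names from the dataset.

```python
# Question counts copied from the statistics table above.
# The keys are illustrative labels, not dataset field names.
question_counts = {
    "geo-localization": 822_951,
    "explanatory captions": 384_947,
    "open-ended": 267_668,
    "multiple-choice": 48_673,
    "true/false": 56_292,
}

reported_total = 1_580_531
computed_total = sum(question_counts.values())

# Share of each question type in the full set, as a percentage.
shares = {name: 100 * n / computed_total for name, n in question_counts.items()}

if __name__ == "__main__":
    print("totals match:", computed_total == reported_total)
    for name, pct in shares.items():
        print(f"{name}: {pct:.1f}%")
```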
Figure: We showcase various question types, including multiple-choice, true/false, and short and long VQAs, generated using an open-source model on our GAEA-1.6M dataset. We carefully select geographical tags from OSM metadata to generate diverse question types.
The GAEA-1.6M dataset can be downloaded from our Hugging Face page. GAEA-1.6M consists of 1.6M question-answer (QA) pairs spanning four question types: MCQs, TF, and Short and Long VQAs. The general structure of our dataset looks like the following:
GAEA-1.6M/
|-- MP-16/
|   |-- ###/
|   |   |-- ###/
|   |   |   |-- ##########.jpg
|   |   |   |-- ... # remaining images
|   |-- ... # remaining folders with similar structure
|-- GLDv2/
|   |-- #/
|   |   |-- #/
|   |   |   |-- #/
|   |   |   |   |-- ##########.jpg
|   |   |   |   |-- ... # remaining images
|   |-- ... # remaining folders with similar structure
|-- CityGuessr/
|   |-- city_#_######.jpg
|   |-- ... # remaining images
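After downloading, it can be useful to confirm that each source folder in the layout above is populated. The following is a minimal sketch, not part of the released scripts: it assumes the three top-level folder names shown above and that all images use the `.jpg` extension, and the helper name `count_images` and the local path `GAEA-1.6M` are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Top-level folders as shown in the layout above.
SOURCES = ("MP-16", "GLDv2", "CityGuessr")


def count_images(root):
    """Count .jpg files under each source folder, at any nesting depth."""
    root = Path(root)
    counts = Counter()
    for source in SOURCES:
        source_dir = root / source
        if not source_dir.is_dir():
            counts[source] = 0  # folder missing or not yet downloaded
            continue
        counts[source] = sum(1 for p in source_dir.rglob("*.jpg") if p.is_file())
    return counts


if __name__ == "__main__":
    # "GAEA-1.6M" is an assumed local path; point it at your download location.
    for source, n in count_images("GAEA-1.6M").items():
        print(f"{source}: {n} images")
```

If the download is complete, the per-source counts should add up to the total image count reported in the statistics table.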
Download the dataset
# Download the GAEA-1.6M dataset
cd scripts
chmod +x download_gaea_train.sh
./download_gaea_train.sh
Download the Qwen2.5-VL weights
# Download Qwen2.5-VL base model
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
conda create -n gaea python=3.10
conda activate gaea
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install qwen-vl-utils
pip install flash-attn==2.5.8 --no-build-isolation
Please install the latest transformers from git to finetune Qwen2.5-VL
pip install git+https://github.com/huggingface/transformers accelerate
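Because several packages above are pinned to exact versions, a quick check of what actually got installed can save debugging time later. The sketch below uses only the standard library's `importlib.metadata`; the helper name `check_pins` is hypothetical, and the pin table simply mirrors the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install commands above.
PINNED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "flash-attn": "2.5.8",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Wheels from the PyTorch index carry a local tag such as "+cu121",
        # so compare only the part before "+".
        matches = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, matches)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

Note that `transformers` is installed from git above, so it has no fixed version to compare against and is left out of the table.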
cd scripts
chmod +x train_gaea.sh #make script executable.
./train_gaea.sh
The GAEA-Bench dataset can be downloaded from our Hugging Face page. GAEA-Bench consists of 4k conversational QA pairs extended from MP-16 and OpenStreetMap (OSM) in various question types, including MCQs, TF, and Short and Long VQAs.
# Download the GAEA-Bench dataset
cd scripts
chmod +x download_gaea_bench.sh
./download_gaea_bench.sh
# Organize the GAEA-Bench dataset
cd scripts
chmod +x prepare_gaea_bench.sh
./prepare_gaea_bench.sh
Run the following command for evaluation
cd scripts
chmod +x run_gaea_bench.sh #make script executable.
./run_gaea_bench.sh
Download the IM2GPS, IM2GPS3k, YFCC4k, YFCC26k, and GWS15k datasets to run the evaluation. After downloading them, update the paths in the shell script and run the evaluation command.
cd scripts
chmod +x run_distance_metrics.sh #make script executable.
./run_distance_metrics.sh
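For readers unfamiliar with distance-based geolocation metrics, the sketch below illustrates the usual idea: compute the great-circle (haversine) distance between each predicted and ground-truth coordinate, then report the fraction of predictions that fall within each distance threshold. This is an illustrative sketch, not the code invoked by `run_distance_metrics.sh`; the function names are hypothetical, and the thresholds of 1, 25, 200, 750, and 2500 km are the ones commonly reported on IM2GPS-style benchmarks, assumed here rather than taken from this repository.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Thresholds commonly reported on IM2GPS-style benchmarks (assumed here).
THRESHOLDS_KM = (1, 25, 200, 750, 2500)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def accuracy_at_thresholds(preds, targets, thresholds=THRESHOLDS_KM):
    """Fraction of predictions within each distance threshold of the ground truth."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and the same length")
    dists = [haversine_km(*p, *t) for p, t in zip(preds, targets)]
    return {t: sum(d <= t for d in dists) / len(dists) for t in thresholds}
```

Here `preds` and `targets` are lists of `(latitude, longitude)` tuples in degrees, and the result maps each threshold in km to an accuracy between 0 and 1.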
Download the CityGuessr, GeoDE, and Dollar Street datasets to run the evaluation. After downloading them, update the paths in the shell script and run the evaluation command.
cd scripts
chmod +x run_cc_preds.sh #make script executable.
./run_cc_preds.sh
Figure: We benchmark 11 open-source and proprietary LMMs on GAEA-Bench. Notably, `GAEA` outperforms all open-source models and scores higher than the proprietary models on decision-making questions (MCQs and TFs). We provide the relative performance change for each model compared to `GAEA`.
Figure: We benchmark the performance of various specialized models on standard geolocation datasets. `GAEA` demonstrates competitive results, outperforming GaGA on multiple distance thresholds in both IM2GPS and IM2GPS3k.
Figure: Classification accuracy for both city and country labels, where `GAEA` establishes itself as a strong baseline, surpassing several recent LMMs in performance.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The images in the GAEA and GAEA-Bench datasets are collected from public domains and sources (refer to the main paper for more details) and are for academic research use only.
By using GAEA and GAEA-Bench, you agree not to use the dataset for any harm or unfair discrimination. Please note that the data in this dataset may be subject to other agreements. Video copyrights belong to the original dataset providers, video creators, or platforms.
If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:
This repository has borrowed Video-LMM evaluation code from TimeChat and LLaMA-VID. We also borrowed partial code from the ALM-Bench and CVRR-Evaluation-Suite repositories. We thank the authors for releasing their code.