[🍎 Project Page] [📖 arXiv Paper] [📊 Dataset]
We store the data in an easy-to-use JSON format. The final filtered dataset can be found in `data/discrn_balanced.json` and follows the format below:
```json
{
    "id": "r7",
    "selection_type": "random",
    "q_type": "mc_2",
    "examples": [
        {
            "source": "clothov1_instruct_val",
            "id": "street 2.wav",
            "caption": "A busy street with a car shifting gears in traffic",
            "url": ""
        },
        {
            "source": "objaverse_pointllm_val",
            "id": "760c0d78327b4846975061c6cd8fd004",
            "caption": "a red sports car with black wheels.",
            "url": ""
        }
    ],
    "modalities": [
        "audio",
        "pc"
    ],
    "questions": [
        "Which scene evokes more motion?"
    ],
    "answers": [
        "Scene A"
    ],
    "category": "Motion"
}
```
Each dataset entry consists of a unique `id`, a negative sampling type (`selection_type`), information about the source `examples` across different modalities, the `questions` and `answers`, the question `category`, and the respective model answers across permutations.
- `id`: Unique identifier for the dataset entry.
- `selection_type`: The method used for selecting negative examples.
- `q_type`: The question type, indicating the number of choices.
- `examples`:
  - `source`: The dataset from which the example is taken.
  - `id`: A unique identifier for the example within its source.
  - `caption`: A description of the content or scene depicted in the example.
  - `url`: The URL to the example, if available.
- `modalities`: The modalities of the provided examples. The i-th entry in `modalities` corresponds to the modality of the i-th example in `examples`.
- `questions`: Example question.
- `answers`: Ground-truth answer.
- `category`: Question category (predicted using in-context learning with LLaMA-2 13B).
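For a quick sanity check, the entries can be loaded with the standard library. A minimal sketch, assuming the top-level JSON value is a list of entry dicts in the format above:

```python
# load DisCRn entries and print a few question/answer pairs
import json

with open("data/discrn_balanced.json") as f:
    entries = json.load(f)  # assumed: a list of entry dicts as documented above

for entry in entries[:3]:
    for example, modality in zip(entry["examples"], entry["modalities"]):
        print(modality, example["source"], example["id"])
    print(entry["questions"][0], "->", entry["answers"][0])
```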
Download the source data for DisCRn.
Image Data
- MSCOCO: Download the MSCOCO Val2014 split from here
- Densely Captioned Images: Download the Densely Captioned Images source from here after accepting the terms of SA-1B.
Audio Data
We recommend using `aac-datasets` to download the audio data.
```python
from aac_datasets import AudioCaps, Clotho

# AudioCaps validation split
audiocaps = AudioCaps(root="/path/to/save/folder", subset="val", download=True)
# Clotho evaluation and validation splits
clotho_eval = Clotho(root="/path/to/save/folder", subset="eval", download=True)
clotho_val = Clotho(root="/path/to/save/folder", subset="val", download=True)
```
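Once downloaded, individual clips can be inspected with dict-style item access. A minimal sketch; the `fname` and `captions` keys follow aac-datasets' item format, but treat them as assumptions if your version differs:

```python
# inspect one Clotho evaluation clip
from aac_datasets import Clotho

dataset = Clotho(root="/path/to/save/folder", subset="eval", download=False)
item = dataset[0]  # each item is a dict of fields for one audio clip
print(item["fname"], item["captions"])  # file name and its reference captions
```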
3D Data
- Objaverse: the formatted data for OneLLM and X-InstructBLIP can be found in `objaverse_pc_parallel` here. For CREMA, Objaverse data should be preprocessed as described in 3D-LLM.
Video Data
We provide `data/data2path.json`, which maps each example id to the path of its file. For each dataset it includes a `directory` field, which should be updated to the corresponding data root from the datasets downloaded according to the instructions above.
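As an illustration, path resolution could look like the sketch below. The exact schema of `data2path.json` beyond the per-dataset `directory` field is an assumption here (the `paths` key is hypothetical), so adapt the lookup to the actual file:

```python
# a hedged sketch: resolve an example id to a local file path
import json
import os

with open("data/data2path.json") as f:
    data2path = json.load(f)

def resolve(source: str, example_id: str) -> str:
    entry = data2path[source]
    # assumed layout: a "directory" root plus a hypothetical id -> relative-path mapping
    return os.path.join(entry["directory"], entry["paths"][example_id])
```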
In `cross_modal_baselines/`, clone the corresponding code for each baseline and follow the instructions to create a separate environment for each of them.
- X-InstructBLIP. To set up the environment, run the commands below.

```bash
cd cross_modal_baselines/
git clone https://github.com/salesforce/LAVIS.git
conda create -n lavis python=3.8
conda activate lavis  # activate the environment before installing into it
pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
wget -P /usr/bin https://github.com/unlimblue/KNN_CUDA/raw/master/ninja
```
To run the baseline:

```bash
conda activate lavis
python xinstructblip.py
```
- CREMA. To set up CREMA, follow the instructions below.

```bash
git clone https://github.com/Yui010206/CREMA.git
conda create -n crema python=3.8
conda activate crema
cd CREMA
pip install -e .
```
To run the baseline:

```bash
conda activate crema
python crema.py
```
- OneLLM. To set up OneLLM, follow the instructions below.

```bash
git clone https://github.com/csuhan/OneLLM
cd OneLLM
conda create -n onellm python=3.9 -y
conda activate onellm
pip install -r requirements.txt
cd model/lib/pointnet2
python setup.py install
```
To run the baseline:

```bash
conda activate onellm
python onellm.py
```
Install the requirements in `caption_models/requirements.txt`. Then run:

```bash
cd caption_models
python caption_baseline.py --type type_of_input --model_id huggingface/model/id/or/path
```
The `--type` argument can be any of `random`, `predicted`, `oracle`, or `no_input`.
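If it helps, here is a hedged Python sketch that sweeps all four input types via the CLI above; the model id is the same placeholder as in the command, not a real checkpoint:

```python
# a minimal sketch: run the caption baseline once per input type
import subprocess

MODEL_ID = "huggingface/model/id/or/path"  # placeholder, replace with a real model id

for input_type in ["random", "predicted", "oracle", "no_input"]:
    subprocess.run(
        ["python", "caption_baseline.py", "--type", input_type, "--model_id", MODEL_ID],
        check=True,  # stop if any run fails
    )
```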
To compute accuracy across different subsets of the dataset, run:

```bash
python eval.py --results_file path/to/results/file
```
and to compute MSNR (Multimodal Signal over Noise Ratio), run:

```bash
python eval.py --results_file path/to/results/file --no_input_results_file path/to/no/input/results/file --random_results_file path/to/random/results/file
```
TBA
This repository and its data are intended for research purposes and reproducibility of the work only.