DisCRn Benchmark: Evaluating Discriminatory Cross-Modal Reasoning in Audio, Video, Image, and 3D

[🍎 Project Page] [📖 arXiv Paper] [📊 Dataset]

Overview

Data

We store the data in an easy-to-use JSON format. The final filtered dataset can be found in data/discrn_balanced.json and follows the format below:

{
    "id": "r7",
    "selection_type": "random",
    "q_type": "mc_2",
    "examples": [
        {
            "source": "clothov1_instruct_val",
            "id": "street 2.wav",
            "caption": "A busy street with a car shifting gears in traffic"
            "url": "" # only if available
        },
        {
            "source": "objaverse_pointllm_val",
            "id": "760c0d78327b4846975061c6cd8fd004",
            "caption": "a red sports car with black wheels."
            "url": "" # only if available
        }
    ],
    "modalities": [
        "audio",
        "pc"
    ],
    "questions": [
        "Which scene  evokes more motion?"
    ],
    "answers": [
        "Scene A"
    ],
    "category": "Motion"
}
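
For reference, here is a minimal sketch of loading and inspecting an entry with the Python standard library; it assumes data/discrn_balanced.json holds a list of entries in the format above.

import json

# Load the filtered benchmark (assumed to be a list of entries as shown above).
with open("data/discrn_balanced.json") as f:
    data = json.load(f)

entry = data[0]

# The i-th modality corresponds to the i-th example.
for example, modality in zip(entry["examples"], entry["modalities"]):
    print(modality, example["source"], example["caption"])

print("Q:", entry["questions"][0], "| A:", entry["answers"][0])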

Data Description

Each dataset entry consists of a unique id, the negative sampling type (selection_type), information about the source examples across different modalities, the question and answer, the question category, and the respective model answers across permutations.

Structure

  • id: Unique identifier for the dataset entry.
  • selection_type: The method used for selecting negative examples.
  • q_type: The question type, indicating the number of choices.
  • examples:
    • source: The dataset from which the example is taken.
    • id: A unique identifier for the example within its source.
    • caption: A description of the content or scene depicted in the example.
    • url: The URL to the example, if available.
  • modalities: The modalities of the provided examples; the i-th entry in modalities is the modality of the i-th example in examples.
  • questions: Example question.
  • answers: Ground truth answer.
  • category: Question category (predicted using in-context learning with LLaMA-2 13B).

Data Sources

Download the source data for DisCRn.

Image Data

Audio Data

We recommend using aac-datasets to download the audio data.

from aac_datasets import AudioCaps, Clotho

# AudioCaps validation split
audiocaps = AudioCaps(root="/path/to/save/folder", subset="val", download=True)

# Clotho evaluation and validation splits
clotho_eval = Clotho(root="/path/to/save/folder", subset="eval", download=True)
clotho_val = Clotho(root="/path/to/save/folder", subset="val", download=True)

3D Data

  • Objaverse: the formatted data for OneLLM and X-InstructBLIP can be found in objaverse_pc_parallel here. For CREMA, the Objaverse data should be preprocessed as described in 3D-LLM.
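
Purely as an illustration (the on-disk layout of the point cloud data is not specified here, so the path and file extension below are assumptions), loading a single preprocessed object might look like:

import numpy as np

# Hypothetical layout: one array per Objaverse object id.
# Adjust the path/extension to the actual structure of the downloaded data.
pc = np.load("/path/to/objaverse_pc_parallel/760c0d78327b4846975061c6cd8fd004.npy")
print(pc.shape)  # e.g. (num_points, channels) such as xyz (+ rgb)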

Video Data

Update Data Directory Roots

We provide data/data2path.json, which maps each file id to its directory. For each dataset we include a directory field, which should be updated to the corresponding data root of the datasets downloaded according to the instructions above.
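
One possible way to update the roots in bulk is sketched below; the directory field comes from the description above, while the dataset keys and paths are placeholders and should be replaced with the names actually present in data2path.json and your local download locations.

import json

with open("data/data2path.json") as f:
    data2path = json.load(f)

# Illustrative mapping from dataset name to local data root (placeholders).
local_roots = {
    "clothov1_instruct_val": "/data/clotho",
    "objaverse_pointllm_val": "/data/objaverse_pc_parallel",
}

# Point each dataset's directory field at its local root.
for name, root in local_roots.items():
    if name in data2path:
        data2path[name]["directory"] = root

with open("data/data2path.json", "w") as f:
    json.dump(data2path, f, indent=4)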

Baselines

Cross-modal Baselines

Install Baseline Repositories

In cross_modal_baselines/ clone the corresponding code for each baseline and follow the instructions to create separate environments for each of them.

  • LAVIS/ (X-InstructBLIP). To set up X-InstructBLIP follow the instructions below.
cd cross_modal_baselines/
git clone https://github.com/salesforce/LAVIS.git
conda create -n lavis python=3.8
pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
wget -P /usr/bin https://github.com/unlimblue/KNN_CUDA/raw/master/ninja

To run the baseline, run

conda activate lavis
python xinstructblip.py
  • CREMA/. To set up CREMA follow the instructions below.
git clone https://github.com/Yui010206/CREMA.git
conda create -n crema python=3.8
conda activate crema 
cd CREMA
pip install -e .

To run the baseline, run

conda activate crema
python crema.py
  • OneLLM/. To set up OneLLM follow the instructions below.
git clone https://github.com/csuhan/OneLLM
cd OneLLM
conda create -n onellm python=3.9 -y
conda activate onellm
pip install -r requirements.txt
cd model/lib/pointnet2
python setup.py install

To run the baseline, run

conda activate onellm
python onellm.py

Caption Baselines

Install the requirements in caption_models/requirements.txt, then run

cd caption_models
python caption_baseline.py --type type_of_input --model_id huggingface/model/id/or/path

The --type argument can be any of random, predicted, oracle, or no_input.

Evaluation

To compute accuracy across different subsets of the dataset, run

python eval.py --results_file path/to/results/file

and to compute MSNR (Multimodal Signal over Noise Ratio), run

python eval.py --results_file path/to/results/file --no_input_results_file path/to/no/input/results/file --random_results_file path/to/random/results/file

Citation

TBA

Note

This repository and its data are intended for research purposes and reproducibility of the work only.