[🍎 Project Page] [📖 arXiv Paper] [📊 Dataset]
We store the data in an easy-to-use JSON format. The final filtered dataset can be found in `data/discrn_balanced.json` and follows the format below:
```json
{
    "id": "r7",
    "selection_type": "random",
    "q_type": "mc_2",
    "examples": [
        {
            "source": "clothov1_instruct_val",
            "id": "street 2.wav",
            "caption": "A busy street with a car shifting gears in traffic",
            "url": ""
        },
        {
            "source": "objaverse_pointllm_val",
            "id": "760c0d78327b4846975061c6cd8fd004",
            "caption": "a red sports car with black wheels.",
            "url": ""
        }
    ],
    "modalities": [
        "audio",
        "pc"
    ],
    "questions": [
        "Which scene evokes more motion?"
    ],
    "answers": [
        "Scene A"
    ],
    "category": "Motion"
}
```
Each dataset entry consists of a unique `id`, a negative sampling type (`selection_type`), information about the source `examples` across different modalities, the `questions` and `answers`, the question `category`, and the respective model answers across permutations.
- `id`: Unique identifier for the dataset entry.
- `selection_type`: The method used for selecting negative examples.
- `q_type`: The question type, indicating the number of choices.
- `examples`:
  - `source`: The dataset from which the example is taken.
  - `id`: A unique identifier for the example within its source.
  - `caption`: A description of the content or scene depicted in the example.
  - `url`: The URL to the example, if available.
- `modalities`: The modalities of the provided examples. The i-th entry in `modalities` corresponds to the modality of the i-th example in `examples`.
- `questions`: Example question.
- `answers`: Ground-truth answer.
- `category`: Question category (predicted using in-context learning with LLaMA-2 13B).
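For a quick sanity check, the entries can be loaded with the standard library. A minimal sketch, assuming the top-level JSON value is a list of entry dicts in the format above:

```python
# load DisCRn entries and print a few question/answer pairs
import json

with open("data/discrn_balanced.json") as f:
    entries = json.load(f)  # assumed: a list of entry dicts as documented above

for entry in entries[:3]:
    for example, modality in zip(entry["examples"], entry["modalities"]):
        print(modality, example["source"], example["id"])
    print(entry["questions"][0], "->", entry["answers"][0])
```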
Download the source data for DisCRn.
Image Data
- MSCOCO: Download the MSCOCO Val2014 split from here
- Densely Captioned Images: Download the Densely Captioned Images source from here after accepting the terms of SA-1B.
Audio Data
We recommend using `aac-datasets` to download the audio data.
```python
from aac_datasets import AudioCaps, Clotho

# AudioCaps validation split
audiocaps = AudioCaps(root="/path/to/save/folder", subset="val", download=True)
# Clotho evaluation and validation splits
clotho_eval = Clotho(root="/path/to/save/folder", subset="eval", download=True)
clotho_val = Clotho(root="/path/to/save/folder", subset="val", download=True)
```
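Once downloaded, individual clips can be inspected with dict-style item access. A minimal sketch; the `fname` and `captions` keys follow aac-datasets' item format, but treat them as assumptions if your version differs:

```python
# inspect one Clotho evaluation clip
from aac_datasets import Clotho

dataset = Clotho(root="/path/to/save/folder", subset="eval", download=False)
item = dataset[0]  # each item is a dict of fields for one audio clip
print(item["fname"], item["captions"])  # file name and its reference captions
```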
3D Data
- Objaverse: the formatted data for OneLLM and X-InstructBLIP can be found in `objaverse_pc_parallel` here. For CREMA, Objaverse data should be preprocessed as described in 3D-LLM.
Video Data
We provide `data/data2path.json`, which maps each example id to the path of its file. For each dataset it includes a `directory` field, which should be updated to the corresponding data root from the datasets downloaded according to the instructions above.
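As an illustration, path resolution could look like the sketch below. The exact schema of `data2path.json` beyond the per-dataset `directory` field is an assumption here (the `paths` key is hypothetical), so adapt the lookup to the actual file:

```python
# a hedged sketch: resolve an example id to a local file path
import json
import os

with open("data/data2path.json") as f:
    data2path = json.load(f)

def resolve(source: str, example_id: str) -> str:
    entry = data2path[source]
    # assumed layout: a "directory" root plus a hypothetical id -> relative-path mapping
    return os.path.join(entry["directory"], entry["paths"][example_id])
```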
In `cross_modal_baselines/`, clone the corresponding code for each baseline and follow the instructions to create a separate environment for each of them.
- X-InstructBLIP. To set up the environment, run the commands below.

```bash
cd cross_modal_baselines/
git clone https://github.com/salesforce/LAVIS.git
conda create -n lavis python=3.8
conda activate lavis  # activate the environment before installing into it
pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
wget -P /usr/bin https://github.com/unlimblue/KNN_CUDA/raw/master/ninja
```
To run the baseline:

```bash
conda activate lavis
python xinstructblip.py
```
- CREMA. To set up CREMA, follow the instructions below.

```bash
git clone https://github.com/Yui010206/CREMA.git
conda create -n crema python=3.8
conda activate crema
cd CREMA
pip install -e .
```
To run the baseline:

```bash
conda activate crema
python crema.py
```
- OneLLM. To set up OneLLM, follow the instructions below.

```bash
git clone https://github.com/csuhan/OneLLM
cd OneLLM
conda create -n onellm python=3.9 -y
conda activate onellm
pip install -r requirements.txt
cd model/lib/pointnet2
python setup.py install
```
To run the baseline:

```bash
conda activate onellm
python onellm.py
```
Install the requirements in `caption_models/requirements.txt`. Then run:

```bash
cd caption_models
python caption_baseline.py --type type_of_input --model_id huggingface/model/id/or/path
```
The `--type` argument can be any of `random`, `predicted`, `oracle`, or `no_input`.
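If it helps, here is a hedged Python sketch that sweeps all four input types via the CLI above; the model id is the same placeholder as in the command, not a real checkpoint:

```python
# a minimal sketch: run the caption baseline once per input type
import subprocess

MODEL_ID = "huggingface/model/id/or/path"  # placeholder, replace with a real model id

for input_type in ["random", "predicted", "oracle", "no_input"]:
    subprocess.run(
        ["python", "caption_baseline.py", "--type", input_type, "--model_id", MODEL_ID],
        check=True,  # stop if any run fails
    )
```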
To compute accuracy across different subsets of the dataset, run:

```bash
python eval.py --results_file path/to/results/file
```
and to compute MSNR (Multimodal Signal over Noise Ratio), run:

```bash
python eval.py --results_file path/to/results/file --no_input_results_file path/to/no/input/results/file --random_results_file path/to/random/results/file
```
TBA
This repository and its data are intended for research purposes and reproducibility of the work only.