
SalesforceAIResearch/DisCRn_Bench

DisCRn Benchmark: Evaluating Discriminatory Cross-Modal Reasoning in Audio, Video, Image, and 3D

[🍎 Project Page] [📖 arXiv Paper] [📊 Dataset]

Overview

Data

We store the data in an easy-to-use JSON format. The final filtered dataset can be found in data/discrn_balanced.json and follows the format below:

{
    "id": "r7",
    "selection_type": "random",
    "q_type": "mc_2",
    "examples": [
        {
            "source": "clothov1_instruct_val",
            "id": "street 2.wav",
            "caption": "A busy street with a car shifting gears in traffic"
            "url": "" # only if available
        },
        {
            "source": "objaverse_pointllm_val",
            "id": "760c0d78327b4846975061c6cd8fd004",
            "caption": "a red sports car with black wheels."
            "url": "" # only if available
        }
    ],
    "modalities": [
        "audio",
        "pc"
    ],
    "questions": [
        "Which scene  evokes more motion?"
    ],
    "answers": [
        "Scene A"
    ],
    "category": "Motion"
}

Data Description

Each dataset entry consists of a unique id, the negative sampling type (selection_type), information about the source examples across the different modalities, the question and its ground-truth answer, the question category, and the respective model answers across permutations. A short loading sketch follows the field list below.

Structure

  • id: Unique identifier for the dataset entry.
  • selection_type: The method used for selecting negative examples.
  • q_type: The question type, indicating the number of choices.
  • examples:
    • source: The dataset from which the example is taken.
    • id: A unique identifier for the example within its source.
    • caption: A description of the content or scene depicted in the example.
    • url: The URL of the example, if one exists.
  • modalities: The modalities of the provided examples. The i'th entry in modalities corresponds to the modality of the i'th example in examples.
  • questions: The example question.
  • answers: The ground-truth answer.
  • category: Question category (predicted using in-context learning with LLaMA-2 13B).
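
As a minimal sketch of how these fields fit together, the snippet below loads data/discrn_balanced.json and prints each entry's question, the caption and modality of each candidate example, and the ground-truth answer. It assumes the file is a JSON list of entries shaped like the example above.

import json

# Load the filtered benchmark (assumed to be a list of entry dicts).
with open("data/discrn_balanced.json") as f:
    entries = json.load(f)

for entry in entries[:5]:
    print(f"[{entry['id']}] {entry['questions'][0]}")
    # The i'th modality corresponds to the i'th example.
    for modality, example in zip(entry["modalities"], entry["examples"]):
        print(f"  ({modality}) {example['caption']}")
    print(f"  answer: {entry['answers'][0]}")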

Data Sources

Download the source data for DisCRn.

Image Data

Audio Data

We recommend using aac-datasets to download the audio data.

from aac_datasets import AudioCaps, Clotho

# AudioCaps validation split
audiocaps = AudioCaps(root="/path/to/save/folder", subset="val", download=True)

# Clotho evaluation and validation splits
clotho_eval = Clotho(root="/path/to/save/folder", subset="eval", download=True)
clotho_val = Clotho(root="/path/to/save/folder", subset="val", download=True)

3D Data

  • Objaverse: the formatted point cloud data for OneLLM and X-InstructBLIP can be found in objaverse_pc_parallel here. For CREMA, Objaverse data should be preprocessed as described in 3D-LLM.

Video Data

Update Data Directory Roots

We provide data/data2path.json, which maps each example id to the path of its corresponding file. For each dataset it includes a directory field, which should be updated to the data root of the corresponding dataset downloaded according to the instructions above. A sketch of one way to perform this update is shown below.
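
As a minimal sketch, assuming data2path.json is a JSON object keyed by dataset name in which each value carries a directory field (the exact layout of the file may differ), the roots could be updated programmatically as follows; the NEW_ROOTS mapping and its paths are placeholders.

import json

# Placeholder mapping from dataset name to the local root where it was downloaded.
NEW_ROOTS = {
    "clothov1_instruct_val": "/data/clotho",
    "objaverse_pointllm_val": "/data/objaverse_pc_parallel",
}

with open("data/data2path.json") as f:
    data2path = json.load(f)

# Assumed layout: {dataset_name: {"directory": ..., ...}, ...}
for name, root in NEW_ROOTS.items():
    if name in data2path:
        data2path[name]["directory"] = root

with open("data/data2path.json", "w") as f:
    json.dump(data2path, f, indent=4)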

Baselines

Cross-modal Baselines

Install Baseline Repositories

In cross_modal_baselines/, clone the code for each baseline and follow the instructions below to create a separate environment for each.

  • LAVIS/. To set up X-InstructBLIP (LAVIS), follow the instructions below.
cd cross_modal_baselines/
git clone https://github.com/salesforce/LAVIS.git
conda create -n lavis python=3.8
pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
wget -P /usr/bin https://github.com/unlimblue/KNN_CUDA/raw/master/ninja

To run the baseline, run

conda activate lavis
python xinstructblip.py

  • CREMA/. To set up CREMA, follow the instructions below.
git clone https://github.com/Yui010206/CREMA.git
conda create -n crema python=3.8
conda activate crema 
cd CREMA
pip install -e .

To run the baseline, run

conda activate crema
python crema.py

  • OneLLM/. To set up OneLLM, follow the instructions below.
git clone https://github.com/csuhan/OneLLM
cd OneLLM
conda create -n onellm python=3.9 -y
conda activate onellm
pip install -r requirements.txt
cd model/lib/pointnet2
python setup.py install

To run the baseline, run

conda activate onellm
python onellm.py

Caption Baselines

Install the requirements in caption_models/requirements.txt. Then run

cd caption_models
python caption_baseline.py --type type_of_input --model_id huggingface/model/id/or/path

The --type argument can be any of random, predicted, oracle, or no_input.

Evaluation

To compute accuracy across the different subsets of the dataset, run

python eval.py --results_file path/to/results/file

and to compute MSNR (Multimodal Signal over Noise Ratio), run

python eval.py --results_file path/to/results/file --no_input_results_file path/to/no/input/results/file --random_results_file path/to/random/results/file
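
For reference, the accuracy computation can be approximated with the short sketch below. It assumes a results file that is a JSON list of records with the entry id and the model's predicted answer; the actual format expected by eval.py may differ.

import json
from collections import defaultdict

# Hypothetical results format: [{"id": "r7", "prediction": "Scene A"}, ...]
with open("path/to/results/file") as f:
    results = {r["id"]: r["prediction"] for r in json.load(f)}

with open("data/discrn_balanced.json") as f:
    entries = {e["id"]: e for e in json.load(f)}

correct, total = defaultdict(int), defaultdict(int)
for eid, pred in results.items():
    entry = entries[eid]
    subset = "-".join(entry["modalities"])  # e.g. "audio-pc"
    total[subset] += 1
    correct[subset] += int(pred.strip().lower() == entry["answers"][0].strip().lower())

for subset in sorted(total):
    print(f"{subset}: {correct[subset] / total[subset]:.3f} ({total[subset]} examples)")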

Citation

TBA

Note

This repository and its data are intended for research purposes and reproducibility of the work only.
