DisCRn Benchmark: Evaluating Discriminatory Cross-Modal Reasoning in Audio, Video, Image, and 3D

[🍎 Project Page] [📖 arXiv Paper] [📊 Dataset]

Overview

Data

We store the data in an easy-to-use JSON format. The final filtered dataset can be found in data/discrn_balanced.json and follows the format below:

{
    "id": "r7",
    "selection_type": "random",
    "q_type": "mc_2",
    "examples": [
        {
            "source": "clothov1_instruct_val",
            "id": "street 2.wav",
            "caption": "A busy street with a car shifting gears in traffic"
            "url": "" # only if available
        },
        {
            "source": "objaverse_pointllm_val",
            "id": "760c0d78327b4846975061c6cd8fd004",
            "caption": "a red sports car with black wheels."
            "url": "" # only if available
        }
    ],
    "modalities": [
        "audio",
        "pc"
    ],
    "questions": [
        "Which scene  evokes more motion?"
    ],
    "answers": [
        "Scene A"
    ],
    "category": "Motion"
}
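
For reference, here is a minimal sketch of loading and inspecting an entry with the Python standard library; it assumes data/discrn_balanced.json holds a list of entries in the format above.

import json

# Load the filtered benchmark (assumed to be a list of entries as shown above).
with open("data/discrn_balanced.json") as f:
    data = json.load(f)

entry = data[0]

# The i-th modality corresponds to the i-th example.
for example, modality in zip(entry["examples"], entry["modalities"]):
    print(modality, example["source"], example["caption"])

print("Q:", entry["questions"][0], "| A:", entry["answers"][0])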

Data Description

Each dataset entry consists of a unique id, the negative sampling type (selection_type), information about the source examples across different modalities, the question and answer, the question category, and the respective model answers across permutations.

Structure

  • id: Unique identifier for the dataset entry.
  • selection_type: The method used for selecting negative examples.
  • q_type: The question type, indicating the number of choices.
  • examples:
    • source: The dataset from which the example is taken.
    • id: A unique identifier for the example within its source.
    • caption: A description of the content or scene depicted in the example.
    • url: The URL to the example, if available.
  • modalities: The modalities of the provided examples; the i-th entry in modalities is the modality of the i-th example in examples.
  • questions: Example question.
  • answers: Ground truth answer.
  • category: Question category (predicted using in-context learning with LLaMA-2 13B).

Data Sources

Download the source data for DisCRn.

Image Data

Audio Data

We recommend using aac-datasets to download the audio data.

from aac_datasets import AudioCaps, Clotho

# AudioCaps validation split
audiocaps = AudioCaps(root="/path/to/save/folder", subset="val", download=True)

# Clotho evaluation and validation splits
clotho_eval = Clotho(root="/path/to/save/folder", subset="eval", download=True)
clotho_val = Clotho(root="/path/to/save/folder", subset="val", download=True)

3D Data

  • Objaverse: the formatted data for OneLLM and X-InstructBLIP can be found in objaverse_pc_parallel here. For CREMA, the Objaverse data should be preprocessed as described in 3D-LLM.
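
Purely as an illustration (the on-disk layout of the point cloud data is not specified here, so the path and file extension below are assumptions), loading a single preprocessed object might look like:

import numpy as np

# Hypothetical layout: one array per Objaverse object id.
# Adjust the path/extension to the actual structure of the downloaded data.
pc = np.load("/path/to/objaverse_pc_parallel/760c0d78327b4846975061c6cd8fd004.npy")
print(pc.shape)  # e.g. (num_points, channels) such as xyz (+ rgb)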

Video Data

Update Data Directory Roots

We provide data/data2path.json, which maps each file id to its directory. For each dataset we include a directory field, which should be updated to the corresponding data root of the datasets downloaded according to the instructions above.
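
One possible way to update the roots in bulk is sketched below; the directory field comes from the description above, while the dataset keys and paths are placeholders and should be replaced with the names actually present in data2path.json and your local download locations.

import json

with open("data/data2path.json") as f:
    data2path = json.load(f)

# Illustrative mapping from dataset name to local data root (placeholders).
local_roots = {
    "clothov1_instruct_val": "/data/clotho",
    "objaverse_pointllm_val": "/data/objaverse_pc_parallel",
}

# Point each dataset's directory field at its local root.
for name, root in local_roots.items():
    if name in data2path:
        data2path[name]["directory"] = root

with open("data/data2path.json", "w") as f:
    json.dump(data2path, f, indent=4)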

Baselines

Cross-modal Baselines

Install Baseline Repositories

In cross_modal_baselines/ clone the corresponding code for each baseline and follow the instructions to create separate environments for each of them.

  • LAVIS/ (X-InstructBLIP). To set up X-InstructBLIP follow the instructions below.
cd cross_modal_baselines/
git clone https://github.com/salesforce/LAVIS.git
conda create -n lavis python=3.8
pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
wget -P /usr/bin https://github.com/unlimblue/KNN_CUDA/raw/master/ninja

To run the baseline, run

conda activate lavis
python xinstructblip.py
  • CREMA/. To set up CREMA follow the instructions below.
git clone https://github.com/Yui010206/CREMA.git
conda create -n crema python=3.8
conda activate crema 
cd CREMA
pip install -e .

To run the baseline, run

conda activate crema
python crema.py
  • OneLLM/. To set up OneLLM follow the instructions below.
git clone https://github.com/csuhan/OneLLM
cd OneLLM
conda create -n onellm python=3.9 -y
conda activate onellm
pip install -r requirements.txt
cd model/lib/pointnet2
python setup.py install

To run the baseline, run

conda activate onellm
python onellm.py

Caption Baselines

Install the requirements in caption_models/requirements.txt, then run

cd caption_models
python caption_baseline.py --type type_of_input --model_id huggingface/model/id/or/path

The --type argument can be any of random, predicted, oracle, or no_input.

Evaluation

To compute accuracy across different subsets of the dataset, run

python eval.py --results_file path/to/results/file

and to compute MSNR (Multimodal Signal over Noise Ratio), run

python eval.py --results_file path/to/results/file --no_input_results_file path/to/no/input/results/file --random_results_file path/to/random/results/file

Citation

TBA

Note

This repository and its data are intended for research purposes and reproducibility of the work only.