ispamm/GRAM


If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.


📰 News

  • [2025.01.22] 🔥🔥🔥 The paper has been accepted at ICLR 2025! See you in Singapore!
  • [2024.12.18] 🔥🔥🔥 The checkpoints are available here!
  • [2024.12.18] The code is now available! You are welcome to watch 👀 this repository for the latest updates.
  • [2024.12.17] The paper has been published on arXiv 🎉. The PDF version is available here!

😮 Highlights

💡 Radical change in the field of multimodal contrastive learning

GRAM learns and then aligns modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the k-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously.
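
To make the geometric idea concrete, here is a minimal NumPy sketch of the volume computation. It is an illustration under assumptions, not the repository's implementation: the function name gram_volume and the toy embeddings are hypothetical. It assumes each modality embedding is L2-normalized and takes the volume as the square root of the determinant of the Gram matrix.

```python
import numpy as np


def gram_volume(vectors: np.ndarray) -> float:
    """Volume of the parallelotope spanned by the k rows of `vectors` (shape (k, d)).

    Hypothetical helper for illustration: each row is L2-normalized first,
    then the volume is sqrt(det(G)) with G the (k, k) Gram matrix.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = unit @ unit.T
    det = np.linalg.det(gram)
    # Clamp tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(det, 0.0)))


rng = np.random.default_rng(0)

# Toy "text", "video" and "audio" embeddings that almost coincide.
text = rng.normal(size=512)
aligned = np.stack([
    text,
    text + 0.01 * rng.normal(size=512),
    text + 0.01 * rng.normal(size=512),
])

# Three unrelated embeddings, nearly orthogonal in high dimension.
unrelated = rng.normal(size=(3, 512))

vol_aligned = gram_volume(aligned)
vol_unrelated = gram_volume(unrelated)
print(f"aligned volume:   {vol_aligned:.6f}")
print(f"unrelated volume: {vol_unrelated:.6f}")
```

For k = 2 the volume equals the sine of the angle between the two unit vectors, so a small volume plays the same role as a high cosine similarity. For more than two modalities it measures all of them jointly.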

🔥 SOTA performance in almost all retrieval tasks

GRAM can replace cosine similarity in any downstream method: it holds for 2 to n modalities and provides more meaningful alignment than previous similarity measures. Moreover, the novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification.
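
As a rough illustration of how a volume can stand in for cosine similarity inside a contrastive objective, the sketch below uses the negative volume, scaled by a temperature, as the logit of an InfoNCE-style loss. Everything here is an assumption for illustration: the function names, the choice of text as the anchor, the temperature of 0.07, and the symmetric averaging. The exact loss used in the paper and in this repository may differ.

```python
import numpy as np


def _normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def volume_matrix(anchor: np.ndarray, others: list) -> np.ndarray:
    """V[i, j] = volume spanned by anchor i and the other modalities of sample j.

    anchor: (B, d) array; others: list of (B, d) arrays, one per extra modality.
    """
    anchor = _normalize(anchor)
    others = [_normalize(o) for o in others]
    batch = anchor.shape[0]
    vol = np.empty((batch, batch))
    for i in range(batch):
        for j in range(batch):
            vecs = np.stack([anchor[i]] + [o[j] for o in others])  # (k, d)
            det = np.linalg.det(vecs @ vecs.T)
            vol[i, j] = np.sqrt(max(det, 0.0))
    return vol


def _cross_entropy_diagonal(logits: np.ndarray) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def volume_contrastive_loss(anchor, others, temperature=0.07):
    """Hypothetical InfoNCE-style loss: smaller volume means larger logit."""
    logits = -volume_matrix(anchor, others) / temperature
    # Average the anchor-to-data and data-to-anchor directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits.T))


rng = np.random.default_rng(0)
text = rng.normal(size=(4, 64))

# Matched batch: video/audio of sample i are noisy copies of text i.
video = text + 0.2 * rng.normal(size=(4, 64))
audio = text + 0.2 * rng.normal(size=(4, 64))
loss_aligned = volume_contrastive_loss(text, [video, audio])

# Unmatched batch: the other modalities carry no information about the text.
loss_random = volume_contrastive_loss(
    text, [rng.normal(size=(4, 64)), rng.normal(size=(4, 64))]
)
print(f"loss (matched):   {loss_aligned:.4f}")
print(f"loss (unmatched): {loss_random:.4f}")
```

The double loop keeps the sketch readable. A practical implementation would batch the Gram matrices and compute the determinants in one call.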

👀 Multimodal alignment unlocks new and fancy downstream tasks

An aligned shared latent space among n modalities is a strong baseline for any downstream task that relies on embedding extraction. The results obtained in this paper will lead to superior performance in existing downstream tasks (T2I, T2V, V2A, etc.) and also unlock new tasks such as image-to-audio generation or image generation conditioned on text and audio.

🚀 Main Results

Building Environment

GRAM is implemented in PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The other required packages are listed in preinstall.sh.

conda create -n gram python=3.9
conda activate gram
sh preinstall.sh
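
After the installation, a quick check like the following can confirm which Python, PyTorch, and CUDA versions are active in the new environment. This is an optional sketch, not part of the repository; the helper name describe_environment is hypothetical, and the script reports missing packages instead of failing.

```python
import importlib.util
import platform


def describe_environment() -> dict:
    """Collect the interpreter and (if installed) PyTorch/CUDA versions."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "cuda_available": False,
    }
    # Only import torch if it is actually installed in this environment.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None for CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```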

Download basic encoder's pretrained checkpoints

Create a directory named pretrained_weights under the main working directory.

  1. Download the EVA-CLIP weights:
wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt
  2. Download the BEATs weights (BEATs_iter3_plus_AS2M.pt) from https://github.com/microsoft/unilm/tree/master/beats and place them in pretrained_weights/beats/

  3. Download the BERT weights:

# Download BERT-base (uncased) and its tokenizer from the Hugging Face Hub
from transformers import BertModel, BertTokenizer
bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Save both locally so they can be loaded from pretrained_weights/
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')

The resulting pretrained_weights directory should look as follows:

    ├── pretrained_weights
    │   ├── beats
    │   │   └── BEATs_iter3_plus_AS2M.pt
    │   ├── bert
    │   │   └── bert-base-uncased
    │   ├── clip
    │   │   └── EVA01_CLIP_g_14_psz14_s11B.pt
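
Before launching a run, it can help to verify that the layout above is in place. The following sketch is not part of the repository, and the helper name missing_weights is hypothetical. It only checks that the three expected entries exist under a given root.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_WEIGHTS = [
    "beats/BEATs_iter3_plus_AS2M.pt",
    "bert/bert-base-uncased",
    "clip/EVA01_CLIP_g_14_psz14_s11B.pt",
]


def missing_weights(root="pretrained_weights"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_WEIGHTS if not (root / rel).exists()]


missing = missing_weights()
if missing:
    print("Missing pretrained weights:")
    for rel in missing:
        print(f"  pretrained_weights/{rel}")
else:
    print("All pretrained weights found.")
```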

MODEL ZOO

All models are available here!

| Name | Training Dataset | Testing Dataset | R@1 in Testing Dataset | Link |
| --- | --- | --- | --- | --- |
| GRAM_pretrained_5modalities | Vast27M 150k Subset TVAS | MSRVTT | 54.8 | link |
| GRAM_pretrained_4modalities | Vast27M 150k Subset TVASD | MSRVTT | 55.3 | link |
| GRAM_finetuned_MSRVTT | MSRVTT | MSRVTT | 64.0 | link |
| GRAM_finetuned_DIDEMO | DIDEMO | DIDEMO | 67.3 | link |
| GRAM_finetuned_ANET | ActivityNet | ActivityNet | 69.9 | link |
| GRAM_finetuned_VATEX | VATEX | VATEX | 87.7 | link |

Download the entire folder, which consists of two subfolders, "log" and "ckpt". Place the folder wherever you prefer and record its location for later commands.

An example of the paths after the download could be as follows:

    ├── pretrained_models
    │   ├── GRAM_pretrained_4modalities
    │   │   ├── log
    │   │   ├── ckpt

Download VAST-27M annotations for pretraining

The VAST-27M dataset can be downloaded by following the instructions in the official repo.

We used a subset of VAST-27M for the pretraining phase of GRAM. The annotation file we used is available here.

Finetune Model on the 150k subset of VAST27M

Download the annotations150k.json subset file, then reference it in scripts/gram/finetune_ret.sh and in config/gram/finetune_cfg/finetune-area.json. Then run:

sh scripts/gram/finetune_ret.sh

Finetune Model on downstream datasets

Edit the configuration inside scripts/gram/finetune_ret.sh and then run:

sh scripts/gram/finetune_ret.sh

Test your finetuned Model

For example, suppose the command for finetuning the retrieval model is as follows:

python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/gram/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $PATH-TO-CKPT-FOLDER \
--output_dir $PATH-WHERE-TO-STORE-RESULTS \

To test the model, append the following two lines to the command:

--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt
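
If you prefer to assemble the command from a script, the sketch below builds the same argument list and appends the two testing flags only when a checkpoint is given. It is a convenience wrapper written for this example, not part of the repository: the function name build_command and the placeholder paths are hypothetical, while the flags and values are copied from the commands above.

```python
import shlex


def build_command(pretrain_dir, output_dir, checkpoint=None, nproc_per_node=8):
    """Return the finetuning command as an argument list.

    When `checkpoint` is given, the two testing flags are appended.
    """
    cmd = [
        "python3", "-m", "torch.distributed.launch",
        "--nnodes", "1",
        "--node_rank", "0",
        "--nproc_per_node", str(nproc_per_node),
        "--master_port", "9834",
        "./run.py",
        "--learning_rate", "2e-5",
        "--checkpointing", "true",
        "--first_eval", "true",
        "--save_best", "true",
        "--config", "./config/gram/finetune_cfg/retrieval-msrvtt.json",
        "--pretrain_dir", pretrain_dir,
        "--output_dir", output_dir,
    ]
    if checkpoint is not None:
        cmd += ["--mode", "testing", "--checkpoint", checkpoint]
    return cmd


# Placeholder paths: replace them with your own locations.
test_cmd = build_command(
    "/PATH/TO/CKPT_FOLDER",
    "/PATH/TO/RESULTS",
    checkpoint="/PATH/TO/SAVED_CHECKPOINT.pt",
)
print(shlex.join(test_cmd))
```

The resulting list can be passed to subprocess.run(test_cmd, check=True) from the repository root.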

Citation

If you find this code useful for your research, please consider citing the following paper:

@misc{cicchetti2024gramianmultimodalrepresentationlearning,
      title={Gramian Multimodal Representation Learning and Alignment}, 
      author={Giordano Cicchetti and Eleonora Grassucci and Luigi Sigillo and Danilo Comminiello},
      year={2024},
      eprint={2412.11959},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.11959}, 
}

Third-Party Licenses

For the full list of third-party licenses used in this project, please see the THIRD_PARTY_LICENSES.md file.