The official repository for "Multi-Corpus Emotion Recognition Method based on Cross-Modal Gated Attention Fusion", Pattern Recognition Letters (submitted)
This study addresses key challenges in automatic emotion recognition (ER), particularly the limitations of single-corpus training that hinder generalizability. To overcome these issues, the authors introduce a novel multi-corpus, multimodal ER method evaluated using a leave-one-corpus-out (LOCO) protocol. This approach incorporates fine-tuned encoders for audio, video, and text, combined through a context-independent gated attention mechanism for cross-modal feature fusion.
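For illustration, a minimal Python sketch of the LOCO evaluation loop is given below; `train_on_corpora` and `evaluate_on_corpus` are hypothetical placeholders, not functions from this repository.

```python
# Leave-one-corpus-out (LOCO) protocol: each corpus is held out for testing
# while the model is trained on all remaining corpora.
# `train_on_corpora` and `evaluate_on_corpus` are hypothetical placeholders.
CORPORA = ["MOSEI", "MELD", "IEMOCAP", "AFEW"]

for held_out in CORPORA:
    train_corpora = [c for c in CORPORA if c != held_out]
    model = train_on_corpora(train_corpora)      # train on all other corpora
    score = evaluate_on_corpus(model, held_out)  # test on the unseen corpus
    print(f"LOCO test on {held_out}: {score:.3f}")
```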
- Visual Encoder: Fine-tuned ResNet-50.
- Acoustic Encoder: Fine-tuned emotional wav2vec2.
- Linguistic Encoder: Fine-tuned RoBERTa model.
- Audio/Video Segment Aggregation: Extraction of feature statistics (mean and standard deviation (STD) values).
- Multimodal Feature Aggregation: Cross-modal, context-independent gated attention mechanism.
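To make the two aggregation steps above more concrete, here is a minimal PyTorch sketch; the layer sizes, the exact gating formulation, and all class and function names are illustrative assumptions and do not reproduce the released implementation.

```python
# Minimal sketch of segment aggregation and gated cross-modal fusion.
# Dimensions, layer layout, and the gating formulation are assumptions for
# illustration; see src/ for the actual model.
import torch
import torch.nn as nn


def aggregate_segments(segment_features: torch.Tensor) -> torch.Tensor:
    """Collapse per-segment features (batch, time, dim) into a fixed-size
    vector by concatenating the mean and standard deviation over time."""
    return torch.cat([segment_features.mean(dim=1),
                      segment_features.std(dim=1)], dim=-1)


class GatedCrossModalFusion(nn.Module):
    """Context-independent gated fusion: each modality vector is weighted by
    a sigmoid gate computed from the concatenation of all three modalities."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # One gate per modality; all modalities are assumed to share `dim`.
        self.gates = nn.ModuleList([nn.Linear(3 * dim, dim) for _ in range(3)])
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, audio, video, text):
        joint = torch.cat([audio, video, text], dim=-1)
        gated = [torch.sigmoid(gate(joint)) * feat
                 for gate, feat in zip(self.gates, (audio, video, text))]
        return self.classifier(torch.cat(gated, dim=-1))
```

In this sketch the three modality vectors are assumed to share one dimensionality; the actual architecture, training objective, and hyperparameters are defined by the scripts in `src/`.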
The proposed method achieves state-of-the-art performance across multiple benchmark corpora, including MOSEI, MELD, IEMOCAP, and AFEW. The study reveals that models trained on MELD demonstrate superior cross-corpus generalization. Additionally, AFEW annotations show strong alignment with other corpora, resulting in the best cross-corpus performance. These findings validate the robustness and applicability of the method across diverse real-world scenarios.
The pre-trained models are available here.
To predict emotions for your multimodal files, configure the `config.toml` file with paths to the models and files, and then run `python src/inference.py`.
To train the multimodal model, first extract features from your data with `python src/avt_feature_extraction.py`.
Then, initiate training with `python src/train_avt_model.py`.
Ensure the `config.toml` file is properly configured for both steps.
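As a rough guide, a hypothetical `config.toml` layout is sketched below; every section and key name here is an assumption made for illustration, so consult the actual `config.toml` in this repository for the real fields.

```toml
# Hypothetical layout for illustration only; the real key names are defined
# by the config.toml shipped with this repository.
[models]
audio_encoder = "path/to/wav2vec2_checkpoint"
video_encoder = "path/to/resnet50_checkpoint"
text_encoder  = "path/to/roberta_checkpoint"
fusion_model  = "path/to/avt_fusion_checkpoint"

[data]
input_files  = "path/to/multimodal/files"
features_dir = "path/to/extracted/features"
```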
If you use our models in your research, please consider citing our paper:
@article{RYUMINA2024,
  title   = {Multi-Corpus Emotion Recognition Method based on Cross-Modal Gated Attention Fusion},
  author  = {Elena Ryumina and Dmitry Ryumin and Alexandr Axyonov and Denis Ivanko and Alexey Karpov},
  journal = {Pattern Recognition Letters},
  year    = {2024},
}
Parts of this project page were adapted from the Nerfies page.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.