How good is BERT? Comparing BERT to other state-of-the-art approaches on a large-scale French sentiment analysis dataset 📚
The contribution of this repository is threefold.
- Firstly, I introduce a new dataset for sentiment analysis, scraped from Allociné.fr user reviews. It contains 100k positive and 100k negative reviews divided into 3 balanced splits: train (160k reviews), val (20k) and test (20k). To my knowledge, no French dataset of this size is available on the internet.
- Secondly, I share my code for French sentiment analysis with BERT, based on CamemBERT and the 🤗Transformers library.
- Lastly, I compare BERT results with other state-of-the-art approaches, such as TF-IDF and fastText, as well as other methods based on non-contextual word embeddings.
If you want to experiment with the training code, follow these steps:
# Download repo and its dependencies
git clone https://github.com/TheophileBlard/french-sentiment-analysis-with-bert/
cd french-sentiment-analysis-with-bert
pipenv install
# Extract dataset
pushd allocine_dataset && tar xvjf data.tar.bz2 && popd
# Activate the virtualenv and open the BERT notebook
pipenv shell
jupyter notebook 03_bert.ipynb
But if you only need the model for inference, please refer to this paragraph.
The dataset is made available as .jsonl files, as well as a .pickle file.
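As a rough illustration of how the .jsonl files could be read, here is a minimal sketch using only the Python standard library. The helper name `read_jsonl` is hypothetical, and the field names `review` and `polarity` used in the usage note are assumptions based on the table columns, not something this README specifies.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Assuming the extracted archive contains a file such as `train.jsonl` (the exact path is an assumption), `read_jsonl(".../train.jsonl")` would return a list of dictionaries, one per review.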
Some examples from the training set are presented in the following table:
Review | Polarity |
---|---|
Magnifique épopée, une belle histoire, touchante avec des acteurs qui interprètent très bien leur rôles (Mel Gibson, Heath Ledger, Jason Isaacs...), le genre de film qui se savoure en famille! | Positive |
N'étant pas fan de SF, j'ai du mal à commenter ce film. Au moins, dirons nous, il n'y a pas d'effets spéciaux et le thème de ces 3 derniers survivants, un blanc, un maori, une blanche est assez bien traité. Mais c'est quand même bien longuet ! | Negative |
Les scènes s'enchaînent de manière saccadée, les dialogues sont théâtraux, le jeu des acteurs ne transcende pas franchement le film. Seule la musique de Vivaldi sauve le tout. Belle déception. | Negative |
For more information, please refer to the dedicated page.
The dataset is also available in the 🤗Datasets library; please refer to this paragraph.
Model | Validation Accuracy | Validation F1-Score | Test Accuracy | Test F1-Score |
---|---|---|---|---|
CamemBERT | 97.39 | 97.36 | 97.44 | 97.34 |
RNN | 94.39 | 94.34 | 94.58 | 94.39 |
TF-IDF + LogReg | 94.35 | 94.29 | 94.38 | 94.19 |
CNN | 93.69 | 93.72 | 94.10 | 93.98 |
fastText (unigrams) | 92.88 | 92.75 | 92.90 | 92.57 |
CamemBERT outperforms all other models by a large margin.
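For readers who want to recompute the two metrics in the table, the sketch below shows one common way to obtain accuracy and binary F1 from lists of labels. It assumes that polarity `1` is the positive class and that F1 is computed for that class only; the README does not state which F1 variant was used, so treat this as an illustration rather than the exact evaluation code.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) as fractions in [0, 1] for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

Multiplying both values by 100 gives percentages comparable to those in the table.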
Test accuracy as a function of training dataset size.
With only 500 training examples, CamemBERT already shows better results than any other model trained on the full dataset. This is the power of modern language models and self-supervised pre-training.
For this kind of task, RNNs need a lot of data (>100k examples) to perform well. The same result (for English) is empirically observed by Alec Radford in these slides.
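A learning curve like the one discussed above requires training on subsets of increasing size. The README does not describe how the subsets were drawn; the sketch below shows one plausible approach, a class-balanced random subsample. The function name `balanced_subset`, the `polarity` key and the fixed seed are all assumptions made for illustration.

```python
import random


def balanced_subset(examples, n, label_key="polarity", seed=0):
    """Draw n examples with the same number of examples per class."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)
    if not by_label:
        raise ValueError("examples must not be empty")
    per_class, remainder = divmod(n, len(by_label))
    if remainder:
        raise ValueError("n must be divisible by the number of classes")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"not enough examples for label {label!r}")
        subset.extend(rng.sample(pool, per_class))
    rng.shuffle(subset)
    return subset
```

Calling it with, for example, `n=500` and then larger values would produce the training sets for successive points on the curve.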
Time taken by a model to perform a single prediction (averaged over 1000 predictions).
As one would expect, the slowest model is CamemBERT, followed by TF-IDF.
On the other hand, fastText performs the ... fastest, but is actually slow compared to the original implementation, because of the overhead of Python and Keras.
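The measurement itself is not shown in this README. A minimal sketch of how per-prediction latency could be averaged is given below; `mean_latency` and its `predict` argument are hypothetical names, and the choice of `time.perf_counter` and of calling the model on one text at a time (no batching) are assumptions.

```python
import time


def mean_latency(predict, texts):
    """Average wall-clock time, in seconds, of one call to predict()."""
    if not texts:
        raise ValueError("texts must not be empty")
    start = time.perf_counter()
    for text in texts:
        predict(text)  # one example at a time, no batching
    return (time.perf_counter() - start) / len(texts)
```

In practice it is usually worth running a few warm-up predictions first, so that one-off initialization costs are not counted in the average.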
I considered the text classification task from FLUE (French Language Understanding Evaluation) to evaluate the cross-domain generalization capabilities of the models. This is also a binary classification task, but on Amazon product reviews.
There is one train and test set for each product category (books, DVD and music). The train and test sets are balanced, including around 1000 positive and 1000 negative reviews, for a total of 2000 reviews in each dataset.
I didn't do any additional training, only inference on the test sets. The resulting accuracies are reported in the following table:
Model | Books | DVD | Music |
---|---|---|---|
CamemBERT | 94.10 | 93.25 | 94.55 |
TF-IDF + LogReg | 87.10 | 88.10 | 87.45 |
CNN | 85.80 | 88.75 | 87.25 |
RNN | 85.30 | 87.55 | 87.50 |
fastText (unigrams) | 85.25 | 87.10 | 86.65 |
Without additional training on domain-specific data, the CamemBERT model outperforms the fine-tuned CamemBERT & FlauBERT models reported in (Le et al., 2020). Update: FlauBERT (Large) released 03/20 gets better results, but it is excessively heavy.
TF-IDF + LogReg also performs better than specifically-trained mBERT (Eisenschlos et al., 2019).
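The inference-only evaluation described above can be sketched as a loop over the per-category test sets. This is a simplified illustration: `accuracy_by_category` is a hypothetical helper, and the `(text, label)` pair format with labels `1` (positive) and `0` (negative) is an assumption about how the FLUE data would be loaded.

```python
def accuracy_by_category(predict, test_sets):
    """Return accuracy (in %) per category, without any training step.

    test_sets maps a category name to a list of (text, label) pairs.
    """
    results = {}
    for category, examples in test_sets.items():
        correct = sum(1 for text, label in examples if predict(text) == label)
        results[category] = 100.0 * correct / len(examples)
    return results
```

With the 🤗Transformers pipeline, `predict` could be something like `lambda text: int(nlp(text)[0]["label"] == "POSITIVE")`, assuming the pipeline returns a list of dictionaries with a `label` key, as the `POSITIVE` / `NEGATIVE` outputs in the inference example suggest.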
The CamemBERT model is now part of the 🤗Transformers library! You can retrieve it and perform inference with the following code:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine")
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(nlp("Alad'2 est clairement le meilleur film de l'année 2018.")) # POSITIVE
print(nlp("Juste whoaaahouuu !")) # POSITIVE
print(nlp("NUL...A...CHIER ! FIN DE TRANSMISSION.")) # NEGATIVE
print(nlp("Je m'attendais à mieux de la part de Franck Dubosc !")) # NEGATIVE
The dataset is also available in 🤗Datasets. To download it and start training your own model, simply use:
from datasets import load_dataset
train_ds, val_ds, test_ds = load_dataset(
    'allocine',
    split=['train', 'validation', 'test']
)
Open the online demo on Google Colab:
- 0.4.0
    - Uploaded model to https://huggingface.co/tblard/tf-allocine
    - Uploaded the dataset to https://huggingface.co/datasets/viewer/?dataset=allocine
- 0.3.0
    - Added Google Colab online demo
- 0.2.0
    - Added inference time + generalizability
- 0.1.0
    - First proper release
    - Learning curves & results for all models
- 0.0.1
    - Work in progress
- Dataset available
- Models available
- Results on full dataset
- Learning curves
- Inference time
- Generalizability
- Online demo
- Hugging Face integration
- Predicting usefulness
Théophile Blard – 📧 theophile.blard@gmail.com
If you use this work (code or dataset), please cite as:
Théophile Blard, French sentiment analysis with BERT, (2020), GitHub repository, https://github.com/TheophileBlard/french-sentiment-analysis-with-bert