This repository contains the code for the paper [TextTrojaners at CheckThat! 2024: Robustness of Credibility Assessment with Adversarial Examples through BeamAttack]. The paper describes the submission of TextTrojaners to CheckThat! 2024 lab task 6: Robustness of Credibility Assessment with Adversarial Examples, held at the Conference and Labs of the Evaluation Forum (CLEF) 2024 in Grenoble.
Many social media platforms employ machine learning for content filtering, attempting to detect content that is misleading, harmful, or simply illegal. For example, imagine the following message has been classified as harmful misinformation:
Water causes death! 100%! Stop drinking now! #NoWaterForMe
However, would it also be stopped if we changed ‘causes’ to ‘is responsible for’? Or to ‘cuases’, ‘caυses’ or ‘Causes’? Would the classifier still make the correct decision? How many changes do we need to make to trick it into flipping its decision? This is what we aim to find out in the shared task.
For more information, please refer to the shared task description: CheckThat! 2024 lab task 6: Robustness of Credibility Assessment with Adversarial Examples.
Our approach, BeamAttack, is a novel algorithm for generating adversarial examples in natural language processing through the application of beam search. To further improve the search process, we integrate a semantic filter that prioritizes examples with the highest semantic similarity to the original sample, enabling early termination of the search. Additionally, we leverage a model interpretability technique, LIME, to determine the priority of word replacements, alongside existing methods that determine word importance through the model’s logits. Our approach also allows words to be skipped or removed, enabling the discovery of minimal modifications that flip the label. Furthermore, we use a masked language model to predict contextually plausible alternatives to the words being replaced, improving the coherence of the generated adversarial examples. BeamAttack demonstrates state-of-the-art performance, outperforming existing methods with scores of up to 0.90 on the BiLSTM, 0.84 on BERT, and 0.82 on the RoBERTa classifier.
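To make the procedure concrete, below is a minimal, hedged sketch of the core beam-search loop. It is not the implementation in `beam_attack/`: the victim classifier, masked language model, sentence encoder and thresholds are arbitrary off-the-shelf choices made only for illustration, and words are visited left to right instead of by LIME- or logit-based importance.

```python
# Minimal, illustrative sketch of the beam-search idea behind BeamAttack.
# NOT the implementation in beam_attack/: the victim classifier, masked LM,
# sentence encoder, word order and thresholds are assumptions for this demo.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

victim = pipeline("text-classification",
                  model="distilbert-base-uncased-finetuned-sst-2-english")
unmasker = pipeline("fill-mask", model="roberta-base")   # proposes word replacements
encoder = SentenceTransformer("all-MiniLM-L6-v2")        # semantic similarity filter


def predicted_label(text: str) -> str:
    return victim(text)[0]["label"]


def candidates(words: list[str], i: int, top_k: int = 5) -> list[list[str]]:
    """Variants of `words`: masked-LM replacements for word i, deletion, or no change."""
    masked = " ".join(words[:i] + ["<mask>"] + words[i + 1:])
    outs = [words[:i] + [p["token_str"].strip()] + words[i + 1:]
            for p in unmasker(masked, top_k=top_k)]
    outs.append(words[:i] + words[i + 1:])  # removing the word is allowed
    outs.append(list(words))                # skipping (keeping) the word is allowed
    return outs


def beam_attack(text: str, beam_width: int = 3, sim_threshold: float = 0.8):
    original_label = predicted_label(text)
    original_emb = encoder.encode(text, convert_to_tensor=True)
    beam = [text.split()]
    # The paper orders words by LIME / logit-based importance; here we simply
    # go left to right to keep the sketch short.
    for i in range(len(beam[0])):
        scored = []
        for words in beam:
            if i >= len(words):
                continue
            for cand in candidates(words, i):
                cand_text = " ".join(cand)
                sim = util.cos_sim(
                    original_emb, encoder.encode(cand_text, convert_to_tensor=True)
                ).item()
                if sim < sim_threshold:
                    continue                      # semantic filter: discard drifting texts
                if predicted_label(cand_text) != original_label:
                    return cand_text              # early termination: label flipped
                scored.append((sim, cand))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        beam = [cand for _, cand in scored[:beam_width]] or beam  # keep best hypotheses
    return None                                   # no adversarial example found


print(beam_attack("Water causes death, stop drinking it now!"))
```

The semantic filter and the early exit on a flipped label correspond to the behaviour described above; everything else is simplified relative to the code in `beam_attack/`.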
This repository is structured into three primary directories:
- `beam_attack/`: This directory contains the implementation of our attack algorithm.
- `BODEGA/`: This submodule houses the evaluation functionality.
- `clef2024-checkthat-lab/`: This submodule stores the datasets and target models used in this project.
Additionally, we would like to draw attention to the following key file:
- `beamattack.ipynb`: This Jupyter Notebook demonstrates how our algorithm can be employed in an OpenAttack setup, with a focus on evaluating the BODEGA score. The example is based on the template provided in the shared task, which can be accessed at this link. We used this notebook to create our shared-task submission; a simplified sketch of this setup follows below.
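The sketch below outlines, under assumptions, what such a setup typically looks like: a custom attacker subclassing OpenAttack's `ClassificationAttacker`, evaluated with `AttackEval`. The helpers `run_beam_attack`, `load_bodega_victim`, and `load_bodega_dataset` are hypothetical placeholders, and the tag declarations follow OpenAttack's documented custom-attacker example; the actual victim loading and BODEGA-score computation are handled by the code in `BODEGA/` and the shared-task template, not shown here.

```python
# Hedged sketch of how a custom attacker is plugged into an OpenAttack-style
# evaluation, as in beamattack.ipynb. run_beam_attack(), load_bodega_victim()
# and load_bodega_dataset() are hypothetical placeholders; the real victim
# loading, data handling and BODEGA-score computation live in BODEGA/ and in
# the shared-task template.
import OpenAttack
from OpenAttack.tags import Tag, TAG_English  # as in OpenAttack's custom-attacker example


class BeamAttacker(OpenAttack.attackers.ClassificationAttacker):
    @property
    def TAGS(self):
        # Declare the language and that we only need the victim's predictions.
        return {TAG_English, Tag("get_pred", "victim")}

    def attack(self, victim, input_, goal):
        # Run the beam search (see the sketch above) against this victim and
        # return the adversarial text if it flips the label, else None.
        adv = run_beam_attack(input_, victim)                  # hypothetical helper
        if adv is not None and goal.check(adv, victim.get_pred([adv])[0]):
            return adv
        return None


victim = load_bodega_victim(task="RD", model="BERT")           # hypothetical helper
dataset = load_bodega_dataset(task="RD", split="attack")       # items shaped like {"x": text, "y": label}

attack_eval = OpenAttack.AttackEval(BeamAttacker(), victim)
attack_eval.eval(dataset, visualize=True)                      # success rate, queries, modification rate, ...
```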
The directory structure is as follows:
```
.
├── beam_attack/
├── BODEGA/
├── clef2024-checkthat-lab/
└── beamattack.ipynb
```
Clone this repository together with all its submodules:

```
git clone --recurse-submodules https://github.com//ml4nlp2-adversarial-attack
```
In case the submodules were not cloned automatically, you can fetch them with:

```
cd ml4nlp2-adversarial-attack/
git submodule update --init --recursive
```
The authors' email addresses are {arnisa.fazla, david.guzmanpiedrahita, lucassteffen.krauter}@uzh.ch.
Please contact david.guzmanpiedrahita@uzh.ch first if you have any questions.