We created a benchmark for learning with noisy labels. More details and experiments can be found in our EMNLP 2024 paper.
NoiseBench is a benchmark for measuring the impact of real label noise on named entity recognition. It is based on a subset of the English CoNLL-03 dataset and consists of one clean (ground-truth) label set and six noisy variants:
- Clean
- Expert noise
- Crowd noise
- Crowd noise (best-case)
- Distant supervision noise
- Weak supervision noise
- LLM noise
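Once the variants have been generated as described below, each one can be loaded like any other column-formatted NER corpus. The following minimal sketch uses Flair's `ColumnCorpus`; the variant file names and the clean dev/test files are assumptions, so adjust them to whatever `data/noisebench` actually contains.

```python
from flair.datasets import ColumnCorpus

# Minimal sketch: load one NoiseBench variant as a two-column (token, NER tag)
# corpus. The file names below are assumptions -- check the files produced in
# data/noisebench and adjust accordingly.
corpus = ColumnCorpus(
    data_folder="data/noisebench",
    column_format={0: "text", 1: "ner"},  # IOB2 NER tags in the second column
    train_file="noisy_crowd.train",       # e.g. the crowd-noise variant
    dev_file="clean.dev",                 # clean labels for dev/test (assumption)
    test_file="clean.test",
)
print(corpus)
```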
We provide the annotation-only files in `data/annotations`. The annotations follow the IOB2 scheme. Because of the license of the Reuters Corpus that CoNLL-03 is based on, the tokens in the included sentences are masked with `[TOK]`. We use the CleanCoNLL annotations as the ground truth.
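To illustrate the format, here is a small sketch that reads one of the annotation-only files and counts entity mentions from the IOB2 tags. It assumes a CoNLL-style layout with the masked token `[TOK]` in the first column and the NER tag in the last column, one token per line and blank lines between sentences; verify this against the actual files in `data/annotations`.

```python
from collections import Counter
from pathlib import Path

def count_entities(path: str) -> Counter:
    """Count IOB2 entity mentions per type in an annotation-only file.

    Assumed layout: '[TOK] <tag>' per line, blank line between sentences.
    """
    counts = Counter()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue                  # sentence boundary
        tag = line.split()[-1]        # NER tag assumed to be the last column
        if tag.startswith("B-"):      # each B-* tag starts a new mention
            counts[tag[2:]] += 1
    return counts

# Hypothetical file name -- adjust to the actual files in data/annotations.
print(count_entities("data/annotations/crowd.train"))
```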
The `create_noisebench.sh` script generates the NoiseBench dataset variants in `data/noisebench`.
- Run the script (if the `git clone` command from Option 1 is not available):

  ```bash
  bash create_noisebench.sh
  ```
- Download the full CleanCoNLL dataset into the `data/cleanconll` folder, following the instructions at https://github.com/flairNLP/CleanCoNLL.git.
- Create the noisy datasets:

  ```bash
  python scripts/generate_data_files.py
  ```
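Conceptually, this last step re-inserts the real tokens: each `[TOK]` placeholder in an annotation file is replaced by the corresponding token from your local CleanCoNLL copy. The sketch below shows the idea only, assuming both files are token-aligned and CoNLL-formatted with the token in the first column; the actual logic lives in `scripts/generate_data_files.py`.

```python
# Sketch of the token-restoration idea (not the actual script): walk the
# masked annotation file and the token source in lockstep and swap [TOK]
# for the real token. Paths and line alignment are assumptions.
def restore_tokens(annotation_path: str, token_source_path: str, out_path: str) -> None:
    with open(annotation_path, encoding="utf-8") as ann, \
         open(token_source_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for ann_line, src_line in zip(ann, src):
            if not ann_line.strip():          # keep sentence boundaries as-is
                out.write("\n")
                continue
            tag = ann_line.split()[-1]        # NER tag from the annotation file
            token = src_line.split()[0]       # real token from CleanCoNLL
            out.write(f"{token} {tag}\n")

restore_tokens(
    "data/annotations/crowd.train",           # hypothetical file names
    "data/cleanconll/cleanconll.train",
    "data/noisebench/noisy_crowd.train",
)
```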
- Requirements:

  ```bash
  conda create -n noisebench python=3.10
  conda activate noisebench
  pip install -r requirements.txt
  ```
- Run the main experiment script:

  ```bash
  python main.py --config configs/exp1_real_noise.json
  ```
- Run the simulated noise generation (a conceptual sketch of noise injection follows this list):

  ```bash
  python scripts/calculate_data_overviews.py
  python scripts/create_simulated_noisy_sets.py
  ```
- Run the main experiment script:

  ```bash
  python main.py --config configs/exp1_simulated_noise.json
  ```
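The simulated-noise comparison relies on synthetically corrupted label sets. As a rough illustration (not the logic of `create_simulated_noisy_sets.py`), the sketch below injects uniform noise by flipping each entity mention's type to a different type with probability `p`; the actual script and the paper use their own noise models and target noise levels.

```python
import random

ENTITY_TYPES = ["PER", "ORG", "LOC", "MISC"]  # CoNLL-03 entity types

def corrupt_sentence(tags: list[str], p: float, rng: random.Random) -> list[str]:
    """Uniform-noise sketch: with probability p, relabel a mention's type.

    Operates on IOB2 tags; only the entity type is swapped, span boundaries
    are kept. This is an illustration, not the NoiseBench noise model.
    """
    new_tags = list(tags)
    i = 0
    while i < len(new_tags):
        if new_tags[i].startswith("B-"):
            j = i + 1                          # find the end of the mention
            while j < len(new_tags) and new_tags[j].startswith("I-"):
                j += 1
            if rng.random() < p:
                old = new_tags[i][2:]
                new = rng.choice([t for t in ENTITY_TYPES if t != old])
                for k in range(i, j):
                    new_tags[k] = new_tags[k][:2] + new
            i = j
        else:
            i += 1
    return new_tags

rng = random.Random(42)
print(corrupt_sentence(["B-PER", "I-PER", "O", "B-LOC"], p=0.5, rng=rng))
```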
The `scripts/generate_german_data_files.py` script generates the German version in `data/noisebench_german`.
- Get the full dataset following the instructions here: https://www.clips.uantwerpen.be/conll2003/ner/. After this step, the full dataset (the files `deu.train`, `deu.testa`, `deu.testb`) should be in the `data/conll_german` directory.
- Run:

  ```bash
  python scripts/generate_german_data_files.py
  ```
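Before running the script, it can help to sanity-check that the three German files are in place. A minimal sketch, assuming blank-line-separated CoNLL format (the original CoNLL-2003 German release is typically Latin-1 encoded, so adjust the encoding if your copy differs):

```python
from pathlib import Path

# Quick sanity check for the German CoNLL-03 files (sketch; encoding is an assumption).
for name in ["deu.train", "deu.testa", "deu.testb"]:
    path = Path("data/conll_german") / name
    if not path.exists():
        raise FileNotFoundError(f"Missing {path} -- see the download instructions above.")
    text = path.read_text(encoding="latin-1")
    n_sentences = sum(1 for block in text.split("\n\n") if block.strip())
    print(f"{name}: {n_sentences} sentences")
```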