WikiSplit++ enhances the original WikiSplit by applying two techniques: filtering through NLI classification and sentence-order reversing. These techniques remove noise and reduce hallucinations relative to the original WikiSplit.
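As a rough sketch of these two ideas (the exact models, prompts, and thresholds used to build WikiSplit++ may differ), an off-the-shelf NLI classifier can keep only pairs whose complex sentence entails every simplified sentence, and the target can be written in reverse sentence order. The model name `roberta-large-mnli` and the helper names below are illustrative assumptions:

```python
# Illustrative sketch only; not the exact pipeline used to build WikiSplit++.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumption: any MNLI-style classifier works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts that `premise` entails `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax())] == "ENTAILMENT"


def keep(complex_sent: str, simple_sents: list[str]) -> bool:
    # NLI filtering: retain a pair only when the complex sentence entails
    # every simple sentence, dropping likely hallucinated splits.
    return all(entails(complex_sent, s) for s in simple_sents)


def reversed_target(simple_sents: list[str]) -> str:
    # Sentence-order reversing: generate the simple sentences in reverse
    # order, which discourages the model from simply copying left to right.
    return " ".join(reversed(simple_sents))
```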
The preprocessed WikiSplit dataset that formed the basis for WikiSplit++ can be found here.
The train split of WikiSplit++ includes the train, val, and tune splits of WikiSplit. The origin of each item (the train, val, or tune split of WikiSplit) is recorded in the `split` field of the data. We did not use the test split of WikiSplit because it was used to construct Wiki-BM. For intrinsic evaluations, we re-divided the train, val, and tune splits of WikiSplit into new train, val, and test splits.
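For example, the provenance of each train item can be inspected directly. A minimal sketch, assuming the placeholder dataset id below is replaced with the actual id from the HuggingFace page:

```python
# Minimal sketch: count where the WikiSplit++ train items originally came from.
from collections import Counter

from datasets import load_dataset

# Placeholder dataset id; substitute the actual id from the HuggingFace page.
ds = load_dataset("your-org/wikisplit-pp", split="train")

# Each item carries a `split` field naming its WikiSplit origin (train/val/tune).
print(Counter(ds["split"]))
```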
```bash
# install dependencies
rye sync
source ./.venv/bin/activate

# download and preprocess the datasets
bash src/download.sh
bash src/create-datasets.sh
```
```bash
python src/train.py \
  --method "split_reverse" \
  --model_name "t5-small" \
  --dataset_dir "./datasets" \
  --dataset_name "wiki-split/entailment"
```
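Here, `--method "split_reverse"` presumably enables training with sentence-order reversing, and `--dataset_name "wiki-split/entailment"` presumably selects the NLI-filtered subset, matching the two techniques described above.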
This software is released under the NTT License; see LICENSE.txt.
Under this license, pull requests are not permitted, but please feel free to open issues.
Our dataset is publicly available on HuggingFace under the CC BY-SA 4.0 license.