This repository contains the PyTorch implementation of the models and experiments from the paper Edisum: Summarizing and Explaining Wikipedia Edits at Scale.
@article{šakota2024edisum,
title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
journal={arXiv preprint arXiv:2404.03428},
year={2024}
}
Please consider citing our work if you find the provided resources useful.
Start by cloning the repository:
git clone https://github.com/epfl-dlab/edisum.git
We recommend creating a new conda virtual environment as follows:
conda env create -f environment.yml
This command also installs all the necessary packages.
The data is available on Hugging Face and can be loaded with:
from datasets import load_dataset
dataset = load_dataset("msakota/edisum_dataset")
Alternatively, to download the collected data for the experiments, run:
bash ./download_data.sh
To download the trained models (available on Hugging Face), run:
bash ./download_models.sh
To train a model from scratch on the desired data, run:
DATA_DIR="./data/100_perc_synth_data/" # specify a directory where training data is located
RUN_NAME="train_longt5_100_synth"
python run_train.py run_name=$RUN_NAME dir=$DATA_DIR +experiment=finetune_longt5
To run inference on a trained model:
DATA_DIR="./data/100_perc_synth_data/" # specify the directory where the data is located
CHECKPOINT_PATH="./models/edisum_100.ckpt" # specify path to the trained model
RUN_NAME="inference_longt5_100_synth"
python run_inference.py run_name=$RUN_NAME dir=$DATA_DIR checkpoint_path=$CHECKPOINT_PATH +experiment=inference_longt5
To test any of the trained models on an arbitrary edit diff link:
python run_model.py --model_name_or_path edisum_100 --diff_link "https://en.wikipedia.org/w/index.php?title=C/2023_A3_(Tsuchinshan–ATLAS)&diff=prev&oldid=1251441412"
Optionally, you can stop generation when the edit contains node changes (since the generated summary might not reflect those changes exhaustively) by adding -prohibit_node. If no model_name_or_path is provided, the script defaults to edisum_100. You can provide a path to any .ckpt model, or specify one of the five models from the paper: [edisum_0, edisum_25, edisum_50, edisum_75, edisum_100], where the number is the percentage of synthetic data in the training dataset.
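As a rough illustration of what the diff link encodes, the sketch below pulls the page title and revision ID out of a Wikipedia diff URL using only the standard library. The helper name parse_diff_link is hypothetical and is not part of this repository; how run_model.py actually processes the link is not shown here.

```python
from urllib.parse import urlsplit, parse_qs


def parse_diff_link(diff_link):
    """Extract the query parameters that identify an edit.

    Hypothetical helper for illustration only; it is not the
    repository's implementation.
    """
    query = parse_qs(urlsplit(diff_link).query)
    return {
        "title": query["title"][0],       # page title
        "diff": query["diff"][0],         # "prev" compares against the previous revision
        "oldid": int(query["oldid"][0]),  # revision ID of the edit
    }


link = (
    "https://en.wikipedia.org/w/index.php"
    "?title=C/2023_A3_(Tsuchinshan–ATLAS)&diff=prev&oldid=1251441412"
)
edit = parse_diff_link(link)
print(edit["title"], edit["diff"], edit["oldid"])
```

A link that lacks any of these three parameters raises a KeyError in this sketch, which can serve as a quick sanity check before passing the link to --diff_link.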
To test any custom input, which might not necessarily be a real edit:
python run_model.py --model_name_or_path edisum_100 --input_text <your_input_text>
For optimal performance, the input text should follow the format of the training data:
- Edit diff should be represented by collecting sentences that were altered, added or removed during the edit into two sets: previous (belonging to the previous revision of the page) and current sentences (belonging to the current revision of the page)
- Previous sentences should contain each sentence that was removed from the previous revision, and the previous-revision versions of the sentences that were altered
- Current sentences should contain each sentence that was added to the current revision, and the current-revision versions of the sentences that were altered
- The input is then formed by concatenating the sentences in previous sentences, separated by <sent_sep>, and adding the prefix <old_text>. Similarly, the sentences in current sentences are separated by the same <sent_sep>, and the prefix <new_text> is added. The final input is derived by concatenating these two representations.
Example:
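As a minimal sketch of the formatting rules above, the snippet below assembles an input string from two lists of sentences. The helper name build_input and the sample sentences are hypothetical, and joining the tokens without surrounding whitespace is an assumption; compare against the training data to confirm the exact spacing.

```python
OLD_PREFIX = "<old_text>"
NEW_PREFIX = "<new_text>"
SENT_SEP = "<sent_sep>"


def build_input(previous_sentences, current_sentences):
    """Concatenate previous and current sentences into one model input.

    Assumption: tokens are joined with no extra whitespace. This is an
    illustrative helper, not code from the repository.
    """
    old_part = OLD_PREFIX + SENT_SEP.join(previous_sentences)
    new_part = NEW_PREFIX + SENT_SEP.join(current_sentences)
    return old_part + new_part


# Made-up edit: one sentence altered, one removed, one added.
previous = ["The bridge opened in 1990.", "It is painted red."]
current = ["The bridge opened in 1991.", "It carries road traffic."]

print(build_input(previous, current))
```

The resulting string can then be passed to run_model.py through --input_text, quoted so that the shell treats it as a single argument.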
We also provide a Jupyter notebook for experimentation with custom inputs: playground.ipynb
This project is licensed under the terms of the MIT license.