diff --git a/README.md b/README.md
index cf8cfd17c..a19ac6f41 100644
--- a/README.md
+++ b/README.md
@@ -2,8 +2,13 @@ This code produces the non-anonymized version of the CNN / Daily Mail summarizat
 
 # Instructions
 
-1. Download and unzip the `stories` directories from [here](http://cs.nyu.edu/~kcho/DMQA/) for both CNN and Daily Mail.
-2. We will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following command to your bash_profile:
+## 1. Download data
+Download and unzip the `stories` directories from [here](http://cs.nyu.edu/~kcho/DMQA/) for both CNN and Daily Mail.
+
+**Warning:** These files contain a few (114, in a dataset of over 300,000) examples for which the article text is missing - see for example `cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story`. The [Tensorflow code](https://github.com/abisee/pointer-generator) has been updated to discard these examples.
+
+## 2. Download Stanford CoreNLP
+We will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following command to your bash_profile:
 ```
 export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
 ```
@@ -20,7 +25,8 @@ text
 .
 PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
 ```
-3. Run
+## 3. Process into .bin and vocab files
+Run
 ```
 python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories
 ```
@@ -28,5 +34,5 @@ replacing `/path/to/cnn/stories` with the path to where you saved the `cnn/stori
 
 This script will do several things:
 * The directories `cnn_stories_tokenized` and `dm_stories_tokenized` will be created and filled with tokenized versions of `cnn/stories` and `dailymail/stories`. This may take some time.
-* For each of the url lists `all_train.txt`, `all_val.txt` and `all_test.txt`, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files `train.bin`, `val.bin` and `test.bin`, will be placed in the newly-created `finished_files` directory. This may take some time.
+* For each of the url lists `all_train.txt`, `all_val.txt` and `all_test.txt`, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files `train.bin`, `val.bin` and `test.bin`. These will be placed in the newly-created `finished_files` directory. This may take some time.
 * Additionally, a `vocab` file is created from the training data. This is also placed in `finished_files`.
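
A quick way to sanity-check the `finished_files` output described in step 3 is to read a few records back. The sketch below is a hypothetical `check_bin.py`, not part of this repo: it assumes each record in a `.bin` file is an 8-byte length prefix followed by a serialized `tf.train.Example` with `article` and `abstract` byte features, which is the layout the linked pointer-generator code reads; adjust it if the actual format differs.

```
# check_bin.py -- a hypothetical sanity check, not part of this repo.
# Assumption: each record in a finished_files/*.bin file is an 8-byte length
# prefix followed by a serialized tf.train.Example holding 'article' and
# 'abstract' byte features (the layout the linked pointer-generator code reads).
import struct
import sys

import tensorflow as tf


def iter_examples(path):
    """Yield (article, abstract) text pairs from a length-prefixed .bin file."""
    with open(path, "rb") as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # reached end of file
            str_len = struct.unpack("q", len_bytes)[0]
            example = tf.train.Example.FromString(reader.read(str_len))
            article = example.features.feature["article"].bytes_list.value[0]
            abstract = example.features.feature["abstract"].bytes_list.value[0]
            yield article.decode(), abstract.decode()


if __name__ == "__main__":
    # Usage: python check_bin.py finished_files/val.bin
    count = sum(1 for _ in iter_examples(sys.argv[1]))
    print("%d examples read from %s" % (count, sys.argv[1]))
```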