Commit c870f4b by abisee, May 4, 2017 (merge of parents 2a03c32 and dd5423a). One file changed: README.md, 10 additions and 4 deletions.

This code produces the non-anonymized version of the CNN / Daily Mail summarization dataset.

# Instructions

## 1. Download data
Download and unzip the `stories` directories from [here](http://cs.nyu.edu/~kcho/DMQA/) for both CNN and Daily Mail.

**Warning:** These files contain a few examples (114, out of a dataset of over 300,000) for which the article text is missing; see for example `cnn/stories/72aba2f58178f2d19d3fae89d5f3e9a4686bc4bb.story`. The [Tensorflow code](https://github.com/abisee/pointer-generator) has been updated to discard these examples.
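
If you process the data with your own code rather than the updated Tensorflow code, you may want to filter these examples out yourself. Below is a minimal sketch; the helper name is made up, and it assumes the `.story` layout in which the article text comes before the first `@highlight` line:
```python
import os

def has_article_text(story_path):
    """Return True if the .story file has any article text before the
    first '@highlight' marker (the 114 bad examples do not)."""
    with open(story_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("@highlight"):
                return False  # reached the highlights without seeing article text
            if line:
                return True   # found a non-empty article line
    return False              # empty file

stories_dir = "/path/to/cnn/stories"  # adjust to where you unzipped the data
empty = [name for name in os.listdir(stories_dir)
         if name.endswith(".story")
         and not has_article_text(os.path.join(stories_dir, name))]
print("%d stories have no article text" % len(empty))
```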

## 2. Download Stanford CoreNLP
We will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following line to your `.bash_profile`:
```
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
```
You can check if it's working by running
```
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer
```
You should see something like:
```
Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
```
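
For reference, here is a minimal sketch of calling PTBTokenizer from Python via `subprocess`. It is illustrative only: `make_datafiles.py` tokenizes the story files in batch rather than one string at a time, and this assumes the `CLASSPATH` export above is in effect.
```python
import subprocess

def ptb_tokenize(text):
    """Pipe a string through Stanford CoreNLP's PTBTokenizer and return
    the tokens. Tokens are printed one per line on stdout; the timing
    message normally goes to stderr."""
    result = subprocess.run(
        ["java", "edu.stanford.nlp.process.PTBTokenizer"],
        input=text, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

print(ptb_tokenize("Please tokenize this text."))
# Expected: ['Please', 'tokenize', 'this', 'text', '.']
```
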
## 3. Process into .bin and vocab files
Run
```
python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories
```
replacing `/path/to/cnn/stories` with the path to the `cnn/stories` directory you downloaded, and similarly for `dailymail/stories`.

This script will do several things:
* The directories `cnn_stories_tokenized` and `dm_stories_tokenized` will be created and filled with tokenized versions of `cnn/stories` and `dailymail/stories`. This may take some time.
* For each of the url lists `all_train.txt`, `all_val.txt` and `all_test.txt`, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files `train.bin`, `val.bin` and `test.bin`. These will be placed in the newly-created `finished_files` directory. This may take some time. (A sketch for reading these `.bin` files follows this list.)
* Additionally, a `vocab` file is created from the training data. This is also placed in `finished_files`.
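
The `.bin` format itself is not described here; the sketch below assumes the layout expected by the accompanying pointer-generator code, i.e. each record is an 8-byte length (written with `struct.pack('q', ...)`) followed by a serialized `tf.Example` whose `article` and `abstract` features hold the text. Treat it as illustrative:
```python
import struct
from tensorflow.core.example import example_pb2

def read_bin(path):
    """Yield (article, abstract) string pairs from a finished_files .bin file."""
    with open(path, "rb") as f:
        while True:
            len_bytes = f.read(8)
            if not len_bytes:
                break  # end of file
            str_len = struct.unpack("q", len_bytes)[0]
            example = example_pb2.Example.FromString(f.read(str_len))
            article = example.features.feature["article"].bytes_list.value[0]
            abstract = example.features.feature["abstract"].bytes_list.value[0]
            yield article.decode(), abstract.decode()

# Example: print the first abstract in the validation set.
for article, abstract in read_bin("finished_files/val.bin"):
    print(abstract[:100])
    break
```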
