From 6165f8441773713adead7e8422bc09c87f6f1f4e Mon Sep 17 00:00:00 2001 From: VictorSanh Date: Sat, 3 Nov 2018 09:18:44 -0400 Subject: [PATCH] Update README.md --- README.md | 802 ++---------------------------------------------------- 1 file changed, 18 insertions(+), 784 deletions(-) diff --git a/README.md b/README.md index b6c8341550f..a6cc29a70f9 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,11 @@ -# BERT +# PyTorch implementation of Google AI's BERT + ## Introduction This is a PyTorch implementation of the [TensorFlow code](https://github.com/google-research/bert) released by Google AI with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). + ## Converting the TensorFlow pre-trained models to Pytorch You can convert the pre-trained weights released by GoogleAI by calling the script `convert_tf_checkpoint_to_pytorch.py`. @@ -21,6 +23,7 @@ python convert_tf_checkpoint_to_pytorch.py \ --pytorch_dump_path=$BERT_PYTORCH_DIR/pytorch_model.bin ``` + ## Fine-tuning with BERT: running the examples We showcase the same examples as in the original implementation: fine-tuning on the MRPC classification corpus and the question answering dataset SQUAD. @@ -53,6 +56,7 @@ python run_classifier_pytorch.py \ --output_dir /tmp/mrpc_output_pytorch/ ``` +The next example fine-tunes `BERT-Base` on the SQuAD question answering task. The data for SQuAD can be downloaded with the following links and should be saved in a `$SQUAD_DIR` directory. * [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) @@ -62,6 +66,7 @@ The data for SQuAD can be downloaded with the following links and should be save ```shell export SQUAD_DIR=/path/to/SQUAD + python run_squad_pytorch.py \ --vocab_file=$BERT_BASE_DIR/vocab.txt \ --bert_config_file=$BERT_BASE_DIR/bert_config.json \ @@ -75,797 +80,26 @@ python run_squad_pytorch.py \ --num_train_epochs=2.0 \ --max_seq_length=384 \ --doc_stride=128 \ - --output_dir=/tmp/squad_base_pytorch/ -``` - - - - - -## Introduction - -**BERT**, or **B**idirectional **E**mbedding **R**epresentations from -**T**ransformers, is a new method of pre-training language representations which -obtains state-of-the-art results on a wide array of Natural Language Processing -(NLP) tasks. - -Our academic paper which describes BERT in detail and provides full results on a -number of tasks can be found here: -[https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805). - -To give a few numbers, here are the results on the -[SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) question answering -task: - -SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 -------------------------------------- | :------: | :------: -1st Place Ensemble - BERT | **87.4** | **93.2** -2nd Place Ensemble - nlnet | 86.0 | 91.7 -1st Place Single Model - BERT | **85.1** | **91.8** -2nd Place Single Model - nlnet | 83.5 | 90.1 - -And several natural language inference tasks: - -System | MultiNLI | Question NLI | SWAG ------------------------ | :------: | :----------: | :------: -BERT | **86.7** | **91.1** | **86.3** -OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 - -Plus many other tasks. - -Moreover, these results were all obtained with almost no task-specific neural -network architecture design. 
- -If you already know what BERT is and you just want to get started, you can -[download the pre-trained models](#pre-trained-models) and -[run a state-of-the-art fine-tuning](#fine-tuning-with-bert) in only a few -minutes. - -## What is BERT? - -BERT is method of pre-training language representations, meaning that we train a -general-purpose "language understanding" model on a large text corpus (like -Wikipedia), and then use that model for downstream NLP tasks that we are about -(like question answering). BERT outperforms previous methods because it is the -first *unsupervised*, *deeply bidirectional* system for pre-training NLP. - -*Unsupervised* means that BERT was trained using only a plain text corpus, which -is important because an enormous amount of plain text data is publicly available -on the web in many languages. - -Pre-trained representations can also either be *context-free* or *contextual*, -and contextual representations can further be *unidirectional* or -*bidirectional*. Context-free models such as -[word2vec](https://www.tensorflow.org/tutorials/representation/word2vec) or -[GloVe](https://nlp.stanford.edu/projects/glove/) generate a single "word -embedding" representation for each word in the vocabulary, so `bank` would have -the same representation in `bank deposit` and `river bank`. Contextual models -instead generate a representation of each word that is based on the other words -in the sentence. - -BERT was built upon recent work in pre-training contextual representations — -including [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432), -[Generative Pre-Training](https://blog.openai.com/language-unsupervised/), -[ELMo](https://allennlp.org/elmo), and -[ULMFit](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html) -— but crucially these models are all *unidirectional* or *shallowly -bidirectional*. This means that each word is only contextualized using the words -to its left (or right). For example, in the sentence `I made a bank deposit` the -unidirectional representation of `bank` is only based on `I made a` but not -`deposit`. Some previous work does combine the representations from separate -left-context and right-context models, but only in a "shallow" manner. BERT -represents "bank" using both its left and right context — `I made a ... deposit` -— starting from the very bottom of a deep neural network, so it is *deeply -bidirectional*. - -BERT uses a simple approach for this: We mask out 15% of the words in the input, -run the entire sequence through a deep bidirectional -[Transformer](https://arxiv.org/abs/1706.03762) encoder, and then predict only -the masked words. For example: - -``` -Input: the man went to the [MASK1] . he bought a [MASK2] of milk. -Labels: [MASK1] = store; [MASK2] = gallon -``` - -In order to learn relationships between sentences, we also train on a simple -task which can be generated from any monolingual corpus: Given two sentences `A` -and `B`, is `B` the actual next sentence that comes after `A`, or just a random -sentence from the corpus? - -``` -Sentence A: the man went to the store . -Sentence B: he bought a gallon of milk . -Label: IsNextSentence -``` - + --output_dir=../debug_squad/ ``` -Sentence A: the man went to the store . -Sentence B: penguins are flightless . -Label: NotNextSentence -``` - -We then train a large model (12-layer to 24-layer Transformer) on a large corpus -(Wikipedia + [BookCorpus](http://yknzhu.wixsite.com/mbweb)) for a long time (1M -update steps), and that's BERT. 
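To make the masked-LM and next-sentence objectives described above concrete, here is a small, self-contained Python sketch of how a masked-LM training example could be built from whitespace tokens. It is only an illustration under assumptions: the real preprocessing (Google's `create_pretraining_data.py`) works on WordPiece tokens, adds `[CLS]`/`[SEP]`, and handles details this sketch omits, and the function and variable names here are made up.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=12345):
    """Toy masked-LM example builder: hide roughly 15% of tokens and remember them."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}  # position -> original token the model should predict
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = token
            masked[i] = "[MASK]"
    return masked, labels

tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens)
print(masked)  # roughly 15% of positions replaced by [MASK]
print(labels)  # e.g. {5: 'store', 10: 'gallon'} if those positions were drawn

# The next-sentence objective only needs sentence pairs plus a binary label, e.g.
# ("the man went to the store .", "penguins are flightless .", "NotNextSentence").
```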
- -Using BERT has two stages: *Pre-training* and *fine-tuning*. - -**Pre-training** is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a -one-time procedure for each language (current models are English-only, but -multilingual models will be released in the near future). We are releasing a -number of pre-trained models from the paper which were pre-trained at Google. -Most NLP researchers will never need to pre-train their own model from scratch. - -**Fine-tuning** is inexpensive. All of the results in the paper can be -replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, -starting from the exact same pre-trained model. SQuAD, for example, can be -trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of -91.0%, which is the single system state-of-the-art. - -The other important aspect of BERT is that it can be adapted to many types of -NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on -sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level -(e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific -modifications. - -## What has been released in this repository? - -We are releasing the following: - -* TensorFlow code for the BERT model architecture (which is mostly a standard - [Transformer](https://arxiv.org/abs/1706.03762) architecture). -* Pre-trained checkpoints for both the lowercase and cased version of - `BERT-Base` and `BERT-Large` from the paper. -* TensorFlow code for push-button replication of the most important - fine-tuning experiments from the paper, including SQuAD, MultiNLI, and MRPC. - -All of the code in this repository works out-of-the-box with CPU, GPU, and Cloud -TPU. - -## Pre-trained models -We are releasing the `BERT-Base` and `BERT-Large` models from the paper. -`Uncased` means that the text has been lowercased before WordPiece tokenization, -e.g., `John Smith` becomes `john smith`. The `Uncased` model also strips out any -accent markers. `Cased` means that the true case and accent markers are -preserved. Typically, the `Uncased` model is better unless you know that case -information is important for your task (e.g., Named Entity Recognition or -Part-of-Speech tagging). - -These models are all released under the same license as the source code (Apache -2.0). - -The links to the models are here (right-cick, 'Save link as...' on the name): - -* **[`BERT-Base, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)**: - 12-layer, 768-hidden, 12-heads, 110M parameters -* **[`BERT-Large, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)**: - 24-layer, 1024-hidden, 16-heads, 340M parameters -* **[`BERT-Base, Cased`](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip)**: - 12-layer, 768-hidden, 12-heads , 110M parameters -* **`BERT-Large, Cased`**: 24-layer, 1024-hidden, 16-heads, 340M parameters - (Not available yet. Needs to be re-generated). - -Each .zip file contains three items: - -* A TensorFlow checkpoint (`bert_model.ckpt`) containing the pre-trained - weights (which is actually 3 files). -* A vocab file (`vocab.txt`) to map WordPiece to word id. -* A config file (`bert_config.json`) which specifies the hyperparameters of - the model. - -## Fine-tuning with BERT - -**Important**: All results on the paper were fine-tuned on a single Cloud TPU, -which has 64GB of RAM. 
It is currently not possible to re-produce most of the -`BERT-Large` results on the paper using a GPU with 12GB - 16GB of RAM, because -the maximum batch size that can fit in memory is too small. We are working on -adding code to this repository which allows for much larger effective batch size -on the GPU. See the section on [out-of-memory issues](#out-of-memory-issues) for -more details. - -This code was tested with TensorFlow 1.11.0. It was tested with Python2 and -Python3 (but more thoroughly with Python2, since this is what's used internally -in Google). - -The fine-tuning examples which use `BERT-Base` should be able to run on a GPU -that has at least 12GB of RAM using the hyperparameters given. - -### Fine-tuning with Cloud TPUs - -Most of the examples below assumes that you will be running training/evaluation -on your local machine, using a GPU like a Titan X or GTX 1080. - -However, if you have access to a Cloud TPU that you want to train on, just add -the following flags to `run_classifier.py` or `run_squad.py`: - -``` - --use_tpu=True \ - --tpu_name=$TPU_NAME -``` - -Please see the -[Google Cloud TPU tutorial](https://cloud.google.com/tpu/docs/tutorials/mnist) -for how to use Cloud TPUs. - -On Cloud TPUs, the pretrained model and the output directory will need to be on -Google Cloud Storage. For example, if you have a bucket named `some_bucket`, you -might use the following flags instead: - -``` - --output_dir=gs://some_bucket/my_output_dir/ -``` - -The unzipped pre-trained model files can also be found in the Google Cloud -Storage folder `gs://bert_models/2018_10_18`. For example: - -``` -export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12 -``` - -### Sentence (and sentence-pair) classification tasks - -Before running this example you must download the -[GLUE data](https://gluebenchmark.com/tasks) by running -[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) -and unpack it to some directory `$GLUE_DIR`. Next, download the `BERT-Base` -checkpoint and unzip it to some directory `$BERT_BASE_DIR`. - -This example code fine-tunes `BERT-Base` on the Microsoft Research Paraphrase -Corpus (MRPC) corpus, which only contains 3,600 examples and can fine-tune in a -few minutes on most GPUs. - -```shell -export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 -export GLUE_DIR=/path/to/glue - -python run_classifier.py \ - --task_name=MRPC \ - --do_train=true \ - --do_eval=true \ - --data_dir=$GLUE_DIR/MRPC \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --max_seq_length=128 \ - --train_batch_size=32 \ - --learning_rate=2e-5 \ - --num_train_epochs=3.0 \ - --output_dir=/tmp/mrpc_output/ -``` - -You should see output like this: - -``` -***** Eval results ***** - eval_accuracy = 0.845588 - eval_loss = 0.505248 - global_step = 343 - loss = 0.505248 -``` - -This means that the Dev set accuracy was 84.55%. Small sets like MRPC have a -high variance in the Dev set accuracy, even when starting from the same -pre-training checkpoint. If you re-run multiple times (making sure to point to -different `output_dir`), you should see results between 84% and 88%. - -A few other pre-trained models are implemented off-the-shelf in -`run_classifier.py`, so it should be straightforward to follow those examples to -use BERT for any single-sentence or sentence-pair classification task. - -Note: You might see a message `Running train on CPU`. 
This really just means -that it's running on something other than a Cloud TPU, which includes a GPU. - -### SQuAD - -The Stanford Question Answering Dataset (SQuAD) is a popular question answering -benchmark dataset. BERT (at the time of the release) obtains state-of-the-art -results on SQuAD with almost no task-specific network architecture modifications -or data augmentation. However, it does require semi-complex data pre-processing -and post-processing to deal with (a) the variable-length nature of SQuAD context -paragraphs, and (b) the character-level answer annotations which are used for -SQuAD training. This processing is implemented and documented in `run_squad.py`. - -To run on SQuAD, you will first need to download the dataset. The -[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) does not seem to -link to the v1.1 datasets any longer, but the necessary files can be found here: - -* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) -* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) -* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) - -Download these to some directory `$SQUAD_DIR`. - -The state-of-the-art SQuAD results from the paper currently cannot be reproduced -on a 12GB-16GB GPU due to memory constraints (in fact, even batch size 1 does -not seem to fit on a 12GB GPU using `BERT-Large`). However, a reasonably strong -`BERT-Base` model can be trained on the GPU with these hyperparameters: - -```shell -python run_squad.py \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --do_train=True \ - --train_file=$SQUAD_DIR/train-v1.1.json \ - --do_predict=True \ - --predict_file=$SQUAD_DIR/dev-v1.1.json \ - --train_batch_size=12 \ - --learning_rate=5e-5 \ - --num_train_epochs=2.0 \ - --max_seq_length=384 \ - --doc_stride=128 \ - --output_dir=/tmp/squad_base/ -``` - -The dev set predictions will be saved into a file called `predictions.json` in -the `output_dir`: - -```shell -python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json -``` - -Which should produce an output like this: - -```shell -{"f1": 88.41249612335034, "exact_match": 81.2488174077578} -``` -You should see a result similar to the 88.5% reported in the paper for -`BERT-Base`. +## Comparing TensorFlow and PyTorch models -If you have access to a Cloud TPU, you can train with `BERT-Large`. Here is a -set of hyperparameters (slightly different than the paper) which consistently -obtain around 90.5%-91.0% F1 single-system trained only on SQuAD: +We also include [a small Notebook](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/Comparing%20TF%20and%20PT%20models.ipynb) we used to verify that the conversion of the weights to PyTorch are consistent with the original TensorFlow weights. +Please follow the instructions in the Notebook to run it. 
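Independently of the Notebook, a quick spot-check of a converted checkpoint is to compare a single tensor from both files. The sketch below is only an illustration under assumptions: the TensorFlow variable name is the one used in Google's released checkpoints, but the key inside `pytorch_model.bin` is a guess about how the conversion script names it, so adjust both names and paths to your setup.

```python
import numpy as np
import tensorflow as tf
import torch

# Hypothetical paths; point these at your own checkpoint and converted dump.
TF_CKPT = "uncased_L-12_H-768_A-12/bert_model.ckpt"
PT_BIN = "uncased_L-12_H-768_A-12/pytorch_model.bin"

# Word-embedding matrix from the original TensorFlow checkpoint.
tf_weight = tf.train.load_variable(TF_CKPT, "bert/embeddings/word_embeddings")

# The same tensor from the converted PyTorch state dict (key name is an assumption).
state_dict = torch.load(PT_BIN, map_location="cpu")
pt_weight = state_dict["embeddings.word_embeddings.weight"].numpy()

# A faithful conversion should give a near-zero difference here.
print("max abs diff:", np.abs(tf_weight - pt_weight).max())
```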
-```shell -python run_squad.py \ - --vocab_file=$BERT_LARGE_DIR/vocab.txt \ - --bert_config_file=$BERT_LARGE_DIR/bert_config.json \ - --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ - --do_train=True \ - --train_file=$SQUAD_DIR/train-v1.1.json \ - --do_predict=True \ - --predict_file=$SQUAD_DIR/dev-v1.1.json \ - --train_batch_size=48 \ - --learning_rate=5e-5 \ - --num_train_epochs=2.0 \ - --max_seq_length=384 \ - --doc_stride=128 \ - --output_dir=gs://some_bucket/squad_large/ \ - --use_tpu=True \ - --tpu_name=$TPU_NAME -``` - -For example, one random run with these parameters produces the following Dev -scores: - -```shell -{"f1": 90.87081895814865, "exact_match": 84.38978240302744} -``` - -If you fine-tune for one epoch on -[TriviaQA](http://nlp.cs.washington.edu/triviaqa/) before this the results will -be even better, but you will need to convert TriviaQA into the SQuAD json -format. - -### Out-of-memory issues - -All experiments in the paper were fine-tuned on a Cloud TPU, which has 64GB of -device RAM. Therefore, when using a GPU with 12GB - 16GB of RAM, you are likely -to encounter out-of-memory issues if you use the same hyperparameters described -in the paper. - -The factors that affect memory usage are: - -* **`max_seq_length`**: The released models were trained with sequence lengths - up to 512, but you can fine-tune with a shorter max sequence length to save - substantial memory. This is controlled by the `max_seq_length` flag in our - example code. - -* **`train_batch_size`**: The memory usage is also directly proportional to - the batch size. - -* **Model type, `BERT-Base` vs. `BERT-Large`**: The `BERT-Large` model - requires significantly more memory than `BERT-Base`. - -* **Optimizer**: The default optimizer for BERT is Adam, which requires a lot - of extra memory to store the `m` and `v` vectors. Switching to a more memory - efficient optimizer can reduce memory usage, but can also affect the - results. We have not experimented with other optimizers for fine-tuning. - -Using the default training scripts (`run_classifier.py` and `run_squad.py`), we -benchmarked the maximum batch size on single Titan X GPU (12GB RAM) with -TensorFlow 1.11.0: - -System | Seq Length | Max Batch Size ------------- | ---------- | -------------- -`BERT-Base` | 64 | 64 -... | 128 | 32 -... | 256 | 16 -... | 320 | 14 -... | 384 | 12 -... | 512 | 6 -`BERT-Large` | 64 | 12 -... | 128 | 6 -... | 256 | 2 -... | 320 | 1 -... | 384 | 0 -... | 512 | 0 - -Unfortunately, these max batch sizes for `BERT-Large` are so small that they -will actually harm the model accuracy, regardless of the learning rate used. We -are working on adding code to this repository which will allow much larger -effective batch sizes to be used on the GPU. The code will be based on one (or -both) of the following techniques: - -* **Gradient accumulation**: The samples in a minibatch are typically - independent with respect to gradient computation (excluding batch - normalization, which is not used here). This means that the gradients of - multiple smaller minibatches can be accumulated before performing the weight - update, and this will be exactly equivalent to a single larger update. - -* [**Gradient checkpointing**](https://github.com/openai/gradient-checkpointing): - The major use of GPU/TPU memory during DNN training is caching the - intermediate activations in the forward pass that are necessary for - efficient computation in the backward pass. 
"Gradient checkpointing" trades - memory for compute time by re-computing the activations in an intelligent - way. - -**However, this is not implemented in the current release.** - -## Using BERT to extract fixed feature vectors (like ELMo) - -In certain cases, rather than fine-tuning the entire pre-trained model -end-to-end, it can be beneficial to obtained *pre-trained contextual -embeddings*, which are fixed contextual representations of each input token -generated from the hidden layers of the pre-trained model. This should also -mitigate most of the out-of-memory issues. - -As an example, we include the script `extract_features.py` which can be used -like this: - -```shell -# Sentence A and Sentence B are separated by the ||| delimiter. -# For single sentence inputs, don't use the delimiter. -echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt - -python extract_features.py \ - --input_file=/tmp/input.txt \ - --output_file=/tmp/output.jsonl \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --layers=-1,-2,-3,-4 \ - --max_seq_length=128 \ - --batch_size=8 -``` - -This will create a JSON file (one line per line of input) containing the BERT -activations from each Transformer layer specified by `layers` (-1 is the final -hidden layer of the Transformer, etc.) - -Note that this script will produce very large output files (by default, around -15kb for every input token). - -If you need to maintain alignment between the original and tokenized words (for -projecting training labels), see the [Tokenization](#tokenization) section -below. - -## Tokenization - -For sentence-level tasks (or sentence-pair) tasks, tokenization is very simple. -Just follow the example code in `run_classifier.py` and `extract_features.py`. -The basic procedure for sentence-level tasks is: - -1. Instantiate an instance of `tokenizer = tokenization.FullTokenizer` - -2. Tokenize the raw text with `tokens = tokenizer.tokenize(raw_text)`. - -3. Truncate to the maximum sequence length. (You can use up to 512, but you - probably want to use shorter if possible for memory and speed reasons.) - -4. Add the `[CLS]` and `[SEP]` tokens in the right place. - -Word-level and span-level tasks (e.g., SQuAD and NER) are more complex, since -you need to maintain alignment between your input text and output text so that -you can project your training labels. SQuAD is a particularly complex example -because the input labels are *character*-based, and SQuAD paragraphs are often -longer than our maximum sequence length. See the code in `run_squad.py` to show -how we handle this. - -Before we describe the general recipe for handling word-level tasks, it's -important to understand what exactly our tokenizer is doing. It has three main -steps: - -1. **Text normalization**: Convert all whitespace characters to spaces, and - (for the `Uncased` model) lowercase the input and strip out accent markers. - E.g., `John Johanson's, → john johanson's,`. - -2. **Punctuation splitting**: Split *all* punctuation characters on both sides - (i.e., add whitespace around all punctuation characters). Punctuation - characters are defined as (a) Anything with a `P*` Unicode class, (b) any - non-letter/number/space ASCII character (e.g., characters like `$` which are - technically not punctuation). E.g., `john johanson's, → john johanson ' s ,` - -3. 
**WordPiece tokenization**: Apply whitespace tokenization to the output of - the above procedure, and apply - [WordPiece](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py) - tokenization to each token separately. (Our implementation is directly based - on the one from `tensor2tensor`, which is linked). E.g., `john johanson ' s - , → john johan ##son ' s ,` - -The advantage of this scheme is that it is "compatible" with most existing -English tokenizers. For example, imagine that you have a part-of-speech tagging -task which looks like this: - -``` -Input: John Johanson 's house -Labels: NNP NNP POS NN -``` - -The tokenized output will look like this: - -``` -Tokens: john johan ##son ' s house -``` - -Crucially, this would be the same output as if the raw text were `John -Johanson's house` (with no space before the `'s`). - -If you have a pre-tokenized representation with word-level annotations, you can -simply tokenize each input word independently, and deterministically maintain an -original-to-tokenized alignment: - -```python -### Input -orig_tokens = ["John", "Johanson", "'s", "house"] -labels = ["NNP", "NNP", "POS", "NN"] - -### Output -bert_tokens = [] - -# Token map will be an int -> int mapping between the `orig_tokens` index and -# the `bert_tokens` index. -orig_to_tok_map = [] - -tokenizer = tokenization.FullTokenizer( - vocab_file=vocab_file, do_lower_case=True) - -bert_tokens.append("[CLS]") -for orig_token in orig_tokens: - orig_to_tok_map.append(len(bert_tokens)) - bert_tokens.extend(tokenizer.tokenize(orig_token)) -bert_tokens.append("[SEP]") - -# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"] -# orig_to_tok_map == [1, 2, 4, 6] -``` - -Now `orig_to_tok_map` can be used to project `labels` to the tokenized -representation. - -There are common English tokenization schemes which will cause a slight mismatch -between how BERT was pre-trained. For example, if your input tokenization splits -off contractions like `do n't`, this will cause a mismatch. If it is possible to -do so, you should pre-process your data to convert these back to raw-looking -text, but if it's not possible, this mismatch is likely not a big deal. - -## Pre-training with BERT - -We are releasing code to do "masked LM" and "next sentence prediction" on an -arbitrary text corpus. Note that this is *not* the exact code that was used for -the paper (the original code was written in C++, and had some additional -complexity), but this code does generate pre-training data as described in the -paper. - -Here's how to run the data generation. The input is a plain text file, with one -sentence per line. (It is important that these be actual sentences for the "next -sentence prediction" task). Documents are delimited by empty lines. The output -is a set of `tf.train.Example`s serialized into `TFRecord` file format. - -This script stores all of the examples for the entire input file in memory, so -for large data files you should shard the input file and call the script -multiple times. (You can pass in a file glob to `run_pretraining.py`, e.g., -`tf_examples.tf_record*`.) - -The `max_predictions_per_seq` is the maximum number of masked LM predictions per -sequence. You should set this to around `max_seq_length` * `masked_lm_prob` (the -script doesn't do that automatically because the exact value needs to be passed -to both scripts). 
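As a quick sanity check of that rule of thumb (just the arithmetic, not an official formula), the values used in the commands below give:

```python
import math

max_seq_length = 128
masked_lm_prob = 0.15

# 128 * 0.15 = 19.2; rounding up is one way to land on the 20 passed to both scripts below.
print(math.ceil(max_seq_length * masked_lm_prob))  # 20
```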
- -```shell -python create_pretraining_data.py \ - --input_file=./sample_text.txt \ - --output_file=/tmp/tf_examples.tfrecord \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --do_lower_case=True \ - --max_seq_length=128 \ - --max_predictions_per_seq=20 \ - --masked_lm_prob=0.15 \ - --random_seed=12345 \ - --dupe_factor=5 -``` - -Here's how to run the pre-training. Do not include `init_checkpoint` if you are -pre-training from scratch. The model configuration (including vocab size) is -specified in `bert_config_file`. This demo code only pre-trains for a small -number of steps (20), but in practice you will probably want to set -`num_train_steps` to 10000 steps or more. The `max_seq_length` and -`max_predictions_per_seq` parameters passed to `run_pretraining.py` must be the -same as `create_pretraining_data.py`. - -```shell -python run_pretraining.py \ - --input_file=/tmp/tf_examples.tfrecord \ - --output_dir=/tmp/pretraining_output \ - --do_train=True \ - --do_eval=True \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --train_batch_size=32 \ - --max_seq_length=128 \ - --max_predictions_per_seq=20 \ - --num_train_steps=20 \ - --num_warmup_steps=10 \ - --learning_rate=2e-5 -``` - -This will produce an output like this: - -``` -***** Eval results ***** - global_step = 20 - loss = 0.0979674 - masked_lm_accuracy = 0.985479 - masked_lm_loss = 0.0979328 - next_sentence_accuracy = 1.0 - next_sentence_loss = 3.45724e-05 -``` - -Note that since our `sample_text.txt` file is very small, this example training -will overfit that data in only a few steps and produce unrealistically high -accuracy numbers. - -### Pre-training tips and caveats - -* If your task has a large domain-specific corpus available (e.g., "movie - reviews" or "scientific papers"), it will likely be beneficial to run - additional steps of pre-training on your corpus, starting from the BERT - checkpoint. -* The learning rate we used in the paper was 1e-4. However, if you are doing - additional steps of pre-training starting from an existing BERT checkpoint, - you should use a smaller learning rate (e.g., 2e-5). -* Current BERT models are English-only, but we do plan to release a - multilingual model which has been pre-trained on a lot of languages in the - near future (hopefully by the end of November 2018). -* Longer sequences are disproportionately expensive because attention is - quadratic to the sequence length. In other words, a batch of 64 sequences of - length 512 is much more expensive than a batch of 256 sequences of - length 128. The fully-connected/convolutional cost is the same, but the - attention cost is far greater for the 512-length sequences. Therefore, one - good recipe is to pre-train for, say, 90,000 steps with a sequence length of - 128 and then for 10,000 additional steps with a sequence length of 512. The - very long sequences are mostly needed to learn positional embeddings, which - can be learned fairly quickly. Note that this does require generating the - data twice with different values of `max_seq_length`. -* If you are pre-training from scratch, be prepared that pre-training is - computationally expensive, especially on GPUs. If you are pre-training from - scratch, our recommended recipe is to pre-train a `BERT-Base` on a single - [preemptable Cloud TPU v2](https://cloud.google.com/tpu/docs/pricing), which - takes about 2 weeks at a cost of about $500 USD (based on the pricing in - October 2018). 
You will have to scale down the batch size when only training - on a single Cloud TPU, compared to what was used in the paper. It is - recommended to use the largest batch size that fits into TPU memory. - -### Pre-training data - -We will **not** be able to release the pre-processed datasets used in the paper. -For Wikipedia, the recommended pre-processing is to download -[the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), -extract the text with -[`WikiExtractor.py`](https://github.com/attardi/wikiextractor), and then apply -any necessary cleanup to convert it into plain text. - -Unfortunately the researchers who collected the -[BookCorpus](http://yknzhu.wixsite.com/mbweb) no longer have it available for -public download. The -[Project Guttenberg Dataset](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) -is a somewhat smaller (200M word) collection of older books that are public -domain. - -[Common Crawl](http://commoncrawl.org/) is another very large collection of -text, but you will likely have to do substantial pre-processing and cleanup to -extract a usuable corpus for pre-training BERT. - -### Learning a new WordPiece vocabulary - -This repository does not include code for *learning* a new WordPiece vocabulary. -The reason is that the code used in the paper was implemented in C++ with -dependencies on Google's internal libraries. For English, it is almost always -better to just start with our vocabulary and pre-trained models. For learning -vocabularies of other languages, there are a number of open source options -available. However, keep in mind that these are not compatible with our -`tokenization.py` library: - -* [Google's SentencePiece library](https://github.com/google/sentencepiece) - -* [tensor2tensor's WordPiece generation script](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder_build_subword.py) - -* [Rico Sennrich's Byte Pair Encoding library](https://github.com/rsennrich/subword-nmt) - -## Using BERT in Colab - -If you want to use BERT with [Colab](https://colab.sandbox.google.com), you can -get started with the notebook -"[BERT FineTuning with Cloud TPUs](https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)". -**At the time of this writing (October 31st, 2018), Colab users can access a -Cloud TPU completely for free.** Note: One per user, availability limited, -requires a Google Cloud Platform account with storage (although storage may be -purchased with free credit for signing up with GCP), and this capability may not -longer be available in the future. Click on the BERT Colab that was just linked -for more information. - -## FAQ - -#### Is this code compatible with Cloud TPUs? What about GPUs? - -Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and -Cloud TPU. However, GPU training is single-GPU only. - -#### I am getting out-of-memory errors, what is wrong? - -See the section on [out-of-memory issues](#out-of-memory-issues) for more -information. - -#### Is there a PyTorch version available? - -There is no official PyTorch implementation. If someone creates a line-for-line -PyTorch reimplementation so that our pre-trained checkpoints can be directly -converted, we would be happy to link to that PyTorch version here. - -#### Will models in other languages be released? - -Yes, we plan to release a multi-lingual BERT model in the near future. 
We cannot
-make promises about exactly which languages will be included, but it will likely
-be a single model which includes *most* of the languages which have a
-significantly-sized Wikipedia.
-
-#### Will models larger than `BERT-Large` be released?
-
-So far we have not attempted to train anything larger than `BERT-Large`. It is
-possible that we will release larger models if we are able to obtain significant
-improvements.
-
-#### What license is this library released under?
-
-All code *and* models are released under the Apache 2.0 license. See the
-`LICENSE` file for more information.
-
-#### How do I cite BERT?
-
-For now, cite [the Arxiv paper](https://arxiv.org/abs/1810.04805):
-
-```
-@article{devlin2018bert,
-  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
-  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
-  journal={arXiv preprint arXiv:1810.04805},
-  year={2018}
-}
-```
-If we submit the paper to a conference or journal, we will update the BibTeX.
+## Note on pre-training

-## Disclaimer
+The original TensorFlow code also releases two scripts for pre-training BERT: [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py) and [run_pretraining.py](https://github.com/google-research/bert/blob/master/run_pretraining.py).
+As the authors note, pre-training BERT is particularly expensive and requires a TPU to run in a reasonable amount of time (see [here](https://github.com/google-research/bert#pre-training-with-bert)).

-This is not an official Google product.
+We have decided **not** to port these scripts for now and to wait for TPU support in PyTorch (see the recent [official announcement](https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud)).

-## Contact information
-For help or issues using BERT, please submit a GitHub issue.
+## Requirements

-For personal communication related to BERT, please contact Jacob Devlin
-(`jacobdevlin@google.com`), Ming-Wei Chang (`mingweichang@google.com`), or
-Kenton Lee (`kentonl@google.com`).
+The main dependencies of this code are:
+- PyTorch (>= 0.4.0)
+- tqdm
\ No newline at end of file
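For reference, one way to set these dependencies up (assuming a Python 3 environment with `pip`; any torch build at or above 0.4.0 should satisfy the requirement listed above):

```shell
pip install "torch>=0.4.0" tqdm
```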