From 6165f8441773713adead7e8422bc09c87f6f1f4e Mon Sep 17 00:00:00 2001 From: VictorSanh Date: Sat, 3 Nov 2018 09:18:44 -0400 Subject: [PATCH] Update README.md --- README.md | 802 ++---------------------------------------------------- 1 file changed, 18 insertions(+), 784 deletions(-) diff --git a/README.md b/README.md index b6c8341550f..a6cc29a70f9 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,11 @@ -# BERT +# PyTorch implementation of Google AI's BERT + ## Introduction This is a PyTorch implementation of the [TensorFlow code](https://github.com/google-research/bert) released by Google AI with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). + ## Converting the TensorFlow pre-trained models to Pytorch You can convert the pre-trained weights released by GoogleAI by calling the script `convert_tf_checkpoint_to_pytorch.py`. @@ -21,6 +23,7 @@ python convert_tf_checkpoint_to_pytorch.py \ --pytorch_dump_path=$BERT_PYTORCH_DIR/pytorch_model.bin ``` + ## Fine-tuning with BERT: running the examples We showcase the same examples as in the original implementation: fine-tuning on the MRPC classification corpus and the question answering dataset SQUAD. @@ -53,6 +56,7 @@ python run_classifier_pytorch.py \ --output_dir /tmp/mrpc_output_pytorch/ ``` +The next example fine-tunes `BERT-Base` on the SQuAD question answering task. The data for SQuAD can be downloaded with the following links and should be saved in a `$SQUAD_DIR` directory. * [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) @@ -62,6 +66,7 @@ The data for SQuAD can be downloaded with the following links and should be save ```shell export SQUAD_DIR=/path/to/SQUAD + python run_squad_pytorch.py \ --vocab_file=$BERT_BASE_DIR/vocab.txt \ --bert_config_file=$BERT_BASE_DIR/bert_config.json \ @@ -75,797 +80,26 @@ python run_squad_pytorch.py \ --num_train_epochs=2.0 \ --max_seq_length=384 \ --doc_stride=128 \ - --output_dir=/tmp/squad_base_pytorch/ -``` - - - - - -## Introduction - -**BERT**, or **B**idirectional **E**mbedding **R**epresentations from -**T**ransformers, is a new method of pre-training language representations which -obtains state-of-the-art results on a wide array of Natural Language Processing -(NLP) tasks. - -Our academic paper which describes BERT in detail and provides full results on a -number of tasks can be found here: -[https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805). - -To give a few numbers, here are the results on the -[SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) question answering -task: - -SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1 -------------------------------------- | :------: | :------: -1st Place Ensemble - BERT | **87.4** | **93.2** -2nd Place Ensemble - nlnet | 86.0 | 91.7 -1st Place Single Model - BERT | **85.1** | **91.8** -2nd Place Single Model - nlnet | 83.5 | 90.1 - -And several natural language inference tasks: - -System | MultiNLI | Question NLI | SWAG ------------------------ | :------: | :----------: | :------: -BERT | **86.7** | **91.1** | **86.3** -OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0 - -Plus many other tasks. - -Moreover, these results were all obtained with almost no task-specific neural -network architecture design. 
- -If you already know what BERT is and you just want to get started, you can -[download the pre-trained models](#pre-trained-models) and -[run a state-of-the-art fine-tuning](#fine-tuning-with-bert) in only a few -minutes. - -## What is BERT? - -BERT is method of pre-training language representations, meaning that we train a -general-purpose "language understanding" model on a large text corpus (like -Wikipedia), and then use that model for downstream NLP tasks that we are about -(like question answering). BERT outperforms previous methods because it is the -first *unsupervised*, *deeply bidirectional* system for pre-training NLP. - -*Unsupervised* means that BERT was trained using only a plain text corpus, which -is important because an enormous amount of plain text data is publicly available -on the web in many languages. - -Pre-trained representations can also either be *context-free* or *contextual*, -and contextual representations can further be *unidirectional* or -*bidirectional*. Context-free models such as -[word2vec](https://www.tensorflow.org/tutorials/representation/word2vec) or -[GloVe](https://nlp.stanford.edu/projects/glove/) generate a single "word -embedding" representation for each word in the vocabulary, so `bank` would have -the same representation in `bank deposit` and `river bank`. Contextual models -instead generate a representation of each word that is based on the other words -in the sentence. - -BERT was built upon recent work in pre-training contextual representations — -including [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432), -[Generative Pre-Training](https://blog.openai.com/language-unsupervised/), -[ELMo](https://allennlp.org/elmo), and -[ULMFit](http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html) -— but crucially these models are all *unidirectional* or *shallowly -bidirectional*. This means that each word is only contextualized using the words -to its left (or right). For example, in the sentence `I made a bank deposit` the -unidirectional representation of `bank` is only based on `I made a` but not -`deposit`. Some previous work does combine the representations from separate -left-context and right-context models, but only in a "shallow" manner. BERT -represents "bank" using both its left and right context — `I made a ... deposit` -— starting from the very bottom of a deep neural network, so it is *deeply -bidirectional*. - -BERT uses a simple approach for this: We mask out 15% of the words in the input, -run the entire sequence through a deep bidirectional -[Transformer](https://arxiv.org/abs/1706.03762) encoder, and then predict only -the masked words. For example: - -``` -Input: the man went to the [MASK1] . he bought a [MASK2] of milk. -Labels: [MASK1] = store; [MASK2] = gallon -``` - -In order to learn relationships between sentences, we also train on a simple -task which can be generated from any monolingual corpus: Given two sentences `A` -and `B`, is `B` the actual next sentence that comes after `A`, or just a random -sentence from the corpus? - -``` -Sentence A: the man went to the store . -Sentence B: he bought a gallon of milk . -Label: IsNextSentence -``` - + --output_dir=../debug_squad/ ``` -Sentence A: the man went to the store . -Sentence B: penguins are flightless . -Label: NotNextSentence -``` - -We then train a large model (12-layer to 24-layer Transformer) on a large corpus -(Wikipedia + [BookCorpus](http://yknzhu.wixsite.com/mbweb)) for a long time (1M -update steps), and that's BERT. 
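To make the masked-LM and next-sentence objectives described above concrete, here is a small, self-contained Python sketch of how a masked-LM training example could be built from whitespace tokens. It is only an illustration under assumptions: the real preprocessing (Google's `create_pretraining_data.py`) works on WordPiece tokens, adds `[CLS]`/`[SEP]`, and handles details this sketch omits, and the function and variable names here are made up.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=12345):
    """Toy masked-LM example builder: hide roughly 15% of tokens and remember them."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}  # position -> original token the model should predict
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = token
            masked[i] = "[MASK]"
    return masked, labels

tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens)
print(masked)  # roughly 15% of positions replaced by [MASK]
print(labels)  # e.g. {5: 'store', 10: 'gallon'} if those positions were drawn

# The next-sentence objective only needs sentence pairs plus a binary label, e.g.
# ("the man went to the store .", "penguins are flightless .", "NotNextSentence").
```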
- -Using BERT has two stages: *Pre-training* and *fine-tuning*. - -**Pre-training** is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a -one-time procedure for each language (current models are English-only, but -multilingual models will be released in the near future). We are releasing a -number of pre-trained models from the paper which were pre-trained at Google. -Most NLP researchers will never need to pre-train their own model from scratch. - -**Fine-tuning** is inexpensive. All of the results in the paper can be -replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, -starting from the exact same pre-trained model. SQuAD, for example, can be -trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of -91.0%, which is the single system state-of-the-art. - -The other important aspect of BERT is that it can be adapted to many types of -NLP tasks very easily. In the paper, we demonstrate state-of-the-art results on -sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level -(e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific -modifications. - -## What has been released in this repository? - -We are releasing the following: - -* TensorFlow code for the BERT model architecture (which is mostly a standard - [Transformer](https://arxiv.org/abs/1706.03762) architecture). -* Pre-trained checkpoints for both the lowercase and cased version of - `BERT-Base` and `BERT-Large` from the paper. -* TensorFlow code for push-button replication of the most important - fine-tuning experiments from the paper, including SQuAD, MultiNLI, and MRPC. - -All of the code in this repository works out-of-the-box with CPU, GPU, and Cloud -TPU. - -## Pre-trained models -We are releasing the `BERT-Base` and `BERT-Large` models from the paper. -`Uncased` means that the text has been lowercased before WordPiece tokenization, -e.g., `John Smith` becomes `john smith`. The `Uncased` model also strips out any -accent markers. `Cased` means that the true case and accent markers are -preserved. Typically, the `Uncased` model is better unless you know that case -information is important for your task (e.g., Named Entity Recognition or -Part-of-Speech tagging). - -These models are all released under the same license as the source code (Apache -2.0). - -The links to the models are here (right-cick, 'Save link as...' on the name): - -* **[`BERT-Base, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)**: - 12-layer, 768-hidden, 12-heads, 110M parameters -* **[`BERT-Large, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)**: - 24-layer, 1024-hidden, 16-heads, 340M parameters -* **[`BERT-Base, Cased`](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip)**: - 12-layer, 768-hidden, 12-heads , 110M parameters -* **`BERT-Large, Cased`**: 24-layer, 1024-hidden, 16-heads, 340M parameters - (Not available yet. Needs to be re-generated). - -Each .zip file contains three items: - -* A TensorFlow checkpoint (`bert_model.ckpt`) containing the pre-trained - weights (which is actually 3 files). -* A vocab file (`vocab.txt`) to map WordPiece to word id. -* A config file (`bert_config.json`) which specifies the hyperparameters of - the model. - -## Fine-tuning with BERT - -**Important**: All results on the paper were fine-tuned on a single Cloud TPU, -which has 64GB of RAM. 
It is currently not possible to re-produce most of the -`BERT-Large` results on the paper using a GPU with 12GB - 16GB of RAM, because -the maximum batch size that can fit in memory is too small. We are working on -adding code to this repository which allows for much larger effective batch size -on the GPU. See the section on [out-of-memory issues](#out-of-memory-issues) for -more details. - -This code was tested with TensorFlow 1.11.0. It was tested with Python2 and -Python3 (but more thoroughly with Python2, since this is what's used internally -in Google). - -The fine-tuning examples which use `BERT-Base` should be able to run on a GPU -that has at least 12GB of RAM using the hyperparameters given. - -### Fine-tuning with Cloud TPUs - -Most of the examples below assumes that you will be running training/evaluation -on your local machine, using a GPU like a Titan X or GTX 1080. - -However, if you have access to a Cloud TPU that you want to train on, just add -the following flags to `run_classifier.py` or `run_squad.py`: - -``` - --use_tpu=True \ - --tpu_name=$TPU_NAME -``` - -Please see the -[Google Cloud TPU tutorial](https://cloud.google.com/tpu/docs/tutorials/mnist) -for how to use Cloud TPUs. - -On Cloud TPUs, the pretrained model and the output directory will need to be on -Google Cloud Storage. For example, if you have a bucket named `some_bucket`, you -might use the following flags instead: - -``` - --output_dir=gs://some_bucket/my_output_dir/ -``` - -The unzipped pre-trained model files can also be found in the Google Cloud -Storage folder `gs://bert_models/2018_10_18`. For example: - -``` -export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12 -``` - -### Sentence (and sentence-pair) classification tasks - -Before running this example you must download the -[GLUE data](https://gluebenchmark.com/tasks) by running -[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) -and unpack it to some directory `$GLUE_DIR`. Next, download the `BERT-Base` -checkpoint and unzip it to some directory `$BERT_BASE_DIR`. - -This example code fine-tunes `BERT-Base` on the Microsoft Research Paraphrase -Corpus (MRPC) corpus, which only contains 3,600 examples and can fine-tune in a -few minutes on most GPUs. - -```shell -export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 -export GLUE_DIR=/path/to/glue - -python run_classifier.py \ - --task_name=MRPC \ - --do_train=true \ - --do_eval=true \ - --data_dir=$GLUE_DIR/MRPC \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --max_seq_length=128 \ - --train_batch_size=32 \ - --learning_rate=2e-5 \ - --num_train_epochs=3.0 \ - --output_dir=/tmp/mrpc_output/ -``` - -You should see output like this: - -``` -***** Eval results ***** - eval_accuracy = 0.845588 - eval_loss = 0.505248 - global_step = 343 - loss = 0.505248 -``` - -This means that the Dev set accuracy was 84.55%. Small sets like MRPC have a -high variance in the Dev set accuracy, even when starting from the same -pre-training checkpoint. If you re-run multiple times (making sure to point to -different `output_dir`), you should see results between 84% and 88%. - -A few other pre-trained models are implemented off-the-shelf in -`run_classifier.py`, so it should be straightforward to follow those examples to -use BERT for any single-sentence or sentence-pair classification task. - -Note: You might see a message `Running train on CPU`. 
This really just means -that it's running on something other than a Cloud TPU, which includes a GPU. - -### SQuAD - -The Stanford Question Answering Dataset (SQuAD) is a popular question answering -benchmark dataset. BERT (at the time of the release) obtains state-of-the-art -results on SQuAD with almost no task-specific network architecture modifications -or data augmentation. However, it does require semi-complex data pre-processing -and post-processing to deal with (a) the variable-length nature of SQuAD context -paragraphs, and (b) the character-level answer annotations which are used for -SQuAD training. This processing is implemented and documented in `run_squad.py`. - -To run on SQuAD, you will first need to download the dataset. The -[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) does not seem to -link to the v1.1 datasets any longer, but the necessary files can be found here: - -* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) -* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) -* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) - -Download these to some directory `$SQUAD_DIR`. - -The state-of-the-art SQuAD results from the paper currently cannot be reproduced -on a 12GB-16GB GPU due to memory constraints (in fact, even batch size 1 does -not seem to fit on a 12GB GPU using `BERT-Large`). However, a reasonably strong -`BERT-Base` model can be trained on the GPU with these hyperparameters: - -```shell -python run_squad.py \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --do_train=True \ - --train_file=$SQUAD_DIR/train-v1.1.json \ - --do_predict=True \ - --predict_file=$SQUAD_DIR/dev-v1.1.json \ - --train_batch_size=12 \ - --learning_rate=5e-5 \ - --num_train_epochs=2.0 \ - --max_seq_length=384 \ - --doc_stride=128 \ - --output_dir=/tmp/squad_base/ -``` - -The dev set predictions will be saved into a file called `predictions.json` in -the `output_dir`: - -```shell -python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json -``` - -Which should produce an output like this: - -```shell -{"f1": 88.41249612335034, "exact_match": 81.2488174077578} -``` -You should see a result similar to the 88.5% reported in the paper for -`BERT-Base`. +## Comparing TensorFlow and PyTorch models -If you have access to a Cloud TPU, you can train with `BERT-Large`. Here is a -set of hyperparameters (slightly different than the paper) which consistently -obtain around 90.5%-91.0% F1 single-system trained only on SQuAD: +We also include [a small Notebook](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/Comparing%20TF%20and%20PT%20models.ipynb) we used to verify that the conversion of the weights to PyTorch are consistent with the original TensorFlow weights. +Please follow the instructions in the Notebook to run it. 
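Independently of the Notebook, a quick spot-check of a converted checkpoint is to compare a single tensor from both files. The sketch below is only an illustration under assumptions: the TensorFlow variable name is the one used in Google's released checkpoints, but the key inside `pytorch_model.bin` is a guess about how the conversion script names it, so adjust both names and paths to your setup.

```python
import numpy as np
import tensorflow as tf
import torch

# Hypothetical paths; point these at your own checkpoint and converted dump.
TF_CKPT = "uncased_L-12_H-768_A-12/bert_model.ckpt"
PT_BIN = "uncased_L-12_H-768_A-12/pytorch_model.bin"

# Word-embedding matrix from the original TensorFlow checkpoint.
tf_weight = tf.train.load_variable(TF_CKPT, "bert/embeddings/word_embeddings")

# The same tensor from the converted PyTorch state dict (key name is an assumption).
state_dict = torch.load(PT_BIN, map_location="cpu")
pt_weight = state_dict["embeddings.word_embeddings.weight"].numpy()

# A faithful conversion should give a near-zero difference here.
print("max abs diff:", np.abs(tf_weight - pt_weight).max())
```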
-```shell -python run_squad.py \ - --vocab_file=$BERT_LARGE_DIR/vocab.txt \ - --bert_config_file=$BERT_LARGE_DIR/bert_config.json \ - --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \ - --do_train=True \ - --train_file=$SQUAD_DIR/train-v1.1.json \ - --do_predict=True \ - --predict_file=$SQUAD_DIR/dev-v1.1.json \ - --train_batch_size=48 \ - --learning_rate=5e-5 \ - --num_train_epochs=2.0 \ - --max_seq_length=384 \ - --doc_stride=128 \ - --output_dir=gs://some_bucket/squad_large/ \ - --use_tpu=True \ - --tpu_name=$TPU_NAME -``` - -For example, one random run with these parameters produces the following Dev -scores: - -```shell -{"f1": 90.87081895814865, "exact_match": 84.38978240302744} -``` - -If you fine-tune for one epoch on -[TriviaQA](http://nlp.cs.washington.edu/triviaqa/) before this the results will -be even better, but you will need to convert TriviaQA into the SQuAD json -format. - -### Out-of-memory issues - -All experiments in the paper were fine-tuned on a Cloud TPU, which has 64GB of -device RAM. Therefore, when using a GPU with 12GB - 16GB of RAM, you are likely -to encounter out-of-memory issues if you use the same hyperparameters described -in the paper. - -The factors that affect memory usage are: - -* **`max_seq_length`**: The released models were trained with sequence lengths - up to 512, but you can fine-tune with a shorter max sequence length to save - substantial memory. This is controlled by the `max_seq_length` flag in our - example code. - -* **`train_batch_size`**: The memory usage is also directly proportional to - the batch size. - -* **Model type, `BERT-Base` vs. `BERT-Large`**: The `BERT-Large` model - requires significantly more memory than `BERT-Base`. - -* **Optimizer**: The default optimizer for BERT is Adam, which requires a lot - of extra memory to store the `m` and `v` vectors. Switching to a more memory - efficient optimizer can reduce memory usage, but can also affect the - results. We have not experimented with other optimizers for fine-tuning. - -Using the default training scripts (`run_classifier.py` and `run_squad.py`), we -benchmarked the maximum batch size on single Titan X GPU (12GB RAM) with -TensorFlow 1.11.0: - -System | Seq Length | Max Batch Size ------------- | ---------- | -------------- -`BERT-Base` | 64 | 64 -... | 128 | 32 -... | 256 | 16 -... | 320 | 14 -... | 384 | 12 -... | 512 | 6 -`BERT-Large` | 64 | 12 -... | 128 | 6 -... | 256 | 2 -... | 320 | 1 -... | 384 | 0 -... | 512 | 0 - -Unfortunately, these max batch sizes for `BERT-Large` are so small that they -will actually harm the model accuracy, regardless of the learning rate used. We -are working on adding code to this repository which will allow much larger -effective batch sizes to be used on the GPU. The code will be based on one (or -both) of the following techniques: - -* **Gradient accumulation**: The samples in a minibatch are typically - independent with respect to gradient computation (excluding batch - normalization, which is not used here). This means that the gradients of - multiple smaller minibatches can be accumulated before performing the weight - update, and this will be exactly equivalent to a single larger update. - -* [**Gradient checkpointing**](https://github.com/openai/gradient-checkpointing): - The major use of GPU/TPU memory during DNN training is caching the - intermediate activations in the forward pass that are necessary for - efficient computation in the backward pass. 
"Gradient checkpointing" trades - memory for compute time by re-computing the activations in an intelligent - way. - -**However, this is not implemented in the current release.** - -## Using BERT to extract fixed feature vectors (like ELMo) - -In certain cases, rather than fine-tuning the entire pre-trained model -end-to-end, it can be beneficial to obtained *pre-trained contextual -embeddings*, which are fixed contextual representations of each input token -generated from the hidden layers of the pre-trained model. This should also -mitigate most of the out-of-memory issues. - -As an example, we include the script `extract_features.py` which can be used -like this: - -```shell -# Sentence A and Sentence B are separated by the ||| delimiter. -# For single sentence inputs, don't use the delimiter. -echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt - -python extract_features.py \ - --input_file=/tmp/input.txt \ - --output_file=/tmp/output.jsonl \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --layers=-1,-2,-3,-4 \ - --max_seq_length=128 \ - --batch_size=8 -``` - -This will create a JSON file (one line per line of input) containing the BERT -activations from each Transformer layer specified by `layers` (-1 is the final -hidden layer of the Transformer, etc.) - -Note that this script will produce very large output files (by default, around -15kb for every input token). - -If you need to maintain alignment between the original and tokenized words (for -projecting training labels), see the [Tokenization](#tokenization) section -below. - -## Tokenization - -For sentence-level tasks (or sentence-pair) tasks, tokenization is very simple. -Just follow the example code in `run_classifier.py` and `extract_features.py`. -The basic procedure for sentence-level tasks is: - -1. Instantiate an instance of `tokenizer = tokenization.FullTokenizer` - -2. Tokenize the raw text with `tokens = tokenizer.tokenize(raw_text)`. - -3. Truncate to the maximum sequence length. (You can use up to 512, but you - probably want to use shorter if possible for memory and speed reasons.) - -4. Add the `[CLS]` and `[SEP]` tokens in the right place. - -Word-level and span-level tasks (e.g., SQuAD and NER) are more complex, since -you need to maintain alignment between your input text and output text so that -you can project your training labels. SQuAD is a particularly complex example -because the input labels are *character*-based, and SQuAD paragraphs are often -longer than our maximum sequence length. See the code in `run_squad.py` to show -how we handle this. - -Before we describe the general recipe for handling word-level tasks, it's -important to understand what exactly our tokenizer is doing. It has three main -steps: - -1. **Text normalization**: Convert all whitespace characters to spaces, and - (for the `Uncased` model) lowercase the input and strip out accent markers. - E.g., `John Johanson's, → john johanson's,`. - -2. **Punctuation splitting**: Split *all* punctuation characters on both sides - (i.e., add whitespace around all punctuation characters). Punctuation - characters are defined as (a) Anything with a `P*` Unicode class, (b) any - non-letter/number/space ASCII character (e.g., characters like `$` which are - technically not punctuation). E.g., `john johanson's, → john johanson ' s ,` - -3. 
**WordPiece tokenization**: Apply whitespace tokenization to the output of - the above procedure, and apply - [WordPiece](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py) - tokenization to each token separately. (Our implementation is directly based - on the one from `tensor2tensor`, which is linked). E.g., `john johanson ' s - , → john johan ##son ' s ,` - -The advantage of this scheme is that it is "compatible" with most existing -English tokenizers. For example, imagine that you have a part-of-speech tagging -task which looks like this: - -``` -Input: John Johanson 's house -Labels: NNP NNP POS NN -``` - -The tokenized output will look like this: - -``` -Tokens: john johan ##son ' s house -``` - -Crucially, this would be the same output as if the raw text were `John -Johanson's house` (with no space before the `'s`). - -If you have a pre-tokenized representation with word-level annotations, you can -simply tokenize each input word independently, and deterministically maintain an -original-to-tokenized alignment: - -```python -### Input -orig_tokens = ["John", "Johanson", "'s", "house"] -labels = ["NNP", "NNP", "POS", "NN"] - -### Output -bert_tokens = [] - -# Token map will be an int -> int mapping between the `orig_tokens` index and -# the `bert_tokens` index. -orig_to_tok_map = [] - -tokenizer = tokenization.FullTokenizer( - vocab_file=vocab_file, do_lower_case=True) - -bert_tokens.append("[CLS]") -for orig_token in orig_tokens: - orig_to_tok_map.append(len(bert_tokens)) - bert_tokens.extend(tokenizer.tokenize(orig_token)) -bert_tokens.append("[SEP]") - -# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"] -# orig_to_tok_map == [1, 2, 4, 6] -``` - -Now `orig_to_tok_map` can be used to project `labels` to the tokenized -representation. - -There are common English tokenization schemes which will cause a slight mismatch -between how BERT was pre-trained. For example, if your input tokenization splits -off contractions like `do n't`, this will cause a mismatch. If it is possible to -do so, you should pre-process your data to convert these back to raw-looking -text, but if it's not possible, this mismatch is likely not a big deal. - -## Pre-training with BERT - -We are releasing code to do "masked LM" and "next sentence prediction" on an -arbitrary text corpus. Note that this is *not* the exact code that was used for -the paper (the original code was written in C++, and had some additional -complexity), but this code does generate pre-training data as described in the -paper. - -Here's how to run the data generation. The input is a plain text file, with one -sentence per line. (It is important that these be actual sentences for the "next -sentence prediction" task). Documents are delimited by empty lines. The output -is a set of `tf.train.Example`s serialized into `TFRecord` file format. - -This script stores all of the examples for the entire input file in memory, so -for large data files you should shard the input file and call the script -multiple times. (You can pass in a file glob to `run_pretraining.py`, e.g., -`tf_examples.tf_record*`.) - -The `max_predictions_per_seq` is the maximum number of masked LM predictions per -sequence. You should set this to around `max_seq_length` * `masked_lm_prob` (the -script doesn't do that automatically because the exact value needs to be passed -to both scripts). 
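As a quick sanity check of that rule of thumb (just the arithmetic, not an official formula), the values used in the commands below give:

```python
import math

max_seq_length = 128
masked_lm_prob = 0.15

# 128 * 0.15 = 19.2; rounding up is one way to land on the 20 passed to both scripts below.
print(math.ceil(max_seq_length * masked_lm_prob))  # 20
```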
- -```shell -python create_pretraining_data.py \ - --input_file=./sample_text.txt \ - --output_file=/tmp/tf_examples.tfrecord \ - --vocab_file=$BERT_BASE_DIR/vocab.txt \ - --do_lower_case=True \ - --max_seq_length=128 \ - --max_predictions_per_seq=20 \ - --masked_lm_prob=0.15 \ - --random_seed=12345 \ - --dupe_factor=5 -``` - -Here's how to run the pre-training. Do not include `init_checkpoint` if you are -pre-training from scratch. The model configuration (including vocab size) is -specified in `bert_config_file`. This demo code only pre-trains for a small -number of steps (20), but in practice you will probably want to set -`num_train_steps` to 10000 steps or more. The `max_seq_length` and -`max_predictions_per_seq` parameters passed to `run_pretraining.py` must be the -same as `create_pretraining_data.py`. - -```shell -python run_pretraining.py \ - --input_file=/tmp/tf_examples.tfrecord \ - --output_dir=/tmp/pretraining_output \ - --do_train=True \ - --do_eval=True \ - --bert_config_file=$BERT_BASE_DIR/bert_config.json \ - --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ - --train_batch_size=32 \ - --max_seq_length=128 \ - --max_predictions_per_seq=20 \ - --num_train_steps=20 \ - --num_warmup_steps=10 \ - --learning_rate=2e-5 -``` - -This will produce an output like this: - -``` -***** Eval results ***** - global_step = 20 - loss = 0.0979674 - masked_lm_accuracy = 0.985479 - masked_lm_loss = 0.0979328 - next_sentence_accuracy = 1.0 - next_sentence_loss = 3.45724e-05 -``` - -Note that since our `sample_text.txt` file is very small, this example training -will overfit that data in only a few steps and produce unrealistically high -accuracy numbers. - -### Pre-training tips and caveats - -* If your task has a large domain-specific corpus available (e.g., "movie - reviews" or "scientific papers"), it will likely be beneficial to run - additional steps of pre-training on your corpus, starting from the BERT - checkpoint. -* The learning rate we used in the paper was 1e-4. However, if you are doing - additional steps of pre-training starting from an existing BERT checkpoint, - you should use a smaller learning rate (e.g., 2e-5). -* Current BERT models are English-only, but we do plan to release a - multilingual model which has been pre-trained on a lot of languages in the - near future (hopefully by the end of November 2018). -* Longer sequences are disproportionately expensive because attention is - quadratic to the sequence length. In other words, a batch of 64 sequences of - length 512 is much more expensive than a batch of 256 sequences of - length 128. The fully-connected/convolutional cost is the same, but the - attention cost is far greater for the 512-length sequences. Therefore, one - good recipe is to pre-train for, say, 90,000 steps with a sequence length of - 128 and then for 10,000 additional steps with a sequence length of 512. The - very long sequences are mostly needed to learn positional embeddings, which - can be learned fairly quickly. Note that this does require generating the - data twice with different values of `max_seq_length`. -* If you are pre-training from scratch, be prepared that pre-training is - computationally expensive, especially on GPUs. If you are pre-training from - scratch, our recommended recipe is to pre-train a `BERT-Base` on a single - [preemptable Cloud TPU v2](https://cloud.google.com/tpu/docs/pricing), which - takes about 2 weeks at a cost of about $500 USD (based on the pricing in - October 2018). 
You will have to scale down the batch size when only training - on a single Cloud TPU, compared to what was used in the paper. It is - recommended to use the largest batch size that fits into TPU memory. - -### Pre-training data - -We will **not** be able to release the pre-processed datasets used in the paper. -For Wikipedia, the recommended pre-processing is to download -[the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), -extract the text with -[`WikiExtractor.py`](https://github.com/attardi/wikiextractor), and then apply -any necessary cleanup to convert it into plain text. - -Unfortunately the researchers who collected the -[BookCorpus](http://yknzhu.wixsite.com/mbweb) no longer have it available for -public download. The -[Project Guttenberg Dataset](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) -is a somewhat smaller (200M word) collection of older books that are public -domain. - -[Common Crawl](http://commoncrawl.org/) is another very large collection of -text, but you will likely have to do substantial pre-processing and cleanup to -extract a usuable corpus for pre-training BERT. - -### Learning a new WordPiece vocabulary - -This repository does not include code for *learning* a new WordPiece vocabulary. -The reason is that the code used in the paper was implemented in C++ with -dependencies on Google's internal libraries. For English, it is almost always -better to just start with our vocabulary and pre-trained models. For learning -vocabularies of other languages, there are a number of open source options -available. However, keep in mind that these are not compatible with our -`tokenization.py` library: - -* [Google's SentencePiece library](https://github.com/google/sentencepiece) - -* [tensor2tensor's WordPiece generation script](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder_build_subword.py) - -* [Rico Sennrich's Byte Pair Encoding library](https://github.com/rsennrich/subword-nmt) - -## Using BERT in Colab - -If you want to use BERT with [Colab](https://colab.sandbox.google.com), you can -get started with the notebook -"[BERT FineTuning with Cloud TPUs](https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)". -**At the time of this writing (October 31st, 2018), Colab users can access a -Cloud TPU completely for free.** Note: One per user, availability limited, -requires a Google Cloud Platform account with storage (although storage may be -purchased with free credit for signing up with GCP), and this capability may not -longer be available in the future. Click on the BERT Colab that was just linked -for more information. - -## FAQ - -#### Is this code compatible with Cloud TPUs? What about GPUs? - -Yes, all of the code in this repository works out-of-the-box with CPU, GPU, and -Cloud TPU. However, GPU training is single-GPU only. - -#### I am getting out-of-memory errors, what is wrong? - -See the section on [out-of-memory issues](#out-of-memory-issues) for more -information. - -#### Is there a PyTorch version available? - -There is no official PyTorch implementation. If someone creates a line-for-line -PyTorch reimplementation so that our pre-trained checkpoints can be directly -converted, we would be happy to link to that PyTorch version here. - -#### Will models in other languages be released? - -Yes, we plan to release a multi-lingual BERT model in the near future. 
We cannot
-make promises about exactly which languages will be included, but it will likely
-be a single model which includes *most* of the languages which have a
-significantly-sized Wikipedia.
-
-#### Will models larger than `BERT-Large` be released?
-
-So far we have not attempted to train anything larger than `BERT-Large`. It is
-possible that we will release larger models if we are able to obtain significant
-improvements.
-
-#### What license is this library released under?
-
-All code *and* models are released under the Apache 2.0 license. See the
-`LICENSE` file for more information.
-
-#### How do I cite BERT?
-
-For now, cite [the Arxiv paper](https://arxiv.org/abs/1810.04805):
-
-```
-@article{devlin2018bert,
-  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
-  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
-  journal={arXiv preprint arXiv:1810.04805},
-  year={2018}
-}
-```
-If we submit the paper to a conference or journal, we will update the BibTeX.
+## Note on pre-training

-## Disclaimer
+The original TensorFlow code also releases two scripts for pre-training BERT: [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py) and [run_pretraining.py](https://github.com/google-research/bert/blob/master/run_pretraining.py).
+As the authors note, pre-training BERT is particularly expensive and requires a TPU to run in a reasonable amount of time (see [here](https://github.com/google-research/bert#pre-training-with-bert)).

-This is not an official Google product.
+We have decided **not** to port these scripts for now and to wait for TPU support in PyTorch (see the recent [official announcement](https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud)).

-## Contact information
-For help or issues using BERT, please submit a GitHub issue.
+## Requirements

-For personal communication related to BERT, please contact Jacob Devlin
-(`jacobdevlin@google.com`), Ming-Wei Chang (`mingweichang@google.com`), or
-Kenton Lee (`kentonl@google.com`).
+The main dependencies of this code are:
+- PyTorch (>= 0.4.0)
+- tqdm
\ No newline at end of file
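For reference, one way to set these dependencies up (assuming a Python 3 environment with `pip`; any torch build at or above 0.4.0 should satisfy the requirement listed above):

```shell
pip install "torch>=0.4.0" tqdm
```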