Language modeling is the task of predicting the next word or character in a document.
* Indicates models using dynamic evaluation.
A common evaluation dataset for language modeling is the Penn Treebank,
as pre-processed by Mikolov et al. (2010).
The dataset consists of 929k training words, 73k validation words, and
82k test words. As part of the pre-processing, words were lower-cased, numbers
were replaced with N, newlines were replaced with <eos>,
and all other punctuation was removed. The vocabulary is
the most frequent 10k words with the rest of the tokens replaced by an <unk>
token.
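The pre-processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the original script of Mikolov et al.: the helper names (`normalize`, `build_vocab`, `replace_rare`) and the exact regular expressions are assumptions, and the text does not say how ties or special tokens are handled when picking the 10k most frequent words.

```python
import re
from collections import Counter


def normalize(line):
    """Lower-case a line, map numbers to N, drop punctuation, append <eos>."""
    line = line.lower()
    # Assumption: a "number" is a run of digits, optionally with . or , inside.
    line = re.sub(r"\d+(?:[.,]\d+)*", "N", line)
    # Assumption: everything except letters, the N placeholder, and whitespace
    # counts as punctuation and is removed.
    line = re.sub(r"[^a-zN\s]", " ", line)
    # The end of each line is marked with an <eos> token.
    return line.split() + ["<eos>"]


def build_vocab(tokens, max_size=10_000):
    """Keep the max_size most frequent tokens."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(max_size)}


def replace_rare(tokens, vocab):
    """Replace every out-of-vocabulary token with <unk>."""
    return [tok if tok in vocab else "<unk>" for tok in tokens]
```

In practice the vocabulary would be built from the training split only and then applied unchanged to the validation and test splits.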
Models are evaluated based on perplexity, which is the exponential of the average
negative per-word log-probability (lower is better).
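As a small illustration, perplexity is conventionally computed by exponentiating the average negative log-probability that the model assigns to each token. The sketch below assumes natural-log probabilities; the function name `perplexity` is hypothetical.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)); lower values are better.
    """
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_prob = -sum(log_probs) / len(log_probs)
    return math.exp(avg_neg_log_prob)
```

A model that assigns every token probability 1/4 has a perplexity of about 4, and a uniform guess over a 10k-word vocabulary has a perplexity of about 10,000.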
Model | Validation perplexity | Test perplexity | Paper / Source |
---|---|---|---|
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | Dynamic Evaluation of Neural Sequence Models |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | Regularizing and Optimizing LSTM Language Models |
AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | Direct Output Connection for a High-Rank Language Model |
AWD-LSTM-MoS (Yang et al., 2018) | 56.54 | 54.44 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | Regularizing and Optimizing LSTM Language Models |
WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2 consists of around 2 million words extracted from Wikipedia articles.
Model | Validation perplexity | Test perplexity | Paper / Source |
---|---|---|---|
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | Dynamic Evaluation of Neural Sequence Models |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | Regularizing and Optimizing LSTM Language Models |
AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | Direct Output Connection for a High-Rank Language Model |
AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | Regularizing and Optimizing LSTM Language Models |
The WikiText-103 corpus contains 267,735 unique words, and each word occurs at least three times in the training set.
Model | Validation perplexity | Test perplexity | Paper / Source | Code |
---|---|---|---|---|
LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | Fast Parametric Learning with Activation Memorization | |
LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | Fast Parametric Learning with Activation Memorization | |
LSTM (Rae et al., 2018) | 36.0 | 36.4 | Fast Parametric Learning with Activation Memorization | |
Gated CNN (Dauphin et al., 2016) | - | 37.2 | Language modeling with gated convolutional networks | |
Temporal CNN (Bai et al., 2018) | - | 45.2 | Convolutional sequence modeling revisited | |
LSTM (Graves et al., 2014) | - | 48.7 | Neural turing machines | |
The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
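Results on character-level datasets are reported in bits per character (BPC), the average negative base-2 log-probability per character. The sketch below assumes a model whose loss is an average negative log-likelihood in nats and converts it to BPC; it also shows how the number of unique byte values in a byte-level corpus could be counted. Both function names are hypothetical.

```python
import math


def bits_per_character(nll_nats):
    """Convert an average per-character loss from nats to bits."""
    return nll_nats / math.log(2)


def unique_byte_count(data):
    """Number of distinct byte values that occur in a bytes object."""
    return len(set(data))
```

For a byte-level dataset such as enwik8, each byte is treated as one character, so `unique_byte_count` applied to the raw file contents gives the size of the symbol inventory.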
Model | Bits per character (BPC) | Number of params | Paper / Source |
---|---|---|---|
mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | Dynamic Evaluation of Neural Sequence Models |
3 layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | An Analysis of Neural Language Modeling at Multiple Scales |
Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | Fast-Slow Recurrent Neural Networks |
Large mLSTM +emb +WN +VD (Krause et al., 2017) | 1.24 | 46M | Multiplicative LSTM for sequence modelling |
FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | Fast-Slow Recurrent Neural Networks |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks |
The text8 dataset is also derived from Wikipedia text, but has all XML removed and is lowercased so that it contains only the 26 characters of English text plus spaces.
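A simplified sketch of this kind of normalization is shown below. It is not the script used to build text8: the function name `to_text8_alphabet` is hypothetical, and details such as how digits or accented letters are treated are assumptions (here they are simply replaced by spaces).

```python
import re


def to_text8_alphabet(text):
    """Reduce text to lowercase a-z and single spaces."""
    text = text.lower()
    # Assumption: every run of characters outside a-z becomes one space.
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()
```

After this step the symbol inventory has at most 27 entries, which is why text8 is much more constrained than the byte-level enwik8 data.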
Model | Bits per character (BPC) | Number of params | Paper / Source |
---|---|---|---|
mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | Dynamic Evaluation of Neural Sequence Models |
Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | Multiplicative LSTM for sequence modelling |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks |
LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | Hierarchical Multiscale Recurrent Neural Networks |
BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | Recurrent Batch Normalization |
Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | Multiplicative LSTM for sequence modelling |
In the character-level Penn Treebank dataset, the word vocabulary is limited to 10,000 words, the same vocabulary as in the word-level dataset. This vastly simplifies character-level language modeling, because character transitions are limited to those found within the word-level vocabulary.
Model | Bits per character (BPC) | Number of params | Paper / Source |
---|---|---|---|
3 layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales |
6 layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales |
FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | Fast-Slow Recurrent Neural Networks |
FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | Fast-Slow Recurrent Neural Networks |
NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | Neural Architecture Search with Reinforcement Learning |
2-Layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | HyperNetworks |