- 2015 NIPS. Andrew M. Dai, Quoc V. Le. Google.
- use unlabeled data for pretraining to improve text classification
- how to use the unlabeled data? pretraining, in two ways:
- language-model style: predict the next word/token in the sequence
- sequence autoencoder: read the input sequence into a vector and predict the input sequence again
- use the weights learned from unlabeled data to initialize the supervised models (see the sketch after this entry)
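A minimal sketch of this pretrain-then-initialize recipe, assuming PyTorch; the `LMPretrainer`/`Classifier` classes and the toy sizes are illustrative placeholders, not the paper's code:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, NUM_CLASSES = 10000, 128, 256, 2

class LMPretrainer(nn.Module):
    """Next-token language model used only for pretraining on unlabeled text."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)                    # logits for the next token at each step

class Classifier(nn.Module):
    """Supervised text classifier that reuses the pretrained embedding/LSTM weights."""
    def __init__(self, pretrained: LMPretrainer):
        super().__init__()
        self.emb, self.rnn = pretrained.emb, pretrained.rnn   # initialize from the LM
        self.head = nn.Linear(HID, NUM_CLASSES)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h[:, -1])            # classify from the last hidden state

lm = LMPretrainer()
# ... pretrain `lm` with cross-entropy on next-token prediction over unlabeled text ...
clf = Classifier(lm)                          # then fine-tune on the labeled task
```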
- 2017 NIPS. The Transformer (predates BERT). Google Brain.
- relies only on attention, with no RNN or CNN; parallelizes well and needs less training time (see the attention sketch after this entry)
- The biggest benefit, however, comes from how The Transformer lends itself to parallelization.
- understanding related
- code related
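A minimal sketch of scaled dot-product attention, the building block referred to above, assuming PyTorch; the multi-head projections and feed-forward sub-layers are omitted:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (batch, seq_len, d_k). All positions are processed at once,
    which is what makes the Transformer easy to parallelize compared to an RNN.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (batch, seq, seq)
    if mask is not None:                                   # e.g. a causal mask in the decoder
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                     # (batch, seq, d_k)

x = torch.randn(2, 5, 64)                     # toy batch: 2 sequences, 5 tokens, d_k = 64
out = scaled_dot_product_attention(x, x, x)   # self-attention
```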
- ELMo. NAACL, 2018.
- OpenAI GPT (Radford et al., 2018)
- pretrain a Transformer decoder with a language-model loss, then fine-tune it on different tasks
- in the fine-tuning stage, convert all structured inputs into token sequences for the pre-trained model, followed by a linear+softmax layer (see the sketch after this entry)
- blog
- code
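A minimal sketch of the fine-tuning head described above, assuming PyTorch; the `decoder` argument stands in for the pretrained Transformer decoder and its call signature is an assumption, not OpenAI's code:

```python
import torch
import torch.nn as nn

class GPTClassifier(nn.Module):
    """Pretrained decoder + linear/softmax head, the fine-tuning recipe in the GPT paper."""
    def __init__(self, decoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.decoder = decoder                     # pretrained Transformer decoder (placeholder)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids):
        # Structured input (e.g. premise + delimiter + hypothesis) is already flattened
        # into one token sequence before it reaches the model.
        hidden = self.decoder(token_ids)           # assumed shape: (batch, seq_len, hidden_size)
        logits = self.head(hidden[:, -1])          # classify from the last token's state
        return torch.log_softmax(logits, dim=-1)
```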
- 2018. Google.
- [tensorflow version, but GPU training is single-GPU only](https://github.com/google-research/bert)
- bidirectional transformer + finetune
- the relation between BERT and the Transformer
- BERT uses the Transformer's encoder layers
- the model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder.
- base parameters (12 layers; parameter-count sketch after this entry)
- 108M (12 layers, hidden: 768, embedding: 768, no parameter sharing)
- large parameters (24 layers)
- 334M (24 layers, hidden: 1024, embedding: 1024, no parameter sharing)
- xlarge
- 1270M (24 layers, hidden: 2048, embedding: 2048, no parameter sharing)
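A rough back-of-the-envelope check of the parameter counts above, using a hypothetical helper and the published BERT hyperparameters (about 30k WordPiece vocabulary, FFN size = 4 x hidden, 512 positions); biases and LayerNorm weights are ignored, so the totals are approximate:

```python
def approx_bert_params(layers, hidden, vocab=30522, max_pos=512):
    embeddings = (vocab + max_pos + 2) * hidden   # token + position + segment embeddings
    attention = 4 * hidden * hidden               # Q, K, V and output projections
    ffn = 2 * hidden * (4 * hidden)               # two linear layers, intermediate size = 4 * hidden
    return embeddings + layers * (attention + ffn)

print(approx_bert_params(12, 768) / 1e6)    # ~108.8M  (BERT-base, noted as 108M above)
print(approx_bert_params(24, 1024) / 1e6)   # ~333.8M  (BERT-large, noted as 334M above)
```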
- 2019 OpenAI GPT-2.0.
- Dataset: WebText. Larger data and a larger network; surpasses most SOTA results on language-modeling benchmarks; also tries predicting downstream tasks without any supervised data and beats many baselines.
- OpenAI blog
- 117M model and code
- problems: repetitive text, world modeling failures (common-sense errors), unnatural topic switching
- “zero-shot” setting: Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test
- not releasing the dataset, training code, or GPT-2 model weights.
- zero-shot, BPE (byte-pair encoding) input, better data quality + more data + deeper network (see the sketch after this entry)
- Zhang Junlin, 效果惊人的GPT 2.0模型:它告诉了我们什么 (The astonishing GPT-2.0 model: what it tells us)
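A minimal zero-shot sketch with the smallest released GPT-2 checkpoint, assuming the Hugging Face `transformers` package (not part of OpenAI's original release); the prompt and generation settings are only illustrative:

```python
# pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # smallest released GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# "Zero-shot": no task-specific training; the task is expressed only in the prompt.
prompt = "Translate English to French: cheese =>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```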
- 2019 CMU & Google Brain.
- contributions
- a segment-level recurrence mechanism and a relative positional encoding scheme (see the recurrence sketch after this entry)
- codes
- official code, pytorch & tf: kimiyoung/transformer-xl
- blogs
- official blog: Transformer-XL: Unleashing the Potential of Attention Models
- applications
- word-level language modeling
- character-level language modeling
- generate relatively coherent long text articles with thousands of tokens trained on only 100M tokens
- todo
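A minimal sketch of segment-level recurrence, assuming PyTorch; `ToyXLLayer` is a made-up simplification and the relative positional encoding is left out:

```python
import torch
import torch.nn as nn

class ToyXLLayer(nn.Module):
    """One simplified layer: attention over [memory; current segment] (relative positions omitted)."""
    def __init__(self, hidden=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, x, context):
        out, _ = self.attn(query=x, key=context, value=context)
        return out

def forward_with_memory(layers, segment, memories):
    """Segment-level recurrence: each layer also attends to its cached states from the previous segment."""
    new_memories, h = [], segment
    for layer, mem in zip(layers, memories):
        new_memories.append(h.detach())          # stop-gradient: cache, but don't backprop into the old segment
        context = torch.cat([mem, h], dim=1)     # keys/values cover memory + current segment
        h = layer(h, context)                    # queries come only from the current segment
    return h, new_memories

layers = [ToyXLLayer() for _ in range(2)]
mems = [torch.zeros(1, 8, 64) for _ in layers]   # empty memory before the first segment
for seg in torch.randn(3, 1, 8, 64):             # three consecutive segments of length 8
    out, mems = forward_with_memory(layers, seg, mems)
```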
- 2019, Yinhan Liu & ... & Danqi Chen.
- carefully measures the impact of many key hyperparameters and training data size.
- BERT was significantly undertrained, and can match or exceed the performance of every model published after it.
- codes
- todo
- 2019 Yang et al.
- code
- Chinese
- English
- todo
- done. ALBERT.
- contributions
- reduce parameters (see the parameter-reduction sketch after this entry)
- width: decompose the large vocabulary embedding matrix into two small matrices.
- depth: cross-layer parameter sharing.
- parameters:
- base: 12M, layers: 12, hidden: 768, embedding: 128, parameters shared
- large: 18M, layers: 24, hidden: 1024, embedding: 128, parameters shared
- xlarge: 59M, layers: 24, hidden: 2048, embedding: 128, parameters shared
- xxlarge: 233M, layers: 12, hidden: 4096, embedding: 128, parameters shared
- loss
- a self-supervised loss for sentence-order prediction (SOP), which primarily focuses on inter-sentence coherence, to address the ineffectiveness of the next-sentence prediction (NSP) loss in the original BERT.
- others
- remove dropout to enlarge model capacity (the model is trained for many steps yet does not overfit the training data)
- use the LAMB optimizer to train with a large batch size
- use n-gram masking (P(unigram) > P(bigram) > P(trigram)) for the masked language model
- codes
- Chinese
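A back-of-the-envelope sketch of the two parameter-reduction tricks (factorized embedding and cross-layer sharing), assuming a 30k vocabulary and the base configuration above; biases, LayerNorm, and the small projections are ignored, so the numbers are only indicative:

```python
def transformer_layer_params(hidden):
    return 4 * hidden * hidden + 2 * hidden * (4 * hidden)    # attention projections + FFN

vocab, hidden, emb, layers = 30000, 768, 128, 12

bert_style = vocab * hidden + layers * transformer_layer_params(hidden)
albert_style = (vocab * emb + emb * hidden                    # factorized embedding: V*E + E*H instead of V*H
                + 1 * transformer_layer_params(hidden))       # cross-layer sharing: one layer's weights reused

print(f"BERT-style:   {bert_style / 1e6:.1f}M")    # ~108M
print(f"ALBERT-style: {albert_style / 1e6:.1f}M")  # ~11M, close to the 12M reported for ALBERT-base
```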
- ACL.
- MSRA. 2019.
- applies BERT to text summarization with a two-stage approach: in stage one, a transformer produces a draft summary; in stage two, a second transformer refines the draft together with the source document.
- adds an extra loss targeting ROUGE, the standard summarization metric; because GPU memory is limited, the batch size cannot be large, so updates are delayed over multiple steps (gradient accumulation; see the sketch after this entry).
- todo
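A minimal sketch of the multi-step delayed update (gradient accumulation) mentioned above, assuming PyTorch; the tiny model and data are toy stand-ins for the paper's summarizer:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the loop below runs; in the paper these would be the BERT-based summarizer and its data.
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(32)]

accum_steps = 8                                    # effective batch size = 4 * 8 = 32
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader, start=1):
    loss = loss_fn(model(inputs), targets) / accum_steps     # scale so the accumulated gradient is an average
    loss.backward()                                          # gradients add up in .grad across steps
    if step % accum_steps == 0:
        optimizer.step()                                     # one delayed update every accum_steps batches
        optimizer.zero_grad()
```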
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu, et al. 2020.
- Fast Transformer
- MT-BERT pretraining
- reduced precision
- mixed-precision training with Float32 (FP32) and Float16 (FP16) to speed up training and inference (see the sketch at the end of this section);
- domain adaptation
- large amounts of Meituan's own UGC data
- knowledge infusion
- the Knowledge-aware Masking method injects entity knowledge from "Meituan Brain" (美团大脑) into MT-BERT pretraining;
- model compression (lightweighting)
- model pruning, knowledge distillation, and Fast Transformer optimization to meet online-serving requirements;
- reduced precision
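A minimal sketch of FP32/FP16 mixed-precision training, assuming PyTorch's built-in AMP utilities rather than Meituan's internal tooling; the model and data are toy stand-ins:

```python
import torch
import torch.nn as nn

# Toy stand-ins; in MT-BERT these would be the BERT model and its pretraining data.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 2).to(device)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(10)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))   # scales the loss to avoid FP16 underflow

for inputs, targets in loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):    # forward pass runs in FP16 where safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales gradients, then updates the FP32 master weights
    scaler.update()
```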