- 2015 NIPS. Andrew M. Dai, Quoc V. Le. Google.
- use unlabeled data for pretraining to improve text classification
- how to use the unlabeled data? pretraining, in two ways:
- language-model style: predict the next word/token in the sequence
- sequence autoencoder: read the input sequence into a vector and predict the input sequence again
- use the weights learned from unlabeled data to initialize the supervised models (see the sketch after this entry)
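A minimal sketch of this pretrain-then-initialize recipe, assuming PyTorch; the `LMPretrainer`/`Classifier` classes and the toy sizes are illustrative placeholders, not the paper's code:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, NUM_CLASSES = 10000, 128, 256, 2

class LMPretrainer(nn.Module):
    """Next-token language model used only for pretraining on unlabeled text."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)                    # logits for the next token at each step

class Classifier(nn.Module):
    """Supervised text classifier that reuses the pretrained embedding/LSTM weights."""
    def __init__(self, pretrained: LMPretrainer):
        super().__init__()
        self.emb, self.rnn = pretrained.emb, pretrained.rnn   # initialize from the LM
        self.head = nn.Linear(HID, NUM_CLASSES)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h[:, -1])            # classify from the last hidden state

lm = LMPretrainer()
# ... pretrain `lm` with cross-entropy on next-token prediction over unlabeled text ...
clf = Classifier(lm)                          # then fine-tune on the labeled task
```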
- 2017 NIPS. The Transformer (predates BERT). Google Brain.
- relies only on attention, with no RNN or CNN; parallelizes well and needs less training time (see the attention sketch after this entry)
- The biggest benefit, however, comes from how The Transformer lends itself to parallelization.
- understanding related
- code related
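A minimal sketch of scaled dot-product attention, the building block referred to above, assuming PyTorch; the multi-head projections and feed-forward sub-layers are omitted:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (batch, seq_len, d_k). All positions are processed at once,
    which is what makes the Transformer easy to parallelize compared to an RNN.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (batch, seq, seq)
    if mask is not None:                                   # e.g. a causal mask in the decoder
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                     # (batch, seq, d_k)

x = torch.randn(2, 5, 64)                     # toy batch: 2 sequences, 5 tokens, d_k = 64
out = scaled_dot_product_attention(x, x, x)   # self-attention
```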
- ELMo. NAACL, 2018.
- OpenAI GPT (Radford et al., 2018)
- pretrain a Transformer decoder with a language-model loss, then fine-tune it on different tasks
- in the fine-tuning stage, convert all structured inputs into token sequences for the pre-trained model, followed by a linear+softmax layer (see the sketch after this entry)
- blog
- code
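A minimal sketch of the fine-tuning head described above, assuming PyTorch; the `decoder` argument stands in for the pretrained Transformer decoder and its call signature is an assumption, not OpenAI's code:

```python
import torch
import torch.nn as nn

class GPTClassifier(nn.Module):
    """Pretrained decoder + linear/softmax head, the fine-tuning recipe in the GPT paper."""
    def __init__(self, decoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.decoder = decoder                     # pretrained Transformer decoder (placeholder)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids):
        # Structured input (e.g. premise + delimiter + hypothesis) is already flattened
        # into one token sequence before it reaches the model.
        hidden = self.decoder(token_ids)           # assumed shape: (batch, seq_len, hidden_size)
        logits = self.head(hidden[:, -1])          # classify from the last token's state
        return torch.log_softmax(logits, dim=-1)
```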
- 2018. Google.
- [tensorflow version, but GPU training is single-GPU only](https://github.com/google-research/bert)
- bidirectional transformer + finetune
- the relation between BERT and the Transformer
- BERT uses the Transformer's encoder layers
- the model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder.
- base parameters (12 layers; parameter-count sketch after this entry)
- 108M (12 layers, hidden: 768, embedding: 768, no parameter sharing)
- large parameters (24 layers)
- 334M (24 layers, hidden: 1024, embedding: 1024, no parameter sharing)
- xlarge
- 1270M (24 layers, hidden: 2048, embedding: 2048, no parameter sharing)
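A rough back-of-the-envelope check of the parameter counts above, using a hypothetical helper and the published BERT hyperparameters (about 30k WordPiece vocabulary, FFN size = 4 x hidden, 512 positions); biases and LayerNorm weights are ignored, so the totals are approximate:

```python
def approx_bert_params(layers, hidden, vocab=30522, max_pos=512):
    embeddings = (vocab + max_pos + 2) * hidden   # token + position + segment embeddings
    attention = 4 * hidden * hidden               # Q, K, V and output projections
    ffn = 2 * hidden * (4 * hidden)               # two linear layers, intermediate size = 4 * hidden
    return embeddings + layers * (attention + ffn)

print(approx_bert_params(12, 768) / 1e6)    # ~108.8M  (BERT-base, noted as 108M above)
print(approx_bert_params(24, 1024) / 1e6)   # ~333.8M  (BERT-large, noted as 334M above)
```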
- 2019 OpenAI GPT-2.0.
- Dataset: WebText. Larger data and a larger network; surpasses most SOTA results on language-modeling benchmarks; also tries predicting downstream tasks without any supervised data and beats many baselines.
- OpenAI blog
- 117M model and code
- problems: repetitive text, world modeling failures (common-sense errors), unnatural topic switching
- “zero-shot” setting: Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test
- not releasing the dataset, training code, or GPT-2 model weights.
- zero-shot, BPE (byte-pair encoding) input, better data quality + more data + deeper network (see the sketch after this entry)
- Zhang Junlin, 效果惊人的GPT 2.0模型:它告诉了我们什么 (The astonishing GPT-2.0 model: what it tells us)
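A minimal zero-shot sketch with the smallest released GPT-2 checkpoint, assuming the Hugging Face `transformers` package (not part of OpenAI's original release); the prompt and generation settings are only illustrative:

```python
# pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # smallest released GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# "Zero-shot": no task-specific training; the task is expressed only in the prompt.
prompt = "Translate English to French: cheese =>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```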
- 2019 CMU & Google Brain.
- contributions
- a segment-level recurrence mechanism and a relative positional encoding scheme (see the recurrence sketch after this entry)
- codes
- official code, pytorch & tf: kimiyoung/transformer-xl
- blogs
- official blog: Transformer-XL: Unleashing the Potential of Attention Models
- applications
- word-level language modeling
- character-level language modeling
- generate relatively coherent long text articles with thousands of tokens trained on only 100M tokens
- todo
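A minimal sketch of segment-level recurrence, assuming PyTorch; `ToyXLLayer` is a made-up simplification and the relative positional encoding is left out:

```python
import torch
import torch.nn as nn

class ToyXLLayer(nn.Module):
    """One simplified layer: attention over [memory; current segment] (relative positions omitted)."""
    def __init__(self, hidden=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, x, context):
        out, _ = self.attn(query=x, key=context, value=context)
        return out

def forward_with_memory(layers, segment, memories):
    """Segment-level recurrence: each layer also attends to its cached states from the previous segment."""
    new_memories, h = [], segment
    for layer, mem in zip(layers, memories):
        new_memories.append(h.detach())          # stop-gradient: cache, but don't backprop into the old segment
        context = torch.cat([mem, h], dim=1)     # keys/values cover memory + current segment
        h = layer(h, context)                    # queries come only from the current segment
    return h, new_memories

layers = [ToyXLLayer() for _ in range(2)]
mems = [torch.zeros(1, 8, 64) for _ in layers]   # empty memory before the first segment
for seg in torch.randn(3, 1, 8, 64):             # three consecutive segments of length 8
    out, mems = forward_with_memory(layers, seg, mems)
```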
- 2019, Yinhan Liu & ... & Danqi Chen.
- carefully measures the impact of many key hyperparameters and training data size.
- BERT was significantly undertrained, and can match or exceed the performance of every model published after it.
- codes
- todo
- 2019 Yang et al.
- code
- Chinese
- English
- todo
- done. ALBERT.
- contributions
- reduce parameters (see the parameter-reduction sketch after this entry)
- width: decompose the large vocabulary embedding matrix into two small matrices.
- depth: cross-layer parameter sharing.
- parameters:
- base: 12M, layers: 12, hidden: 768, embedding: 128, parameters shared
- large: 18M, layers: 24, hidden: 1024, embedding: 128, parameters shared
- xlarge: 59M, layers: 24, hidden: 2048, embedding: 128, parameters shared
- xxlarge: 233M, layers: 12, hidden: 4096, embedding: 128, parameters shared
- loss
- a self-supervised loss for sentence-order prediction (SOP), which primarily focuses on inter-sentence coherence, to address the ineffectiveness of the next-sentence prediction (NSP) loss in the original BERT.
- others
- remove dropout to enlarge model capacity (the model is trained for many steps yet does not overfit the training data)
- use the LAMB optimizer to train with a large batch size
- use n-gram masking (P(unigram) > P(bigram) > P(trigram)) for the masked language model
- codes
- Chinese
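A back-of-the-envelope sketch of the two parameter-reduction tricks (factorized embedding and cross-layer sharing), assuming a 30k vocabulary and the base configuration above; biases, LayerNorm, and the small projections are ignored, so the numbers are only indicative:

```python
def transformer_layer_params(hidden):
    return 4 * hidden * hidden + 2 * hidden * (4 * hidden)    # attention projections + FFN

vocab, hidden, emb, layers = 30000, 768, 128, 12

bert_style = vocab * hidden + layers * transformer_layer_params(hidden)
albert_style = (vocab * emb + emb * hidden                    # factorized embedding: V*E + E*H instead of V*H
                + 1 * transformer_layer_params(hidden))       # cross-layer sharing: one layer's weights reused

print(f"BERT-style:   {bert_style / 1e6:.1f}M")    # ~108M
print(f"ALBERT-style: {albert_style / 1e6:.1f}M")  # ~11M, close to the 12M reported for ALBERT-base
```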
- ACL.
- MSRA. 2019.
- applies BERT to text summarization with a two-stage approach: in stage one, a transformer produces a draft summary; in stage two, a second transformer refines the draft together with the source document.
- adds an extra loss targeting ROUGE, the standard summarization metric; because GPU memory is limited, the batch size cannot be large, so updates are delayed over multiple steps (gradient accumulation; see the sketch after this entry).
- todo
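A minimal sketch of the multi-step delayed update (gradient accumulation) mentioned above, assuming PyTorch; the tiny model and data are toy stand-ins for the paper's summarizer:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the loop below runs; in the paper these would be the BERT-based summarizer and its data.
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(32)]

accum_steps = 8                                    # effective batch size = 4 * 8 = 32
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader, start=1):
    loss = loss_fn(model(inputs), targets) / accum_steps     # scale so the accumulated gradient is an average
    loss.backward()                                          # gradients add up in .grad across steps
    if step % accum_steps == 0:
        optimizer.step()                                     # one delayed update every accum_steps batches
        optimizer.zero_grad()
```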
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu, et al. 2020.
- Fast Transformer
- MT-BERT pretraining
- reduced precision
- mixed-precision training with Float32 (FP32) and Float16 (FP16) to speed up training and inference (see the sketch at the end of this section);
- domain adaptation
- large amounts of Meituan's own UGC data
- knowledge infusion
- the Knowledge-aware Masking method injects entity knowledge from "Meituan Brain" (美团大脑) into MT-BERT pretraining;
- model compression (lightweighting)
- model pruning, knowledge distillation, and Fast Transformer optimization to meet online-serving requirements;
- reduced precision
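A minimal sketch of FP32/FP16 mixed-precision training, assuming PyTorch's built-in AMP utilities rather than Meituan's internal tooling; the model and data are toy stand-ins:

```python
import torch
import torch.nn as nn

# Toy stand-ins; in MT-BERT these would be the BERT model and its pretraining data.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 2).to(device)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(10)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))   # scales the loss to avoid FP16 underflow

for inputs, targets in loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):    # forward pass runs in FP16 where safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales gradients, then updates the FP32 master weights
    scaler.update()
```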