Papers

  • 2015 NIPS. Andrew M. Dai, Quoc V. Le. Google.
  • use unlabeled data for pretraining to improve text classification
  • how to use unlabeled data? pretraining
    • language-model manner: predict the next word/token in the sequence
    • autoencoder: read the input sequence into a vector and predict the input sequence again
  • use the weights learned from unlabeled data to initialize the supervised models (a sketch follows below)
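A minimal sketch of this initialization scheme, assuming PyTorch and made-up sizes (none of this is the paper's own code): pretrain a recurrent language model on unlabeled text, then copy its embedding and LSTM weights into a classifier before supervised training.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, NUM_CLASSES = 10000, 128, 256, 2   # hypothetical sizes

class LanguageModel(nn.Module):
    """Pretraining model: predicts the next token in the sequence."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                     # logits over next tokens

class Classifier(nn.Module):
    """Supervised model: same encoder shapes, plus a classification head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.head = nn.Linear(HID, NUM_CLASSES)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])             # classify from the last hidden state

lm = LanguageModel()
# ... pretrain `lm` on unlabeled text with a next-token cross-entropy loss
#     (or train a sequence autoencoder instead, as the paper also proposes) ...
clf = Classifier()
clf.embed.load_state_dict(lm.embed.state_dict())   # initialize from pretrained weights
clf.rnn.load_state_dict(lm.rnn.state_dict())
# ... fine-tune `clf` on the labeled classification data ...
```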
  • ELMo. NAACL, 2018.
  • OpenAI GPT (Radford et al., 2018)
  • use the Transformer decoder for pretraining with a language-model loss, then fine-tune on different tasks
  • in the fine-tuning stage, all structured inputs are converted into token sequences to be processed by the pre-trained model, followed by a linear + softmax layer (see the sketch after this entry)
  • blog
  • code
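A minimal sketch of that fine-tuning recipe, assuming PyTorch; the special token ids, sizes, and the `pretrained_decoder` placeholder below are hypothetical, not OpenAI's code. A structured input (here a premise/hypothesis pair for entailment) is serialized into one token sequence, run through the pretrained decoder, and classified from the final token's representation.

```python
import torch
import torch.nn as nn

START, DELIM, EXTRACT = 50000, 50001, 50002   # hypothetical special token ids
HIDDEN, NUM_CLASSES = 768, 3                  # e.g. entailment labels

def serialize(premise_ids, hypothesis_ids):
    """Entailment example: [start] premise [delim] hypothesis [extract]."""
    return [START] + premise_ids + [DELIM] + hypothesis_ids + [EXTRACT]

class ClassificationHead(nn.Module):
    """Linear + softmax layer applied to the last token's hidden state."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, hidden_states):          # (batch, seq_len, HIDDEN)
        last = hidden_states[:, -1]            # representation of the [extract] token
        return torch.softmax(self.linear(last), dim=-1)

# tokens = torch.tensor([serialize(premise_ids, hypothesis_ids)])
# hidden = pretrained_decoder(tokens)          # the Transformer decoder from pretraining
# probs  = ClassificationHead()(hidden)        # fine-tune with a classification loss
#                                              # (plus an auxiliary LM loss, as in the paper)
```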
  • 2018. Google.
  • [TensorFlow version, but GPU training is single-GPU only](https://github.com/google-research/bert)
  • bidirectional transformer + finetune
  • the relation between BERT and the Transformer
    • BERT uses the Transformer's encoder layers
    • the model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder.
  • base parameters (12 layers)
    • 108M (12 layers, hidden: 768, embedding: 768, no parameter sharing)
  • large parameters (24 layers)
    • 334M (24 layers, hidden: 1024, embedding: 1024, no parameter sharing)
  • xlarge
    • 1270M (24 layers, hidden: 2048, embedding: 2048, no parameter sharing); a quick check of these counts is sketched below
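A quick sanity check of these counts in plain Python; the vocabulary (30522), position (512), and token-type (2) sizes and the 4x-hidden feed-forward width are the standard BERT config values and are assumed here.

```python
def bert_params(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder."""
    ffn = 4 * hidden                                                  # intermediate size
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden # + LayerNorm
    per_layer = (
        4 * (hidden * hidden + hidden)                          # Q, K, V, attention output
        + (hidden * ffn + ffn) + (ffn * hidden + hidden)        # feed-forward block
        + 2 * 2 * hidden                                        # two LayerNorms
    )
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print(bert_params(12, 768) / 1e6)    # ~109M, close to the ~108M base figure above
print(bert_params(24, 1024) / 1e6)   # ~335M, close to the ~334M large figure above
```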
  • 2019 OpenAI GPT-2.0.
  • Dataset: WebText. With more data and a larger network it surpasses most state-of-the-art results on language-modeling benchmarks, and it tries to predict downstream tasks without any supervised data, beating many baselines.
  • openAI blog
  • 117M model and code
  • problems: repetitive text, world-modeling failures (commonsense errors), unnatural topic switching
  • “zero-shot” setting: Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test
  • not releasing the dataset, training code, or GPT-2 model weights.
  • zero-shot, BPE input, better data quality + more data + deeper network (a toy BPE merge step is sketched after this entry)
  • Zhang Junlin (张俊林), "The impressive GPT-2.0 model: what does it tell us"
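A toy sketch of the BPE idea in plain Python (word-level and heavily simplified; GPT-2 actually uses a byte-level BPE with a learned merge table): count adjacent symbol pairs over the corpus and repeatedly merge the most frequent pair into a new symbol.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: list of words, each a list of symbols, e.g. ['l', 'o', 'w', 'e', 'r']."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("newer"), list("wider")]
for _ in range(5):                        # learn 5 merges
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus)                             # frequent substrings become single tokens
```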
  • 2019 CMU & Google Brain.
  • contributions
    • a segment-level recurrence mechanism and a relative positional encoding scheme (a sketch of the recurrence follows this entry)
  • codes
  • blogs
  • applications
    • word-level language modeling
    • character-level language modeling
    • generates relatively coherent long articles with thousands of tokens, trained on only 100M tokens
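A minimal sketch of the segment-level recurrence only, assuming PyTorch (nn.MultiheadAttention with batch_first needs >=1.9), a single attention layer, and no relative positional encoding, so this shows the caching idea rather than Transformer-XL itself: states from the previous segment are detached from the graph and prepended as extra context.

```python
import torch
import torch.nn as nn

d_model = 64
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

def forward_segment(x, memory):
    """x: (batch, seg_len, d_model); memory: cached states from the previous segment."""
    context = x if memory is None else torch.cat([memory, x], dim=1)
    out, _ = attn(query=x, key=context, value=context)  # attend over memory + current segment
    new_memory = out.detach()     # stop-gradient: reused as context, not backpropagated into
    return out, new_memory

memory = None
for segment in torch.randn(3, 2, 8, d_model):           # 3 consecutive segments of length 8
    out, memory = forward_segment(segment, memory)
```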

[8. RoBERTa: A robustly optimized BERT pretraining approach]

  • todo
  • 2019, Yinhan Liu & ... & Danqi Chen.
    • carefully measures the impact of many key hyperparameters and training data size.
    • finds that BERT was significantly undertrained and, when trained properly, can match or exceed the performance of every model published after it.
  • codes

[9. XLNet: Generalized Autoregressive Pretraining for Language Understanding]

  • todo

[11. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations]

  • done. ALBERT.

  • contributions

    • reduce parameters

      • width: factorize the large vocabulary embedding matrix into two small matrices (see the sketch after this entry).
      • depth: cross-layer parameter sharing.
      • parameters:
        • base: 12M, layers: 12, hidden: 768, embedding: 128, shared parameters
        • large: 18M, layers: 24, hidden: 1024, embedding: 128, shared parameters
        • xlarge: 59M, layers: 24, hidden: 2048, embedding: 128, shared parameters
        • xxlarge: 233M, layers: 12, hidden: 4096, embedding: 128, shared parameters
    • loss

      • a self-supervised loss for sentence-order prediction (SOP), which primarily focuses on inter-sentence coherence, to address the ineffectiveness of the next-sentence-prediction (NSP) loss in the original BERT.
    • others

      • remove dropout to enlarge the model's capacity (the model trains for many steps yet does not overfit the training data)
      • use the LAMB optimizer to train with large batch sizes
      • use n-gram masking for the masked language model, with P(unigram) > P(bigram) > P(trigram)
  • codes
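A back-of-the-envelope check of the embedding factorization and the n-gram masking distribution, in plain Python; the vocabulary size of 30000 matches the paper, the rest follows the configs listed above.

```python
def embedding_params(vocab, hidden, emb=None):
    """BERT-style V*H embedding vs. ALBERT-style V*E + E*H factorization."""
    if emb is None:
        return vocab * hidden
    return vocab * emb + emb * hidden

V, H, E = 30000, 4096, 128                 # xxlarge-sized hidden dimension
print(embedding_params(V, H) / 1e6)        # ~122.9M without factorization
print(embedding_params(V, H, E) / 1e6)     # ~4.4M with factorization
# Cross-layer sharing then pays the transformer-block parameters once instead of
# per layer, which is how ALBERT-xxlarge (hidden 4096) stays at ~233M.

# n-gram masking: the span length n is sampled with p(n) proportional to 1/n,
# so P(unigram) > P(bigram) > P(trigram).
N = 3
p = [(1 / n) / sum(1 / k for k in range(1, N + 1)) for n in range(1, N + 1)]
print(p)                                   # [0.545..., 0.273..., 0.182...]
```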

[12. Language Models as Knowledge Bases?]

  • EMNLP, 2019.

Applications

  • MSRA. 2019.
  • applies BERT to text summarization with a two-stage approach: a stage-one Transformer produces a draft summary, and a stage-two Transformer refines it together with the source document.
  • adds an extra loss targeting ROUGE, the standard evaluation metric for summarization; because GPU memory is limited, the batch_size cannot be large, so updates are delayed and gradients are accumulated over multiple steps (see the sketch below).
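A minimal sketch of the delayed-update (gradient accumulation) trick, assuming PyTorch and a toy model/dataset in place of the actual summarization setup: gradients from several small batches are accumulated and the optimizer steps once per accumulation window, approximating a large batch.

```python
import torch
import torch.nn as nn

# toy model and data just to make the sketch runnable
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(32)]   # small batches of 4

accumulation_steps = 8                          # effective batch size = 4 * 8 = 32
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()      # scale so gradients sum to one big batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                        # delayed update every 8 steps
        optimizer.zero_grad()
```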

Acceleration

[1. LAMB: LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 MINUTES]

  • todo

Surveys

[1. Pre-trained Models for Natural Language Processing: A Survey]. Xipeng Qiu, et al., 2020.

Optimization Directions

To Do

  • Fast Transformer

Pretraining

  • MT-BERT pretraining
    • low precision
      • mixed-precision training with Float32 (FP32) and Float16 (FP16) to speed up training and inference (see the sketch at the end of this section);
    • domain adaptation
      • pretrain on large amounts of Meituan's own UGC data;
    • knowledge injection
      • a Knowledge-aware Masking method injects entity knowledge from "Meituan Brain" (美团大脑) into MT-BERT pretraining;
    • model compression
      • model pruning, knowledge distillation, and Fast Transformer optimization to meet online serving requirements;
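A minimal sketch of FP32/FP16 mixed-precision training, assuming PyTorch's torch.cuda.amp, a CUDA GPU, and a toy classifier standing in for MT-BERT (whose setup is not public here): the forward pass runs under autocast and the loss is scaled to avoid FP16 gradient underflow.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 2).cuda()               # stand-in for a BERT-style classifier head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 768, device="cuda")
    labels = torch.randint(0, 2, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # ops run in FP16 where safe, FP32 otherwise
        loss = nn.functional.cross_entropy(model(inputs), labels)
    scaler.scale(loss).backward()              # scale the loss before backprop
    scaler.step(optimizer)                     # unscales gradients, then updates in FP32
    scaler.update()
```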