- HitNet: Hybrid Ternary Recurrent Neural Network
- Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
- Q8BERT: Quantized 8Bit BERT
- Reducing Transformer Depth on Demand with Structured Dropout
- BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
- Are Sixteen Heads Really Better than One?
- Structured Pruning of Large Language Models
- Pruning a BERT-based Question Answering Model
- DynaBERT: Dynamic BERT with Adaptive Width and Depth
- TinyBERT: Distilling BERT for Natural Language Understanding
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- A Tensorized Transformer for Language Modeling
- Low-Rank Bottleneck in Multi-head Attention Models
- Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
- What Does BERT Look at? An Analysis of BERT’s Attention
- Visualizing and Understanding Neural Machine Translation
- An Analysis of Encoder Representations in Transformer-Based Machine Translation