Conversation
Nice benchmarking! I have some graphs here also on fp32 vs. fp16 that I should probably put in the readme somewhere. The requirement for CUDA 10.0 may come from the fact that not all our dimensions are multiples of 8 (e.g. the 12 heads). Padding all dimensions to multiples of 8 seems to help for other people. Do you know why you need the master branch of https://github.com/huggingface/pytorch-pretrained-BERT and why version 0.4.0 is not enough?
Thanks for the link! I actually didn't try out 0.4.0 of https://github.com/huggingface/pytorch-pretrained-BERT, just the master branch. After running a test now, everything is working with 0.4.0. The allennlp data iterators all pack batches with minimal padding, so in addition to the number of heads, every batch will have a different sequence length. It's a lot easier to say "upgrade to CUDA 10" than to change the low-level iterators, especially for a feature like this that will probably never be fully supported.
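For illustration, here is a minimal sketch of the padding workaround discussed above: right-padding each batch's sequence dimension to a multiple of 8 so tensor cores can be used regardless of the iterator's bucketing. The helper name and pad id are hypothetical and not part of this PR.

```python
import torch

def pad_to_multiple_of_8(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Right-pad a (batch, seq_len) id tensor so seq_len is a multiple of 8."""
    seq_len = token_ids.size(1)
    remainder = seq_len % 8
    if remainder == 0:
        return token_ids
    padding = token_ids.new_full((token_ids.size(0), 8 - remainder), pad_id)
    return torch.cat([token_ids, padding], dim=1)
```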
@matt-peters Is fp16 also supported when training a Transformer-based ELMo model?
Not currently, it would take a little work. At first glance, the main issue is probably that the bidirectional transformer implementation isn't fp16 compatible. It would be pretty easy to swap it out for one that is (e.g. re-use the NVIDIA optimized implementation from https://github.com/huggingface/pytorch-pretrained-BERT). The NVIDIA FusedAdam optimizer probably doesn't support sparse gradients, so it'd also be necessary to switch to full embedding lookups for the softmax (which is a configuration change).
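For context, the sparse-gradient issue is standard PyTorch behaviour rather than anything allennlp-specific: an `nn.Embedding` constructed with `sparse=True` produces sparse gradients, which dense-only optimizers such as FusedAdam can't consume. A small plain-PyTorch illustration (not the actual allennlp configuration change):

```python
import torch
import torch.nn as nn

# sparse=True makes the embedding emit sparse gradients, which optimizers like
# apex's FusedAdam do not handle; sparse=False keeps the gradients dense.
dense_emb = nn.Embedding(num_embeddings=50000, embedding_dim=512, sparse=False)

loss = dense_emb(torch.tensor([[1, 2, 3]])).sum()
loss.backward()
print(dense_emb.weight.grad.is_sparse)  # False: dense gradient, safe for FusedAdam
```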
Hi @matt-peters, the results look great! Are you still working on it?
@matt-peters, what's the status of this PR? I seem to remember that @brendan-ai2 was going to look at this kind of efficiency stuff soon. Is this PR still useful?
I believe this implementation is out of date now. I have a more recent implementation at https://github.com/joelgrus/allennlp/tree/fp16 that was working as of about a month ago, but I was never able to verify that it led to any improvements; I think Brendan is looking at it now-ish with more appropriate GPUs.
Ok, I'm going to close this PR then. It's easy to re-open if it turns out we actually want it.
This is WIP, but it is in a working state, so I will leave it here. This adds half precision (16-bit) training to allennlp on GPU. I was able to get roughly a 3x speedup and 50% less memory use on a V100 when fine-tuning BERT-base for text classification (CoLA). Here are some benchmarks from the third epoch of training (since allennlp overhead for indexing the dataset significantly slows down the first epoch):
- fp32 without apex: 7.38 it/s, 6342 MiB memory
- fp32 with apex: 7.93 it/s, 6030 MiB (the speedup comes from using their `FusedLayerNorm`, which is an efficient implementation of layer norm)
- fp16 with the BertAdam optimizer: 7.42 it/s, 4727 MiB
- fp16 with the FusedAdam optimizer: 23.73 it/s, 3721 MiB (`FusedAdam` is an efficient GPU implementation of Adam)

The caveats:

- it requires apex (https://github.com/nvidia/apex), which I had to fork to incorporate two open issues to get it to work with the Trainer (https://github.com/matt-peters/apex/tree/allennlp; see "apex.optimizers.FP16_Optimizer: add state_dict() and load_state_dict()" NVIDIA/apex#123 and "Error if the gradient of tensor is None." NVIDIA/apex#131)
- it requires the master branch of https://github.com/huggingface/pytorch-pretrained-BERT
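For readers trying to reproduce the numbers above, here is a rough sketch of how the apex pieces fit together in a bare fp16 training loop. It follows the old `apex.optimizers.FP16_Optimizer` / `FusedAdam` interface referenced in the issues above (exact constructor arguments and methods may differ between apex versions), and `build_model` / `data_loader` are placeholders rather than AllenNLP code; this is not the actual Trainer integration from this PR.

```python
import torch
from apex.optimizers import FusedAdam, FP16_Optimizer

# Placeholder: any model already moved to GPU and converted to fp16.
model = build_model().cuda().half()

# Wrap the fused optimizer so gradients are accumulated/scaled in fp32.
optimizer = FP16_Optimizer(
    FusedAdam(model.parameters(), lr=2e-5),
    dynamic_loss_scale=True,  # assumption: dynamic loss scaling is enabled
)

for batch in data_loader:  # placeholder iterator yielding fp16 CUDA tensors
    optimizer.zero_grad()
    loss = model(**batch)["loss"]
    optimizer.backward(loss)  # FP16_Optimizer applies loss scaling here
    optimizer.step()
```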