This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

WIP: Half precision training #2467

Closed
wants to merge 3 commits

Conversation

matt-peters
Contributor

matt-peters commented Jan 31, 2019

This is WIP, but it's in a working state so I will leave it here. This adds half precision (16-bit) training to allennlp on GPU. I was able to get a ~3x speedup and 50% less memory use on a V100 when fine-tuning BERT-base for text classification (CoLA). Here are some benchmarks from the third epoch of training (since allennlp overhead for indexing the dataset significantly slows down the first epoch):

fp32 without apex: 7.38it/s, 6342MiB memory
fp32 with apex: 7.93it/s, 6030MiB (the speedup comes from using their FusedLayerNorm which is an efficient implementation of layer norm)
fp16 with BertAdam optimizer: 7.42it/s, 4727MiB
fp16 with FusedAdam optimizer: 23.73it/s, 3721MiB (FusedAdam is an efficient GPU implementation of Adam)
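
For readers unfamiliar with the recipe these numbers rely on, here is a minimal sketch of the general mixed-precision approach (fp16 forward/backward, a static loss scale, fp32 master copies of the parameters, and NVIDIA apex's FusedAdam). It is not the code in this PR; `MyModel`, `data_loader`, the batch format, and the loss-scale value are placeholders, and apex is assumed to be installed.

```python
import torch
from apex.optimizers import FusedAdam  # assumes NVIDIA apex is installed

# MyModel and data_loader are placeholders for the model and data pipeline.
model = MyModel().cuda().half()          # run forward/backward in fp16

# Keep fp32 "master" copies of the parameters for the optimizer update,
# so small gradient values are not lost to fp16 rounding.
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
optimizer = FusedAdam(master_params, lr=2e-5)

loss_scale = 128.0                       # static loss scaling

for batch in data_loader:
    model.zero_grad()
    loss = model(**batch)                # assumes the model returns a scalar loss
    (loss * loss_scale).backward()       # scale so fp16 gradients stay representable

    # Copy fp16 gradients into the fp32 master params and unscale them.
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float() / loss_scale

    optimizer.step()                     # fp32 update via FusedAdam

    # Copy the updated fp32 master weights back into the fp16 model.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)
```

apex also provides `apex.normalization.FusedLayerNorm` as a drop-in replacement for `torch.nn.LayerNorm`, which is where the "fp32 with apex" speedup in the numbers above comes from.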

The caveats:

@thomwolf
Contributor

Nice benchmarking! I also have some graphs on fp32 vs. fp16 that I should probably put in the readme somewhere.

The requirement for CUDA 10.0 may come from the fact that not all of our dimensions are multiples of 8 (e.g. the 12 heads). Padding all dimensions to multiples of 8 seems to help for other people.
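
As an illustration of the multiple-of-8 point, a padding helper along these lines is sometimes used so that the resulting matrix multiplies have Tensor-Core-friendly shapes. This is a hypothetical sketch, not code from allennlp or this PR; the helper name and `pad_id` are made up.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple_of_8(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """token_ids: (batch_size, seq_len) integer tensor."""
    seq_len = token_ids.size(1)
    remainder = seq_len % 8
    if remainder == 0:
        return token_ids
    # F.pad pads the last dimension with (left, right) amounts.
    return F.pad(token_ids, (0, 8 - remainder), value=pad_id)

# Example: a batch of length 37 is padded out to length 40.
batch = torch.randint(0, 30000, (16, 37))
print(pad_to_multiple_of_8(batch).shape)   # torch.Size([16, 40])
```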

Do you know why you need the master branch of https://github.com/huggingface/pytorch-pretrained-BERT and why version 0.4.0 is not enough?

@matt-peters
Contributor Author

matt-peters commented Jan 31, 2019

Thanks for the link! I actually didn't try 0.4.0 of https://github.com/huggingface/pytorch-pretrained-BERT, just the master branch. After running a test now, everything works with 0.4.0.

The allennlp data iterators all pack batches with minimal padding, so in addition to the number of heads, every batch will have a different sequence length. It's a lot easier to say "upgrade to CUDA 10" than to change the low-level iterators, especially for a feature like this that will probably never be fully supported.

@stefan-it

@matt-peters Is fp16 also supported when training a Transformer-based ELMo model?

@matt-peters
Contributor Author

matt-peters commented Feb 5, 2019

Not currently; it would take a little work. At first glance, the main issue is probably that the bidirectional transformer implementation isn't fp16-compatible. It would be pretty easy to swap it out for one that is (e.g. re-use the NVIDIA-optimized implementation from https://github.com/huggingface/pytorch-pretrained-BERT). The NVIDIA FusedAdam optimizer probably doesn't support sparse gradients, so it'd also be necessary to switch to full embedding lookups for the softmax (which is a configuration change).
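
To illustrate the sparse-gradient point at the PyTorch level (this is not code from this PR, just a sketch of why a fused GPU optimizer forces dense embedding gradients):

```python
import torch
import torch.nn as nn

vocab_size, dim = 50000, 512
tokens = torch.randint(0, vocab_size, (8, 20))

# sparse=True produces torch.sparse gradients, which many fused GPU
# optimizers (including NVIDIA's FusedAdam) do not accept.
sparse_emb = nn.Embedding(vocab_size, dim, sparse=True)
sparse_emb(tokens).sum().backward()
print(sparse_emb.weight.grad.is_sparse)    # True

# The default (dense) embedding produces ordinary dense gradients,
# which is what an fp16 + FusedAdam setup would need.
dense_emb = nn.Embedding(vocab_size, dim)  # sparse=False by default
dense_emb(tokens).sum().backward()
print(dense_emb.weight.grad.is_sparse)     # False
```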

@khoa-ho

khoa-ho commented Jun 5, 2019

Hi @matt-peters, the results look great! Are you still working on it?

@matt-gardner
Contributor

@matt-peters, what's the status of this PR? I seem to remember that @brendan-ai2 was going to look at this kind of efficiency stuff soon. Is this PR still useful?

@joelgrus
Contributor

I believe this implementation is out of date now.

I have a more recent implementation at

https://github.com/joelgrus/allennlp/tree/fp16

that was working as of ~a month ago, but I was never able to verify that it led to any improvements. I think Brendan is looking at it now-ish with more appropriate GPUs.

@matt-gardner
Contributor

Ok, I'm going to close this PR then. It's easy to re-open if it turns out we actually want it.
