Conversation
Nice benchmarking! I have some graphs here also on fp32 vs. fp16 that I should probably put in the readme somewhere. The requirement for CUDA 10.0 may come from the fact that not all our dimensions are multiples of 8 (e.g. the 12 heads). Padding all dimensions to multiples of 8 seems to help for other people. Do you know why you need the master branch of https://github.com/huggingface/pytorch-pretrained-BERT and why version 0.4.0 is not enough?
Thanks for the link! I actually didn't try out 0.4.0 of https://github.com/huggingface/pytorch-pretrained-BERT, just the master branch. After running a test now, everything is working with 0.4.0. The allennlp data iterators all pack batches with minimal padding, so in addition to the number of heads, every batch will have a different sequence length. It's a lot easier to say "upgrade to CUDA 10" than to change the low-level iterators, especially for a feature like this that will probably never be fully supported.
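For illustration, here is a minimal sketch of the padding workaround discussed above: right-padding each batch's sequence dimension to a multiple of 8 so tensor cores can be used regardless of the iterator's bucketing. The helper name and pad id are hypothetical and not part of this PR.

```python
import torch

def pad_to_multiple_of_8(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Right-pad a (batch, seq_len) id tensor so seq_len is a multiple of 8."""
    seq_len = token_ids.size(1)
    remainder = seq_len % 8
    if remainder == 0:
        return token_ids
    padding = token_ids.new_full((token_ids.size(0), 8 - remainder), pad_id)
    return torch.cat([token_ids, padding], dim=1)
```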
@matt-peters Is fp16 also supported when training a Transformer-based ELMo model?
Not currently, it would take a little work. At first glance, the main issue is probably that the bidirectional transformer implementation isn't fp16 compatible. It would be pretty easy to swap it out for one that is (e.g. re-use the NVIDIA optimized implementation from https://github.com/huggingface/pytorch-pretrained-BERT). The NVIDIA FusedAdam optimizer probably doesn't support sparse gradients, so it'd also be necessary to switch to full embedding lookups for the softmax (which is a configuration change).
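For context, the sparse-gradient issue is standard PyTorch behaviour rather than anything allennlp-specific: an `nn.Embedding` constructed with `sparse=True` produces sparse gradients, which dense-only optimizers such as FusedAdam can't consume. A small plain-PyTorch illustration (not the actual allennlp configuration change):

```python
import torch
import torch.nn as nn

# sparse=True makes the embedding emit sparse gradients, which optimizers like
# apex's FusedAdam do not handle; sparse=False keeps the gradients dense.
dense_emb = nn.Embedding(num_embeddings=50000, embedding_dim=512, sparse=False)

loss = dense_emb(torch.tensor([[1, 2, 3]])).sum()
loss.backward()
print(dense_emb.weight.grad.is_sparse)  # False: dense gradient, safe for FusedAdam
```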
Hi @matt-peters, the results look great! Are you still working on it?
@matt-peters, what's the status of this PR? I seem to remember that @brendan-ai2 was going to look at this kind of efficiency stuff soon. Is this PR still useful?
I believe this implementation is out of date now. I have a more recent implementation at https://github.com/joelgrus/allennlp/tree/fp16 that was working as of about a month ago, but I was never able to verify that it led to any improvements; I think Brendan is looking at it now-ish with more appropriate GPUs.
Ok, I'm going to close this PR then. It's easy to re-open if it turns out we actually want it.
This is WIP, but it is in a working state, so I will leave it here. This adds half precision (16-bit) training to allennlp on GPU. I was able to get roughly a 3x speedup and 50% less memory use on a V100 when fine-tuning BERT-base for text classification (CoLA). Here are some benchmarks from the third epoch of training (since allennlp overhead for indexing the dataset significantly slows down the first epoch):
- fp32 without apex: 7.38 it/s, 6342 MiB memory
- fp32 with apex: 7.93 it/s, 6030 MiB (the speedup comes from using their `FusedLayerNorm`, which is an efficient implementation of layer norm)
- fp16 with the BertAdam optimizer: 7.42 it/s, 4727 MiB
- fp16 with the FusedAdam optimizer: 23.73 it/s, 3721 MiB (`FusedAdam` is an efficient GPU implementation of Adam)

The caveats:

- it requires apex (https://github.com/nvidia/apex), which I had to fork to incorporate two open issues to get it to work with the Trainer (https://github.com/matt-peters/apex/tree/allennlp; see "apex.optimizers.FP16_Optimizer: add state_dict() and load_state_dict()" NVIDIA/apex#123 and "Error if the gradient of tensor is None." NVIDIA/apex#131)
- it requires the master branch of https://github.com/huggingface/pytorch-pretrained-BERT
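For readers trying to reproduce the numbers above, here is a rough sketch of how the apex pieces fit together in a bare fp16 training loop. It follows the old `apex.optimizers.FP16_Optimizer` / `FusedAdam` interface referenced in the issues above (exact constructor arguments and methods may differ between apex versions), and `build_model` / `data_loader` are placeholders rather than AllenNLP code; this is not the actual Trainer integration from this PR.

```python
import torch
from apex.optimizers import FusedAdam, FP16_Optimizer

# Placeholder: any model already moved to GPU and converted to fp16.
model = build_model().cuda().half()

# Wrap the fused optimizer so gradients are accumulated/scaled in fp32.
optimizer = FP16_Optimizer(
    FusedAdam(model.parameters(), lr=2e-5),
    dynamic_loss_scale=True,  # assumption: dynamic loss scaling is enabled
)

for batch in data_loader:  # placeholder iterator yielding fp16 CUDA tensors
    optimizer.zero_grad()
    loss = model(**batch)["loss"]
    optimizer.backward(loss)  # FP16_Optimizer applies loss scaling here
    optimizer.step()
```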