
Is there a way to train/fine-tune with fp-16 flag? #126

Closed
devsharma8555 opened this issue Apr 19, 2020 · 4 comments

@devsharma8555

I am trying to train a BERT text classifier for a custom classification task. I have an RTX 2070 to accelerate the workflow.

It often runs out of memory, even with small batch sizes. Is there a way to leverage fp16 support for training?

It would be really helpful and allow the model to train better.
Also, I love your work! Thank you for creating this library.

@amaiya
Owner

amaiya commented Apr 20, 2020

Thanks a lot for your comments.

It looks like there are at least two ways to train with mixed precision.

Method 1:
Add this to the top of your script:

import tensorflow as tf

# enable XLA compilation and TF's automatic mixed precision graph rewrite
tf.config.optimizer.set_jit(True)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

Then, wrap an optimizer in LossScaleOptimizer:

opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")

... where opt is an optimizer like Adam.
Note that you must recompile whatever model you're using with the new optimizer.
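
For example, here is a minimal end-to-end sketch of Method 1 (assuming a Keras model named model that uses categorical cross-entropy; adjust the loss and learning rate to your task):

import tensorflow as tf

tf.config.optimizer.set_jit(True)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

# wrap the base optimizer so gradients are dynamically loss-scaled,
# which prevents small fp16 gradients from underflowing to zero
opt = tf.keras.optimizers.Adam(learning_rate=5e-5)
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")

# recompile so the wrapped optimizer is actually used for training
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])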

Method 2:
As of TF 2.1, the second way is to add this to the top of your script:

# set a global policy: layers compute in float16 but keep float32 variables
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

Method 2 doesn't seem to work with the transformers library or with keras_bert. See this open transformers issue, for example.

Method 1 works and does yield a speedup but may or may not be worth it for you.

Since mixed precision support in TF 2 is still somewhat experimental and a little brittle, I've postponed adding direct support for it in ktrain for the time being. But you can still experiment on your own using the instructions above.

However, if you're having trouble training BERT on your system, I would try DistilBERT instead of using mixed precision with BERT, as DistilBERT is smaller and faster and, in my experience, has nearly the same performance as BERT:

**DistilBERT example:**

# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
test_b = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)
(x_train, y_train) = (train_b.data, train_b.target)
(x_test, y_test) = (test_b.data, test_b.target)

# build, train, and validate model (Transformer is wrapper around transformers library)
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 4)
learner.validate(class_names=t.get_classes()) # class_names must be string values

# Output from learner.validate()
#                        precision    recall  f1-score   support
#
#           alt.atheism       0.92      0.93      0.93       319
#         comp.graphics       0.97      0.97      0.97       389
#               sci.med       0.97      0.95      0.96       396
#soc.religion.christian       0.96      0.96      0.96       398
#
#              accuracy                           0.96      1502
#             macro avg       0.95      0.96      0.95      1502
#          weighted avg       0.96      0.96      0.96      1502
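
In case it's useful, here is a quick sketch of running predictions with the trained model afterwards (using ktrain's predictor API; the input sentence is just a made-up example):

# bundle the model with its preprocessing so raw strings can be classified
predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.predict('Jesus Christ is the central figure of Christianity.')
# returns the predicted class name, e.g. 'soc.religion.christian'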

@devsharma8555
Author

Thank you so much for this detailed response!

It most definitely solved the problem.

@henrique

As said above, thanks for the detailed info @amaiya.

I'm running xlm-roberta-large on v3-8 TPUs, and Method 1 actually seems to use more memory, i.e., the maximum batch size I can fit is smaller than without the three lines above.
From my experience with PyTorch apex.amp, we should be able to almost double the batch size when using AMP (even though I believe AMP has to keep an extra copy of the model weights).

Could anyone get it working properly on TPUs?
Cheers

@amaiya
Owner

amaiya commented Apr 27, 2020

I haven't tried mixed precision on TPUs, but this TensorFlow page has information on it, including TPU-specific details.
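
For reference, the TPU-specific recommendation there is to use bfloat16 rather than float16. A sketch using the same experimental API as Method 2 (untested on TPUs):

from tensorflow.keras.mixed_precision import experimental as mixed_precision

# bfloat16 keeps float32's exponent range, so unlike float16 it
# doesn't need loss scaling; this is the policy recommended for TPUs
policy = mixed_precision.Policy('mixed_bfloat16')
mixed_precision.set_policy(policy)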
