
Is there a way to train/fine-tune with fp-16 flag? #126

Closed
devsharma8555 opened this issue Apr 19, 2020 · 4 comments

@devsharma8555

I am trying to train a BERT text classifier for a custom classification task. I have an RTX 2070 to accelerate the workflow.

It often runs out of memory, even with small batch sizes. Is there a way to leverage fp16 support for training?

It would be really helpful and allow the model to train better.
Also, I love your work! Thank you for creating this library.

@amaiya
Owner

amaiya commented Apr 20, 2020

Thanks a lot for your comments.

It looks like there are at least two ways to train with mixed precision.

Method 1:
Add this to the top of your script:

import tensorflow as tf

# enable XLA compilation and TF's automatic mixed precision graph rewrite
tf.config.optimizer.set_jit(True)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

Then, wrap an optimizer in LossScaleOptimizer:

opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")

... where opt is an optimizer like Adam.
Note that you must recompile whatever model you're using with the new optimizer.
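
For example, here is a minimal end-to-end sketch of Method 1 (assuming a Keras model named model that uses categorical cross-entropy; adjust the loss and learning rate to your task):

import tensorflow as tf

tf.config.optimizer.set_jit(True)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})

# wrap the base optimizer so gradients are dynamically loss-scaled,
# which prevents small fp16 gradients from underflowing to zero
opt = tf.keras.optimizers.Adam(learning_rate=5e-5)
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")

# recompile so the wrapped optimizer is actually used for training
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])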

Method 2:
As of TF 2.1, the second way is to add this to the top of your script:

# set a global policy: layers compute in float16 but keep float32 variables
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

Method 2 doesn't seem to work with the transformers library or with keras_bert. See this open transformers issue, for example.

Method 1 works and does yield a speedup but may or may not be worth it for you.

Since mixed precision support in TF 2 is still somewhat experimental and a little brittle, I've postponed adding direct support for it in ktrain for the time being. But you can still experiment on your own using the instructions above.

However, if you're having trouble training BERT on your system, I would try DistilBERT instead of using mixed precision with BERT, as DistilBERT is smaller and faster and, in my experience, has nearly the same performance as BERT:

**DistilBERT example:**

# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
test_b = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)
(x_train, y_train) = (train_b.data, train_b.target)
(x_test, y_test) = (test_b.data, test_b.target)

# build, train, and validate model (Transformer is wrapper around transformers library)
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 4)
learner.validate(class_names=t.get_classes()) # class_names must be string values

# Output from learner.validate()
#                        precision    recall  f1-score   support
#
#           alt.atheism       0.92      0.93      0.93       319
#         comp.graphics       0.97      0.97      0.97       389
#               sci.med       0.97      0.95      0.96       396
#soc.religion.christian       0.96      0.96      0.96       398
#
#              accuracy                           0.96      1502
#             macro avg       0.95      0.96      0.95      1502
#          weighted avg       0.96      0.96      0.96      1502
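
In case it's useful, here is a quick sketch of running predictions with the trained model afterwards (using ktrain's predictor API; the input sentence is just a made-up example):

# bundle the model with its preprocessing so raw strings can be classified
predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.predict('Jesus Christ is the central figure of Christianity.')
# returns the predicted class name, e.g. 'soc.religion.christian'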

@devsharma8555
Author

Thank you so much for this detailed response!

It most definitely solved the problem.

@henrique

As said above, thanks for the detailed info @amaiya.

I'm running xlm-roberta-large on v3-8 TPUs, and Method 1 actually seems to use more memory, i.e., the maximum batch size I can fit is smaller than without the three lines above.
From my experience with PyTorch apex.amp, we should be able to almost double the batch size when using AMP (even though I believe AMP has to keep an extra copy of the model weights).

Could anyone get it working properly on TPUs?
Cheers

@amaiya
Owner

amaiya commented Apr 27, 2020

I haven't tried mixed precision on TPUs, but this TensorFlow page has information on it, including TPU-specific details.
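
For reference, the TPU-specific recommendation there is to use bfloat16 rather than float16. A sketch using the same experimental API as Method 2 (untested on TPUs):

from tensorflow.keras.mixed_precision import experimental as mixed_precision

# bfloat16 keeps float32's exponent range, so unlike float16 it
# doesn't need loss scaling; this is the policy recommended for TPUs
policy = mixed_precision.Policy('mixed_bfloat16')
mixed_precision.set_policy(policy)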
