This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Large training batches on limited GPU hardware #754

Merged
merged 14 commits into tensorflow:master on Jun 5, 2018

Conversation

@fstahlberg (Contributor) commented Apr 29, 2018

This PR adds a LargebatchAdam optimizer, which accumulates gradients over n batches and applies the Adam learning rule every n batches on the accumulated gradients. This makes it possible to arbitrarily increase the effective batch size / number of GPUs at the cost of more training iterations. The technique is useful when the number of physical GPUs is limited or GPU memory does not allow increasing the batch size any further. Large-batch / multi-GPU training is often important for Transformer training, as reported in #444. See Saunders et al., 2018 for more details.

See transformer_base_fake_gpu8 hparams set as an example.

This is a new version of the PR #750 which fixes issues with the Google CLA.
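
For readers new to the technique, here is a minimal NumPy sketch of the accumulation idea described above. The callables and names are illustrative placeholders, not the optimizer code in this PR, and it assumes the accumulated gradients are averaged before the Adam update.

import numpy as np

def train_with_accumulation(compute_gradients, apply_adam_update, batches, n):
  """Simulate an n-times larger batch by accumulating gradients over n batches.

  `compute_gradients` and `apply_adam_update` are hypothetical callables that
  stand in for the model's gradient computation and the Adam learning rule.
  """
  accumulated = None
  for step, batch in enumerate(batches, start=1):
    grads = compute_gradients(batch)
    if accumulated is None:
      accumulated = [np.zeros_like(g) for g in grads]
    for acc, g in zip(accumulated, grads):
      acc += g
    if step % n == 0:
      # One Adam update on the averaged gradients of the last n batches.
      apply_adam_update([acc / n for acc in accumulated])
      accumulated = None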

@rsepassi (Contributor) left a comment:

Thanks @fstahlberg! Very cool optimizer.

I think something of this size also warrants a test.

@@ -1081,6 +1081,17 @@ def transformer_base_single_gpu():
return hparams


@registry.register_hparams
def transformer_base_fake_gpu8():
"""HParams for simulating 8 GPU transformer base model training
Contributor:

Update docstring:
HParams for simulating 8 GPUs with LargebatchAdam optimizer.

# Dependency imports

import tensorflow as tf
from tensorflow.python.eager import context
Contributor:

Use tf.contrib.eager.in_eager_mode()


import tensorflow as tf
from tensorflow.python.eager import context
from tensorflow.python.framework import ops
Contributor:

Instead of these specific imports, can you switch to accessing through tf?

Contributor Author:

That was a legacy from using the AdamOptimizer as blueprint. Fixed now.



class LargebatchAdamOptimizer(tf.contrib.opt.LazyAdamOptimizer):
"""Adam with delayed SGD updates."""
Contributor:

Adam with SGD updates every n steps with accumulated gradients.

self._n = n # Call Adam optimizer every n batches with accumulated grads
self._n_t = None # n as tensor

def _create_slots(self, var_list):
Contributor:

First call super, then just add the gradient accumulators.
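
A sketch of what the suggested structure could look like (the slot name "grad_acc" is an assumption; only the accumulator slot is added after calling the parent implementation):

def _create_slots(self, var_list):
  """Create the standard Adam slots via the parent class, then add one
  gradient accumulator slot per variable (slot name is illustrative)."""
  super(MultistepAdamOptimizer, self)._create_slots(var_list)
  for var in var_list:
    self._zeros_slot(var, "grad_acc", self._name)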

use_locking=self._use_locking)
return control_flow_ops.group(update_beta1, update_beta2)
maybe_update_beta = tf.cond(tf.equal(iter_, 0),
lambda: update_beta_op(),
Contributor:

update_beta_op

return control_flow_ops.group(update_beta1, update_beta2)
maybe_update_beta = tf.cond(tf.equal(iter_, 0),
lambda: update_beta_op(),
lambda: tf.no_op())
Contributor:

tf.no_op
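
Taken together, these two suggestions amount to passing the callables directly instead of wrapping them in lambdas, roughly:

# Suggested form (identifiers taken from the diff excerpt above):
maybe_update_beta = tf.cond(tf.equal(iter_, 0), update_beta_op, tf.no_op)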

fake_gpu_multiplier = tf.constant(hparams.fake_gpu_multiplier,
dtype=tf.float32)
step = step / fake_gpu_multiplier
tf.logging.info("Scaling down learning rate decay by "
Contributor:

Divided global step by fake_gpu_multiplier=%d

"""
hparams = transformer_base()
hparams.optimizer = "LargebatchAdam"
hparams.add_hparam("fake_gpu_multiplier", 8)
Contributor:

add this to common_hparams.py basic_params1 instead with a default value of None

Contributor Author:

Done. I renamed it to optimizer_multistep_accumulate_steps for consistency with the optimizer_ada[m|factor]_* options


See [Saunders et al., 2018](https://arxiv.org/abs/1805.00456) for details.
"""

Contributor:

I'd like to find a different name for the file and the class.

How about multistep_optimizer and MultistepAdamOptimizer?

@vince62s (Contributor) commented May 6, 2018

@fstahlberg I confirm this is great; we implemented it in opennmt-py and it works fine.
However, out of curiosity, did you try it with an accumulation of 8?
I was able to run with 4 without any problem (for a batch size of 4096), but I was not able to fit 8 on a GTX 1080 Ti.
Thanks.

@fstahlberg (Contributor Author):

@vince62s Yes, I tried it with "delay factor" 8 - there should be no difference regarding GPU memory between 4 and 8. Did you use the same code?

@rsepassi thanks for the review, I'll work on it in the next few days.

@rsepassi (Contributor) commented May 7, 2018 via email

@fstahlberg (Contributor Author) left a comment:

Pushed all the code changes. Tests will follow soon-ish.

It is now called MultistepAdamOptimizer and optimizer_multistep_accumulate_steps.
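
A minimal hparams override along these lines, mirroring the transformer_base_fake_gpu8 set shown earlier (a sketch; the hparams-set name is illustrative and it assumes the renamed optimizer is registered as "MultistepAdam"):

from tensor2tensor.models.transformer import transformer_base
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_base_multistep8():
  """Simulate 8 GPUs by accumulating gradients over 8 optimizer steps."""
  hparams = transformer_base()
  hparams.optimizer = "MultistepAdam"
  hparams.optimizer_multistep_accumulate_steps = 8
  return hparams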

"""
hparams = transformer_base()
hparams.optimizer = "LargebatchAdam"
hparams.add_hparam("fake_gpu_multiplier", 8)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I renamed it to optimizer_multistep_accumulate_steps for consistency with the optimizer_ada[m|factor]_* options


import tensorflow as tf
from tensorflow.python.eager import context
from tensorflow.python.framework import ops
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a legacy from using the AdamOptimizer as blueprint. Fixed now.

super(LargebatchAdamOptimizer, self)._apply_sparse_shared, grad, var,
indices, scatter_add)

def _apply_sparse(self, grad, var):
Contributor Author:

Are you sure I can do that? The optimizer works even for sparse tensors, just not as efficiently as it could, since I simply convert to dense and use _apply_dense.

@fstahlberg (Contributor Author):

@rsepassi I've added a unit test, but I had to put an awkward version check to pass all tests since it doesn't work with TF < 1.6.

@rsepassi (Contributor) left a comment:

Very sorry for the long delay; the NIPS deadline had us quite busy. Looks good, though I think the test should be modified.

"""Adam with SGD updates every n steps with accumulated gradients."""

def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
use_locking=False, name="Adam", n=2):
Contributor:

let's make the default 1

Contributor Author:

done

"""Adjust global step if a multi-step optimizer is used."""
step = tf.to_float(tf.train.get_or_create_global_step())
multiplier = hparams.optimizer_multistep_accumulate_steps
if multiplier is not None and multiplier > 1:
Contributor:

if multiplier:

Contributor Author:

done

step = tf.to_float(tf.train.get_or_create_global_step())
multiplier = hparams.optimizer_multistep_accumulate_steps
if multiplier is not None and multiplier > 1:
step = step / tf.constant(multiplier, dtype=tf.float32)
Contributor:

tf.to_float(step) / tf.to_float(multiplier)

Contributor Author:

done (step is already a float tensor)
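
Putting the diff excerpt and the suggestions above together, the step adjustment ends up roughly as follows (a sketch, not the exact merged code; the function name is illustrative):

def _simulated_global_step(hparams):
  """Scale the global step so learning-rate schedules count optimizer updates
  rather than accumulation steps."""
  step = tf.to_float(tf.train.get_or_create_global_step())
  multiplier = hparams.optimizer_multistep_accumulate_steps
  if multiplier:  # None or 0 means no accumulation; dividing by 1 is a no-op
    step /= tf.to_float(multiplier)
  return step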

@@ -0,0 +1,123 @@
# coding=utf-8
Contributor:

Thank you for adding this test, but I think the test should be a bit different and hopefully simpler:

Compare 2 things:

  1. AdamOptimizer with batch size 32 for 1 step
  2. MultistepAdamOptimizer with batch size 8 for 4 steps with n=4

We should see that the updates are identical (i.e. the variables end up in the same place)
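
A minimal sketch of such an equivalence check in TF 1.x style (the data shapes, loss, and import path are illustrative assumptions, and it assumes the multistep optimizer effectively averages the accumulated gradients, as the author describes below):

import numpy as np
import tensorflow as tf

from tensor2tensor.utils.multistep_optimizer import MultistepAdamOptimizer


def run_optimizer(optimizer, batches):
  """Fit a tiny least-squares model on the given batches; return the final weights."""
  tf.reset_default_graph()
  w = tf.get_variable("w", initializer=tf.zeros([10, 1]))
  x = tf.placeholder(tf.float32, [None, 10])
  y = tf.placeholder(tf.float32, [None, 1])
  loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
  train_op = optimizer.minimize(loss)
  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for bx, by in batches:
      sess.run(train_op, feed_dict={x: bx, y: by})
    return sess.run(w)


np.random.seed(0)
features = np.random.randn(32, 10).astype(np.float32)
targets = np.random.randn(32, 1).astype(np.float32)

# 1) Plain Adam, one step on the full batch of 32.
w_large = run_optimizer(tf.train.AdamOptimizer(), [(features, targets)])
# 2) MultistepAdam with n=4, four steps on batches of 8.
small_batches = [(features[i:i + 8], targets[i:i + 8]) for i in range(0, 32, 8)]
w_multi = run_optimizer(MultistepAdamOptimizer(n=4), small_batches)

# The variables should end up in (numerically) the same place.
np.testing.assert_allclose(w_large, w_multi, rtol=1e-5)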

Contributor Author:

Hm, am I not already doing something like this? The difference is just that I compare numbers of updates rather than batch sizes. For example, I compare the variables after

  1. Adam: t steps with averaged gradients over n steps
  2. MultistepAdam: t*n steps

for n=1..4 and t=1..3. Adam is implemented in NumPy to avoid introducing dependencies from this test class on other parts of the TF code (as in the original Adam test).

Contributor:

So the original Adam test uses NumPy because it's actually checking the mathematical accuracy of the implementation. Here we want to ensure that the MultistepAdamOptimizer is a drop-in replacement for AdamOptimizer and simulates a larger batch size, so I think the clearest and most useful test would check exactly that. Do you agree?

Contributor Author:

Alright, no problem, I'll change it.

@googlebot added the "cla: yes" (PR author has signed CLA) label on Jun 3, 2018
@fstahlberg (Contributor Author):

Test updated

@rsepassi (Contributor) commented Jun 5, 2018

Looks great! Thanks so much for this contribution @fstahlberg. Really good work.

@rsepassi merged commit 64e1df1 into tensorflow:master on Jun 5, 2018
tensorflow-copybara pushed a commit that referenced this pull request on Jun 5, 2018 (PiperOrigin-RevId: 199354554).
whr94621 pushed a commit to whr94621/tensor2tensor that referenced this pull request on Jun 12, 2018: "…dware (tensorflow#754)" (simulates n times more GPUs at the cost of n times more training iterations).
whr94621 pushed a commit to whr94621/tensor2tensor that referenced this pull request on Jun 12, 2018 (PiperOrigin-RevId: 199354554).
@@ -64,6 +64,8 @@ def basic_params1():
optimizer_adafactor_memory_exponent=0.8,
optimizer_adafactor_clipping_threshold=1.0,
optimizer_adafactor_multiply_by_parameter_scale=True,
# Number of accumulating steps for multi step optimizers.
optimizer_multistep_accumulate_steps=None,
Contributor:

I think the default value should be 1 instead of None. Otherwise a "NoneType takes no arguments" error occurs when parsing the value from the --hparams flag.

@nxphi47 commented Sep 29, 2018

Hello, with these hparams, how many --train_steps should we set in the training script to exactly replicate the 100000 steps on 8 real GPUs from the Transformer paper?

Is it still --train_steps=100000 or --train_steps=800000?

@fstahlberg (Contributor Author):

@nxphi47 It is --train_steps=800000

@XiaoqingNLP (Contributor):

@fstahlberg Thank you so much. I would also like to know: how is the performance?

@fstahlberg (Contributor Author):

@zxqchat For example, if you set optimizer_multistep_accumulate_steps to 8 and multiply train_steps by 8, you get the same performance as with 8 times more GPUs.

Labels: cla: yes (PR author has signed CLA)
7 participants