Support accumulating DDP grads using a context manager #21736

Closed
wants to merge 3 commits

Conversation

mrshenli
Contributor

The first attempt and further discussion are available in #19577.

Goal

Allow toggling DDP gradient synchronization across iterations. With this feature, users can accumulate grads in module variables and only kick off the expensive gradient synchronization every few iterations.

Concerns

Our first attempt in #19577 tried to do it using a variable or a function, but @apaszke made a good point that this would be error-prone, and he favors a context manager instead.

Proposed Solution

Instead of providing an accumulate_grads variable/function/context, we provide a DistributedDataParallel.no_sync() context manager. It does exactly what the name suggests, i.e., it disables DDP gradient synchronization within the context. Note that accumulate_grads means no_sync plus no optimizer step, where the latter is not controlled by DDP.

It is true that users need to call another model(input).backward() after exiting the context, and this is indeed more verbose. But I think that is OK, as one major concern in the previous discussion was to prevent users from running into errors without knowing it. This API reaffirms the expected behavior and does not interfere with other use cases when accumulating grads is not required.

The application would then look like:

```python
with ddp.no_sync():
  for input in inputs:
    ddp(input).backward()

ddp(one_more_input).backward()
optimizer.step()
```
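
For the stated goal of synchronizing only every few iterations, the loop could also be arranged like the sketch below (accumulation_steps, inputs, ddp, and optimizer are assumed to be defined by the caller):

```python
for step, input in enumerate(inputs):
    if (step + 1) % accumulation_steps != 0:
        # Accumulate gradients locally; skip the all-reduce for this iteration.
        with ddp.no_sync():
            ddp(input).backward()
    else:
        # Synchronization is enabled again here, so this backward all-reduces
        # the gradients accumulated over the whole group of iterations.
        ddp(input).backward()
        optimizer.step()
        optimizer.zero_grad()
```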

@chenyangyu1988 @myleott

@mrshenli mrshenli requested review from pietern and apaszke June 13, 2019 15:30
@pytorchbot pytorchbot added the oncall: distributed and module: nn labels Jun 13, 2019
@mrshenli mrshenli mentioned this pull request Jun 13, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli
Contributor Author

The failed test passed on rerun.

Contributor

@pietern pietern left a comment

LGTM in general, but with 2 comments:

  1. Would it be possible to consolidate some of the duplication in the test code into some kind of helper class, or a set of helper functions? There is a lot of duplication between these tests and I think there are at least half a dozen more similar to these.
  2. I think there is interaction between no_sync and the _sync_params function that could cause unintended results. For example, the gradients of model replicas are detached and zeroed in every iteration, whereas they should also accumulate. Then there is the question of the batch normalization buffer synchronization in every call to forward... not sure what to do about that one.

@mrshenli
Contributor Author

  1. Would it be possible to consolidate some of the duplication in the test code into some kind of helper class, or a set of helper functions? There is a lot of duplication between these tests and I think there are at least half a dozen more similar to these.

Yes, let me try

  2. I think there is interaction between no_sync and the _sync_params function that could cause unintended results. For example, the gradients of model replicas are detached and zeroed in every iteration, whereas they should also accumulate. Then there is the question of the batch normalization buffer synchronization in every call to forward... not sure what to do about that one.

Sorry, I forgot about this. How about the following two options:

  1. Only allow creating a no_sync context if the DDP does not contain module replicas, i.e., _sync_params becomes a no-op.

  2. _sync_params should only be called if the previous iteration invoked prepare_for_backward. So, maybe we can add an additional variable recording whether prepare_for_backward was invoked last time? (See the sketch after this list.)
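
A rough sketch of what option 2 could look like (not the actual DistributedDataParallel code; the class and _run_module are stand-ins, and require_forward_param_sync is the extra flag that appears in the diff further down):

```python
class _NoSyncSketch(object):
    """Illustrative only: gate parameter/buffer sync on whether the
    previous iteration ran a synchronized backward."""

    def __init__(self):
        self.require_backward_grad_sync = True   # flipped off inside no_sync()
        self.require_forward_param_sync = True   # did the last backward sync grads?

    def forward(self, *inputs, **kwargs):
        if self.require_forward_param_sync:
            self._sync_params()                  # broadcast params/buffers to replicas
        output = self._run_module(*inputs, **kwargs)
        if self.require_backward_grad_sync:
            self.require_forward_param_sync = True
            self._prepare_for_backward(output)   # register grad all-reduce hooks
        else:
            self.require_forward_param_sync = False
        return output

    # Placeholders so the sketch stands alone.
    def _sync_params(self): pass
    def _prepare_for_backward(self, output): pass
    def _run_module(self, *inputs, **kwargs): return inputs
```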

Contributor

@pietern pietern left a comment

Looks good to me!

I think the new tests didn't fail because they don't require any buffer synchronization. If they did, they would have yielded different results without the require_forward_param_sync option.

@mrshenli
Contributor Author

I think the new tests didn't fail because they don't require any buffer synchronization. If they did, they would have yielded different results without the require_forward_param_sync option.

@pietern do we want to raise an exception if no_sync is called on a model with buffers?

Contributor

@facebook-github-bot facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli
Contributor Author

@pytorchbot rebase this please

Contributor

@facebook-github-bot facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mrshenli merged this pull request in 08facca.

"""
old_require_backward_grad_sync = self.require_backward_grad_sync
self.require_backward_grad_sync = False
yield
Contributor

This should be in a try ... finally block, because in case of an exception you will fail to restore the flag!
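
For reference, a minimal sketch of the suggested shape, assuming no_sync() is written as a contextlib.contextmanager method on DistributedDataParallel (illustrative, not the code as merged):

```python
from contextlib import contextmanager

@contextmanager
def no_sync(self):
    old_require_backward_grad_sync = self.require_backward_grad_sync
    self.require_backward_grad_sync = False
    try:
        yield
    finally:
        # Restore the flag even if the body of the with-block raises.
        self.require_backward_grad_sync = old_require_backward_grad_sync
```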

@@ -272,6 +273,8 @@ def __init__(self, module, device_ids=None,
```python
self.module = module
self.broadcast_buffers = broadcast_buffers
self.find_unused_parameters = find_unused_parameters
self.require_backward_grad_sync = True
self.require_forward_param_sync = True
```
Contributor

I'm not sure if DistributedDataParallel is picklable, but if it is, then you should add a __setstate__ that adds those two attributes, because otherwise people who load older checkpoints will get missing attribute errors.
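
A minimal sketch of that suggestion (the class here is a stand-in, not the real DistributedDataParallel; the attribute names match the diff above):

```python
class _DDPStateSketch(object):
    def __setstate__(self, state):
        self.__dict__.update(state)
        # Checkpoints pickled before these attributes existed would otherwise
        # hit missing-attribute errors; fall back to the defaults instead.
        self.__dict__.setdefault('require_forward_param_sync', True)
        self.__dict__.setdefault('require_backward_grad_sync', True)
```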

apaszke added a commit that referenced this pull request Jun 21, 2019
@apaszke
Contributor

apaszke commented Jun 21, 2019

I've fixed the problems in a PR that's referenced above.

iotamudelta pushed a commit to ROCm/pytorch that referenced this pull request Jun 21, 2019
Pull Request resolved: pytorch#21736

Differential Revision: D15805215

Pulled By: mrshenli

fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
facebook-github-bot pushed a commit that referenced this pull request Jun 24, 2019
Summary:
cc mrshenli
Pull Request resolved: #22074

Differential Revision: D15965376

Pulled By: mrshenli

fbshipit-source-id: 50ff96de6390817d8ea52c04322c6bee3d649b32