Fix C++ data parallel #20910
Conversation
Before this commit, C++ data parallel used Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, gradients will not be aggregated into the original module. This commit fixes the bug by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.
test/cpp/api/parallel.cpp
@@ -17,6 +20,13 @@
using namespace torch::autograd;
using namespace torch::nn;

template <typename T>
bool almost_equal(torch::Tensor left, torch::Tensor right, T tolerance = 1e-4) {
You can consider using allclose for this if it helps.
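For reference, here is a sketch of what that helper could look like if it delegated to torch::allclose (illustrative only, not the code that landed in the PR; the way tolerance is mapped onto rtol/atol here is an assumption):

```cpp
#include <torch/torch.h>

// Sketch: let torch::allclose do the comparison; it checks
// |left - right| <= atol + rtol * |right| elementwise.
bool almost_equal(const torch::Tensor& left,
                  const torch::Tensor& right,
                  double tolerance = 1e-4) {
  return torch::allclose(left, right, /*rtol=*/tolerance, /*atol=*/tolerance);
}
```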
Just replicating here what I told @mrshenli in person: if the problem is that the two copies of Tensor in Module can get out of sync in C++ but not in Python, the correct solution is to make sure they stay in sync, one way or another. Taking a closer look now.
Some clarification about how replicate.py works:
This is because there isn't actually a …
After reminding myself about how parameter registration works in C++, I now agree that it is impossible to do assignment after clone. So I think that fixing up the cloned tensor directly is the right way to go.
This is completely unrelated to this PR, but out of curiosity I checked the C++ DP code. It seems that C++ DP is quite a bit slower than the Python one because of pytorch/torch/csrc/api/include/torch/nn/cloneable.h, lines 34 to 75 (at 8cde4c4).
Moreover, it is also incorrect with BNs (and other things that rely on in-place changes to buffers/parameters).
namespace {

// Note [Replicating Modules]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~
Thank you for a nice note, it is much appreciated!
"All inputs of ReduceAdd must have the same dtype, but got ", | ||
input.dtype(), " and ", inputs[0].dtype()); | ||
|
||
// TODO: use nccl reduce |
Does this TODO mean that C++ data parallel is less efficient than Python data parallel at the moment? Seems like worth a big warning in that case.
Yes, the current version would be slower. Let me implement nccl reduce and remove the TODO here.
// Configure gradient edges to point from replica parameters to original
// module parameters. See [Replicating Modules]
std::vector<std::shared_ptr<Module>> untyped_replicas(
    replicas.begin(), replicas.end());
I don't see why you have to allocate a vector here. Can't you just pass the iterators to the function?
I guess this is because of the way you handle children: you concatenate the replicas of all sibling children (not just the current child), and then you don't have iterators. But that seems like a pretty unusual way to write the recursion: normally you'd just recurse on child modules. Is there a reason it was done this way?
I remember I ran into problems when collecting child modules on line 126. Let me see if I can drop this untyped_replicas.
I think I can recurse on child modules by creating a map from *parameter to grad_fn, and retrieve that grad_fn in every recursion.
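A rough sketch of that idea (not the PR's actual implementation; the helper and type names here are made up for illustration): key a map by the original parameter's TensorImpl so that each level of the child-module recursion can look up the grad_fn that routes gradients back to the corresponding original parameter.

```cpp
#include <torch/torch.h>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical helper: map each original parameter to the grad_fn produced
// when it was broadcast, so child-module recursion can fetch it later.
using GradFnMap =
    std::unordered_map<at::TensorImpl*, std::shared_ptr<torch::autograd::Node>>;

GradFnMap build_grad_fn_map(
    const std::vector<torch::Tensor>& original_params,
    const std::vector<torch::Tensor>& broadcast_params) {
  GradFnMap grad_fns;
  for (size_t i = 0; i < original_params.size(); ++i) {
    // The broadcast output's grad_fn is the autograd node whose backward
    // pass accumulates gradients into the original parameter.
    grad_fns[original_params[i].unsafeGetTensorImpl()] =
        broadcast_params[i].grad_fn();
  }
  return grad_fns;
}
```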
//
// ReduceAdd can ONLY be used during the backward pass of data parallel. Forward
// pass cannot use this function as it does not set up gradient function and
// history at all. Do NOT try to use ReduceAdd for any other purposes.
Do we support double backwards through replicate in Python? I don't see any indication in issues that we don't...
Haven't tried yet. Let me add a test.
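For context, a minimal sketch of what exercising double backward looks like with the C++ autograd API (this is not the test added in this PR, and it does not go through replicate; it only shows the create_graph mechanics such a test would rely on):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor x = torch::randn({4, 4}, torch::requires_grad());
  torch::Tensor y = (x * x).sum();

  // First-order gradients with create_graph=true keep the graph around so
  // the gradients themselves can be differentiated.
  auto first = torch::autograd::grad(
      {y}, {x}, /*grad_outputs=*/{}, /*retain_graph=*/true,
      /*create_graph=*/true);

  // Differentiating the first-order gradient is the double-backward step.
  auto second = torch::autograd::grad({first[0].sum()}, {x});
  std::cout << second[0] << std::endl;
}
```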
If you want to land this to fix an urgent bug, I suppose I can be convinced to land it in its current state (I agree it's an improvement), but what I am confused about is why you had to implement ReduceAdd from scratch. Ostensibly, the Python implementation of DataParallel sets up some autograd backwards function, and to me, it seems like you ought to use whatever that backwards function was, and not make up a new one (this would also fix your TODO, if we are using nccl reduce in Python).
Hang on, before you do that, read my overall comment please :)
Excerpts from mrshenli's message of 2019-05-29 08:57:48 -0700:

mrshenli commented on this pull request.

+  autograd::variable_list apply(autograd::variable_list&& inputs) override {
+    TORCH_CHECK(!compute_requires_grad(inputs),
+        "ReduceAdd can only be used during the backward pass of data parallel.");
+
+    Tensor output = torch::zeros_like(inputs[0], {destination_device_});
+
+    for (auto& input: inputs) {
+      TORCH_CHECK(input.sizes() == inputs[0].sizes(),
+          "All inputs of ReduceAdd must have the same size, but got ",
+          input.sizes(), " and ", inputs[0].sizes());
+
+      TORCH_CHECK(input.dtype() == inputs[0].dtype(),
+          "All inputs of ReduceAdd must have the same dtype, but got ",
+          input.dtype(), " and ", inputs[0].dtype());
+
+      // TODO: use nccl reduce

Yes, the current version would be slower. Let me implement nccl reduce and remove the TODO here.
Python's …
Yes, that seems like the right way to do it. If you want to apply this hotfix on master, I'll let it through, but ideally this should just all live in C++.
Thanks. Let me first address the smaller revision requests and create issues to track the bigger ones. Will do that later today or tomorrow; I need to finish up something else now.
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The latest commit addressed @vishwakftw's comments on using allclose. TODOs are in #21144. @ezyang is the current solution acceptable?
@pytorchbot rebase this please
Let's go ahead and land this for now. Looking forward to the follow-up where we line up the Python and C++ versions exactly.
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot rebase this please
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The failed test passed on rerun.
Summary: As pointed out by SsnL in #20910, when clone destination is different from the module's device, `Cloneable` currently calls `clone()` and then `to()` on every parameter and buffer, where the first clone is unnecessary.
Pull Request resolved: #20995
Differential Revision: D15517353
Pulled By: mrshenli
fbshipit-source-id: 6b6dc01560540a63845663f863dea0a948021fa5
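As a rough illustration of the point in that summary (this is not the actual Cloneable change, and copy_to_device is a made-up helper name), a single to() with copy=true already produces an independent tensor on the target device, so an extra clone() is only needed when the devices match:

```cpp
#include <torch/torch.h>

// Illustrative helper: produce an independent copy of `param` on `device`
// without an intermediate clone when a device transfer happens anyway.
torch::Tensor copy_to_device(const torch::Tensor& param, torch::Device device) {
  if (param.device() == device) {
    // Same device: clone() is what gives us separate storage.
    return param.clone();
  }
  // Different device: to() with copy=true already allocates fresh storage.
  return param.to(device, /*non_blocking=*/false, /*copy=*/true);
}
```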
Hi @mrshenli! I have a struct ConvModel : torch::nn::Module { ... } and I use torch::nn::parallel::replicate, but I get the error "clone() has not been implemented for ConvModel. Subclass torch::nn::Cloneable instead of torch::nn::Module to inherit the ability to clone." If I try to inherit from torch::nn::Cloneable, I get another error. Can you give me some suggestions about this? @mrshenli
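A minimal sketch of the pattern the error message asks for (the layers in this ConvModel are made up for illustration): derive from torch::nn::Cloneable<ConvModel> and build/register every submodule inside reset(), which clone() uses to reconstruct the module. The follow-up error often comes from not implementing reset() (it is pure virtual in Cloneable) or from not passing the derived class as the template argument.

```cpp
#include <torch/torch.h>

// Illustrative module: Cloneable<ConvModel> provides the clone() that
// replicate() needs; reset() must (re)create and register every submodule
// and parameter so clone() can rebuild the module before copying values.
struct ConvModel : torch::nn::Cloneable<ConvModel> {
  ConvModel() {
    reset();
  }

  void reset() override {
    conv = register_module(
        "conv",
        torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 16, /*kernel_size=*/3)));
    fc = register_module("fc", torch::nn::Linear(16, 10));
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(conv->forward(x));
    x = torch::adaptive_avg_pool2d(x, {1, 1}).flatten(1);
    return fc->forward(x);
  }

  torch::nn::Conv2d conv{nullptr};
  torch::nn::Linear fc{nullptr};
};
```

With a module written this way, passing std::make_shared<ConvModel>() and a device list to torch::nn::parallel::replicate should get past the clone() error, since clone() now knows how to rebuild the module on each destination device.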
Fixes #19540
CC @nmerrill67
C++ data parallel was using Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, the gradient will not be aggregated into the original module. This commit fixes the problem by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.
Failed Attempt
Initially I tried implementing what we did in replicate.py, which uses the Broadcast autograd function to broadcast every parameter in the original module to all destination devices, and then assigns the broadcast results to each replica's _parameters dict.

This works in Python because derived module member fields (e.g., Linear.weight) and base module _parameters entries (e.g., Linear._parameters['weight']) reference the same parameter instance; assigning one of them applies to both. However, in C++, even though I can modify Module's parameters_ values and gradient edges to point to the broadcast source, I cannot touch the weight and bias member fields in Linear, because replicate cannot (and should not) add special-case handlers for every different module (see Linear.h, .cpp). Although they initially point to the same TensorImpl instance, after assigning to Module.parameters_['weight'] it will be different from Linear.weight.
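A small, self-contained illustration of that handle-rebinding behavior using only public APIs (this snippet is not from the PR; the internal parameters_ map behaves analogously to the named_parameters() copy used here):

```cpp
#include <torch/torch.h>

int main() {
  torch::nn::Linear linear(3, 4);

  // The Tensor handles returned by named_parameters() initially share the
  // same TensorImpl as the module's member field.
  auto params = linear->named_parameters();
  TORCH_CHECK(params["weight"].unsafeGetTensorImpl() ==
              linear->weight.unsafeGetTensorImpl());

  // Rebinding one handle does not rebind the other: afterwards the dict entry
  // and Linear::weight refer to different TensorImpls, which is why assigning
  // broadcast results into a parameter dict cannot update Linear's own field.
  params["weight"] = torch::zeros({4, 3});
  TORCH_CHECK(params["weight"].unsafeGetTensorImpl() !=
              linear->weight.unsafeGetTensorImpl());
}
```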
Solution Options
@gchanan and I had several discussions on this issue and figured out two solutions to this problem.
Option One [implemented in this PR]
Replicate the module in two steps:

1. Call Module.clone() to create a module replica on every destination device.
2. Manually set gradient edges from every parameter in every replica to the same parameter in the original module.

Option Two

Implement a Replicatable class (similar to Cloneable), and make it a friend class of Module. For more details see Note [Replicating Modules] in the code change.

Why not Cloneable?

I am inclined to go with option one, because replicate will only be used on data parallel. I feel it would be overkill if we had to change all existing module implementations due to a data parallel requirement.