Fix C++ data parallel #20910
Conversation
Before this commit, C++ data parallel used Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, gradients will not be aggregated into the original module. This commit fixes the bug by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.
test/cpp/api/parallel.cpp
@@ -17,6 +20,13 @@
using namespace torch::autograd;
using namespace torch::nn;

template <typename T>
bool almost_equal(torch::Tensor left, torch::Tensor right, T tolerance = 1e-4) {
You can consider using allclose for this if it helps.
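For reference, here is a sketch of what that helper could look like if it delegated to torch::allclose (illustrative only, not the code that landed in the PR; the way tolerance is mapped onto rtol/atol here is an assumption):

```cpp
#include <torch/torch.h>

// Sketch: let torch::allclose do the comparison; it checks
// |left - right| <= atol + rtol * |right| elementwise.
bool almost_equal(const torch::Tensor& left,
                  const torch::Tensor& right,
                  double tolerance = 1e-4) {
  return torch::allclose(left, right, /*rtol=*/tolerance, /*atol=*/tolerance);
}
```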
Just replicating here what I told @mrshenli in person: if the problem is that the two copies of Tensor in Module can get out of sync in C++ but not in Python, the correct solution is to make sure they stay in sync, one way or another. Taking a closer look now.
Some clarification about how replicate.py works:
This is because there isn't actually a …
After reminding myself about how parameter registration works in C++, I now agree that it is impossible to do assignment after clone. So I think that fixing up the cloned tensor directly is the right way to go.
This is completely unrelated to this PR, but out of curiosity I checked the C++ DP code. It seems that C++ DP is quite a bit slower than the Python one because of pytorch/torch/csrc/api/include/torch/nn/cloneable.h, lines 34 to 75 (at 8cde4c4).
Moreover, it is also incorrect with BNs (and other things that rely on in-place changes to buffers/parameters).
namespace {

// Note [Replicating Modules]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~
Thank you for a nice note, it is much appreciated!
"All inputs of ReduceAdd must have the same dtype, but got ", | ||
input.dtype(), " and ", inputs[0].dtype()); | ||
|
||
// TODO: use nccl reduce |
Does this TODO mean that C++ data parallel is less efficient than Python data parallel at the moment? Seems like worth a big warning in that case.
Yes, the current version would be slower. Let me implement nccl reduce and remove the TODO here.
// Configure gradient edges to point from replica parameters to original
// module parameters. See [Replicating Modules]
std::vector<std::shared_ptr<Module>> untyped_replicas(
    replicas.begin(), replicas.end());
I don't see why you have to allocate a vector here. Can't you just pass the iterators to the function?
I guess this is because of the way you handle children: you concatenate the replicas of all sibling children (not just the current child), and then you don't have iterators. But that seems like a pretty unusual way to write the recursion: normally you'd just recurse on child modules. Is there a reason it was done this way?
I remember I ran into problems when collecting child modules on line 126. Let me see if I can drop this untyped_replicas.
I think I can recurse on child modules by creating a map from *parameter to grad_fn, and retrieve that grad_fn in every recursion.
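A rough sketch of that idea (not the PR's actual implementation; the helper and type names here are made up for illustration): key a map by the original parameter's TensorImpl so that each level of the child-module recursion can look up the grad_fn that routes gradients back to the corresponding original parameter.

```cpp
#include <torch/torch.h>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical helper: map each original parameter to the grad_fn produced
// when it was broadcast, so child-module recursion can fetch it later.
using GradFnMap =
    std::unordered_map<at::TensorImpl*, std::shared_ptr<torch::autograd::Node>>;

GradFnMap build_grad_fn_map(
    const std::vector<torch::Tensor>& original_params,
    const std::vector<torch::Tensor>& broadcast_params) {
  GradFnMap grad_fns;
  for (size_t i = 0; i < original_params.size(); ++i) {
    // The broadcast output's grad_fn is the autograd node whose backward
    // pass accumulates gradients into the original parameter.
    grad_fns[original_params[i].unsafeGetTensorImpl()] =
        broadcast_params[i].grad_fn();
  }
  return grad_fns;
}
```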
//
// ReduceAdd can ONLY be used during the backward pass of data parallel. Forward
// pass cannot use this function as it does not set up gradient function and
// history at all. Do NOT try to use ReduceAdd for any other purposes.
Do we support double backwards through replicate in Python? I don't see any indication in issues that we don't...
Haven't tried yet. Let me add a test.
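For context, a minimal sketch of what exercising double backward looks like with the C++ autograd API (this is not the test added in this PR, and it does not go through replicate; it only shows the create_graph mechanics such a test would rely on):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor x = torch::randn({4, 4}, torch::requires_grad());
  torch::Tensor y = (x * x).sum();

  // First-order gradients with create_graph=true keep the graph around so
  // the gradients themselves can be differentiated.
  auto first = torch::autograd::grad(
      {y}, {x}, /*grad_outputs=*/{}, /*retain_graph=*/true,
      /*create_graph=*/true);

  // Differentiating the first-order gradient is the double-backward step.
  auto second = torch::autograd::grad({first[0].sum()}, {x});
  std::cout << second[0] << std::endl;
}
```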
If you want to land this to fix an urgent bug, I suppose I can be convinced to land it in its current state (I agree it's an improvement), but what I am confused about is why you had to implement ReduceAdd from scratch. Ostensibly, the Python implementation of DataParallel sets up some autograd backwards function, and to me, it seems like you ought to use whatever that backwards function was, and not make up a new one (this would also fix your TODO, if we are using nccl reduce in Python).
Hang on, before you do that, read my overall comment please :)
Excerpts from mrshenli's message of 2019-05-29 08:57:48 -0700:

mrshenli commented on this pull request.

+  autograd::variable_list apply(autograd::variable_list&& inputs) override {
+    TORCH_CHECK(!compute_requires_grad(inputs),
+        "ReduceAdd can only be used during the backward pass of data parallel.");
+
+    Tensor output = torch::zeros_like(inputs[0], {destination_device_});
+
+    for (auto& input: inputs) {
+      TORCH_CHECK(input.sizes() == inputs[0].sizes(),
+          "All inputs of ReduceAdd must have the same size, but got ",
+          input.sizes(), " and ", inputs[0].sizes());
+
+      TORCH_CHECK(input.dtype() == inputs[0].dtype(),
+          "All inputs of ReduceAdd must have the same dtype, but got ",
+          input.dtype(), " and ", inputs[0].dtype());
+
+      // TODO: use nccl reduce

Yes, the current version would be slower. Let me implement nccl reduce and remove the TODO here.
Python's …
Yes, that seems like the right way to do it. If you want to apply this hotfix on master, I'll let it through, but ideally this should just all live in C++.
Thanks. Let me first address the smaller revision requests and create issues to track the bigger ones. Will do that later today or tomorrow; I need to finish up something else now.
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The latest commit addressed @vishwakftw's comments on using allclose. TODOs are in #21144. @ezyang is the current solution acceptable?
@pytorchbot rebase this please
Let's go ahead and land this for now. Looking forward to the follow-up where we line up the Python and C++ versions exactly.
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot rebase this please
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The failed test passed on rerun.
Summary: As pointed out by SsnL in #20910, when clone destination is different from the module's device, `Cloneable` currently calls `clone()` and then `to()` on every parameter and buffer, where the first clone is unnecessary.
Pull Request resolved: #20995
Differential Revision: D15517353
Pulled By: mrshenli
fbshipit-source-id: 6b6dc01560540a63845663f863dea0a948021fa5
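As a rough illustration of the point in that summary (this is not the actual Cloneable change, and copy_to_device is a made-up helper name), a single to() with copy=true already produces an independent tensor on the target device, so an extra clone() is only needed when the devices match:

```cpp
#include <torch/torch.h>

// Illustrative helper: produce an independent copy of `param` on `device`
// without an intermediate clone when a device transfer happens anyway.
torch::Tensor copy_to_device(const torch::Tensor& param, torch::Device device) {
  if (param.device() == device) {
    // Same device: clone() is what gives us separate storage.
    return param.clone();
  }
  // Different device: to() with copy=true already allocates fresh storage.
  return param.to(device, /*non_blocking=*/false, /*copy=*/true);
}
```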
Hi @mrshenli! I have a struct ConvModel : torch::nn::Module { ... } and I use torch::nn::parallel::replicate, but I get the error "clone() has not been implemented for ConvModel. Subclass torch::nn::Cloneable instead of torch::nn::Module to inherit the ability to clone." If I try to inherit from torch::nn::Cloneable, I get another error. Can you give me some suggestions about this? @mrshenli
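A minimal sketch of the pattern the error message asks for (the layers in this ConvModel are made up for illustration): derive from torch::nn::Cloneable<ConvModel> and build/register every submodule inside reset(), which clone() uses to reconstruct the module. The follow-up error often comes from not implementing reset() (it is pure virtual in Cloneable) or from not passing the derived class as the template argument.

```cpp
#include <torch/torch.h>

// Illustrative module: Cloneable<ConvModel> provides the clone() that
// replicate() needs; reset() must (re)create and register every submodule
// and parameter so clone() can rebuild the module before copying values.
struct ConvModel : torch::nn::Cloneable<ConvModel> {
  ConvModel() {
    reset();
  }

  void reset() override {
    conv = register_module(
        "conv",
        torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 16, /*kernel_size=*/3)));
    fc = register_module("fc", torch::nn::Linear(16, 10));
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(conv->forward(x));
    x = torch::adaptive_avg_pool2d(x, {1, 1}).flatten(1);
    return fc->forward(x);
  }

  torch::nn::Conv2d conv{nullptr};
  torch::nn::Linear fc{nullptr};
};
```

With a module written this way, passing std::make_shared<ConvModel>() and a device list to torch::nn::parallel::replicate should get past the clone() error, since clone() now knows how to rebuild the module on each destination device.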
Fixes #19540
CC @nmerrill67
C++ data parallel was using Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, the gradient will not be aggregated into the original module. This commit fixes the problem by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.
Failed Attempt
Initially I tried implementing what we did in replicate.py, which uses the Broadcast autograd function to broadcast every parameter in the original module to all destination devices, and then assigns the broadcast results to each replica's _parameters dict.

This works in Python because derived module member fields (e.g., Linear.weight) and base module _parameters entries (e.g., Linear._parameters['weight']) reference the same parameter instance; assigning one of them applies to both. However, in C++, even though I can modify Module's parameters_ values and gradient edges to point to the broadcast source, I cannot touch the weight and bias member fields in Linear, because replicate cannot (and should not) add special-case handlers for every different module (see Linear.h, .cpp). Although they initially point to the same TensorImpl instance, after assigning to Module.parameters_['weight'] it will be different from Linear.weight.
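A small, self-contained illustration of that handle-rebinding behavior using only public APIs (this snippet is not from the PR; the internal parameters_ map behaves analogously to the named_parameters() copy used here):

```cpp
#include <torch/torch.h>

int main() {
  torch::nn::Linear linear(3, 4);

  // The Tensor handles returned by named_parameters() initially share the
  // same TensorImpl as the module's member field.
  auto params = linear->named_parameters();
  TORCH_CHECK(params["weight"].unsafeGetTensorImpl() ==
              linear->weight.unsafeGetTensorImpl());

  // Rebinding one handle does not rebind the other: afterwards the dict entry
  // and Linear::weight refer to different TensorImpls, which is why assigning
  // broadcast results into a parameter dict cannot update Linear's own field.
  params["weight"] = torch::zeros({4, 3});
  TORCH_CHECK(params["weight"].unsafeGetTensorImpl() !=
              linear->weight.unsafeGetTensorImpl());
}
```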
Solution Options
@gchanan and I had several discussions on this issue and figured out two solutions to this problem.
Option One [implemented in this PR]
Replicate the module in two steps:

1. Call Module.clone() to create a module replica on every destination device.
2. Manually set gradient edges from every parameter in every replica to the same parameter in the original module.

Option Two

Implement a Replicatable class (similar to Cloneable), and make it a friend class of Module. For more details see Note [Replicating Modules] in the code change.

Why not Cloneable?

I am inclined to go with option one, because replicate will only be used on data parallel. I feel it would be overkill if we had to change all existing module implementations due to a data parallel requirement.