Fixes selection of cuDNN algorithm #15881
Conversation
// update convDesc mathType since cudnn now requires both algo + mathType to figure out
// whether to use Tensor cores or not
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(args.cdesc.mut_desc(), fwdAlgPerf.mathType));
I'm a little confused here: why don't you modify the convolution descriptor before calling chooseAlgorithm?
That is because calls to cudnnSetConvolutionMathType before chooseAlgorithm don't matter. I only needed to set the convolution math type to the math type that chooseAlgorithm returns, rather than modify anything else, so I deleted the calls to cudnnSetConvolutionMathType in the descriptor's setter and instead call it directly after chooseAlgorithm at all the subsequent sites. It turns out that the mathType returned by chooseAlgorithm can differ from what we set earlier in the setter, so we have to explicitly update it after chooseAlgorithm has found the best algorithm+mathType pair. Otherwise, even though we'd be calling cudnnConvolutionForward with the fastest algorithm, under the hood cudnn would run it with a slower kernel, since it sees the fastest algorithm paired with a suboptimal mathType.
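To make the "best pair" point concrete, here is a minimal sketch of the selection logic being described. All types and names (PerfResult, chooseFastest) are hypothetical stand-ins for the cudnnConvolutionFwdAlgoPerf_t machinery; no real cuDNN calls are made.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the perf-result fields involved.
enum MathType { DEFAULT_MATH, TENSOR_OP_MATH };
struct PerfResult { int algo; MathType mathType; float time; };

// Mimics chooseAlgorithm: the search ranks (algo, mathType) *pairs* by time,
// so the winning pair's mathType must be applied to the descriptor as well.
PerfResult chooseFastest(const std::vector<PerfResult>& results) {
  return *std::min_element(results.begin(), results.end(),
      [](const PerfResult& a, const PerfResult& b) { return a.time < b.time; });
}

// Note: applying the winning algo with a different mathType than the one it
// was measured with can silently fall back to a slower kernel.
```

The same algo can appear twice in the results with different math types and very different timings, which is exactly why re-setting the math type after the search matters.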
OK, this is a very good comment to put in a Note [chooseAlgorithm doesn't respect mathType]
and reference from all of these sites :)
Added :)
aten/src/ATen/cudnn/Descriptors.h
Outdated
AT_CUDNN_CHECK(cudnnSetConvolutionGroupCount(mut_desc(), groups));
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(mut_desc(), CUDNN_DEFAULT_MATH));
What I expected was to see mathType added as an argument to this method, with a call to cudnnSetConvolutionMathType here (instead of twiddling it in multiple locations further down this diff, where you could easily forget a site).
Yeah, so I finally actually reviewed the patch, and I understand why I am confused.
The invariant I want from this code is that the default convolution descriptor (when I call set() on it) has "good" settings. By deleting these lines, we end up with a descriptor where everything is default, even though we know that in the absence of benchmarking, we subsequently toggle CUDNN_TENSOR_OP_MATH on for half inputs. So it seems to me that it would be better to set this here, and then delete the code that does the equivalent modification at
pytorch/aten/src/ATen/native/cudnn/Conv.cpp, line 706 in fc574b5:
algoPerf->algo = search::DEFAULT_ALGO;
What do you think?
That's totally doable. I was debating whether to call cudnnSetConvolutionMathType multiple times vs. keeping those extra lines for the default mathType and algo. But since cudnnSetConvolutionMathType is cheap, let's bring those lines back and get rid of the multiple mathType code paths.
Please see the latest changes (014b7b9). I think it addresses both of our points of confusion?
No, I am still confused. In the latest patch, you are still not calling cudnnSetConvolutionMathType inside ConvolutionDescriptor; instead, you're returning cudnnMathType_t via a pointer argument, which is weird and pretty out of line with how the rest of the descriptors work.
Also, I don't understand why it is necessary to pre-fill fwdAlgPerf.mathType with a math type in your code. The invariant I would expect is that chooseAlgorithm always leaves fwdAlgPerf in a valid state. Is the problem this code?
if (args.params.deterministic && !benchmark) {
*algo = search::DEFAULT_ALGO;
return;
}
I don't know, but I could imagine that the DEFAULT_ALGO here isn't half-aware, and so you get the wrong math type in that case? If so, I would expect the fixup for the half case to occur here.
I cannot call cudnnSetConvolutionMathType in the descriptor's setter, because the cudnnSetConvolutionMathType called after chooseAlgorithm would overwrite the effect of the one in the descriptor's setter, and it would overwrite it with garbage on the default path if I didn't have the following in chooseAlgorithm.
No, I don't agree with this. Look at the code here:
args.cdesc.set(dataType, input.dim() - 2, args.params.padding, args.params.stride, args.params.dilation, args.params.groups);
// TODO: when we do legacy group convolution support, we'll repeatedly
// reinitialize the workspace for each convolution we do. This is
// wasteful; we'd rather reuse the workspace. OTOH, legacy group
// convolution support is already pretty slow, so this might not
// matter. (This applies to raw_cudnn_convolution_backward_input as well.)
cudnnConvolutionFwdAlgo_t fwdAlg;
Workspace workspace = chooseAlgorithm(args, benchmark, &fwdAlg);
The set method on the descriptor is clearly called before chooseAlgorithm. I additionally audited every other occurrence of set( in native/cudnn/Conv.cpp to convince myself that ConvolutionDescriptor::set is always called before chooseAlgorithm in all codepaths.
I certainly agree that in the event that benchmarking occurs, we may need to call cudnnSetConvolutionMathType again after benchmarking to update it (thus leading to an extra, technically unnecessary, setting of the math type). In the interest of code clarity, I am recommending that you do this extra work.
I didn't respond to your paragraphs after this claim. Let me know if I should.
Ok. I prepared a diff for you based on your most recent reply (syed-ahmed@0499daa), but I'm embarrassed to say that I've discovered a cudnn bug! :'(
The bug is: if I call cudnnSetConvolutionMathType in the descriptor's setter (which sets the mathType to CUDNN_TENSOR_OP_MATH for fp16), cudnnGet*_v7 returns algo1 with the CUDNN_TENSOR_OP_MATH math type, instead of ignoring what was set by cudnnSetConvolutionMathType before it (and returning algo1 with CUDNN_DEFAULT_MATH, which is performant)!
Any suggestions on how I can move forward? I have filed a bug internally.
Interesting. Sorry about sending you on this wild goose chase!
In that case, I think something similar to an earlier version of your patch would be preferred (before you started changing it due to my comments). Here's the new justification:
- The job of ConvolutionDescriptor::set is to set up a convolution descriptor so that algorithm search works for it. Since setting the math type to CUDNN_TENSOR_OP_MATH prevents cuDNN from considering non-tensor-op algorithms, it is correct for it to unconditionally set CUDNN_DEFAULT_MATH.
- The job of findAlgorithm is to update the ConvolutionDescriptor with correct, performant parameters, even if we never actually call the cudnn benchmarking function. As the code is structured right now, you'll have to write the code to adjust the math setting in multiple places. I guess it's not worth fixing now.
So... maybe just reverting to e5038f9 and adding some comments from our investigation is sufficient? Once again, sorry about the wild goose chase!
Oh no no. It's all good. Your points are legit. I'll put the comments in for now and hopefully we can address this again in the future.
Done. :)
Yes please!
Force-pushed from 20647b8 to fc574b5
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This PR updates the logic for using cudnnGet* and cudnnFind*. The current version of cudnn find and get (v7) returns a pair of best algorithm and the convDesc mathType. While we were using the returned algorithm, we didn't update the mathType. As a result, we ended up with a slow choice of algorithm and math type. Without this patch, we are seeing a 10x regression in group convolutions.
Changelist:
- Changed the template arguments to be `perf_t` instead of `algo_t` to unify cudnnFind and cudnnGet. Both cudnnFind and cudnnGet have the same purpose and hence, it made sense to unify them and get rid of `getAlgorithm`.
- Used cudnnGet*_v7 everywhere cudnnGet* was being used.
- Removed all cudnn6 paths. (This PR depends on #15851)
Differential Revision: D13787601
Pulled By: ezyang
fbshipit-source-id: 81fe86727673d021306fe1c99c3e528b7c9ad17f
Did this have any impact on float() algo selection beyond the half() example?
@bhack There wasn't any impact on float(), because in the float() case we were calling cudnnSetConvolutionMathType with CUDNN_DEFAULT_MATH and cudnnGet/Find was giving algo1. As a result, even though we weren't calling cudnnSetConvolutionMathType to update the cdesc mathType after cudnnGet/Find, we were still using the performant algo+mathType combo. Following is a before and after nvprof for float() for your reference.
This reverts commit 2ebb8fd.
This reverts commit 3837446.
Summary: There is a regression in cudnnGet*_v7 that causes a slowdown in resnet50 training. I am opening a bug with the cuDNN team about this. This reverts commit 3837446. ezyang 😿
Pull Request resolved: #16484
Differential Revision: D13924755
Pulled By: soumith
fbshipit-source-id: 8c719345fc443f1289539bfae630eea9224ba4a5
Force-pushed from 1a35ceb to c621e9a
aten/src/ATen/native/cudnn/Conv.cpp
Outdated
// update convDesc mathType since cudnn 7.4+ now requires both algo + mathType to figure out
// whether to use Tensor core kernels or not
// See Note [behavior of cudnnFind and cudnnGet]
AT_CUDNN_CHECK(cudnnSetConvolutionMathType(args.cdesc.mut_desc(), perfResults[best_algo_idx].mathType));
This won't work. This will set the correct math type only the first time, when you are actually running find. All the subsequent times, cdesc will be created with the default math type (default for fp32, tensor cores for fp16), and it won't ever be changed.
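The caching pitfall being pointed out can be sketched like this. Everything here is a hypothetical model (algo_cache, applyAlgorithm, and the structs are invented for illustration): the expensive search runs once and its result is cached, so the descriptor's math type must be re-applied from the cached perf result on every call, not only on the call that performed the search.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins; no real cuDNN types or calls here.
enum MathType { DEFAULT_MATH, TENSOR_OP_MATH };
struct PerfResult { int algo; MathType mathType; };
struct ConvDesc { MathType mathType = DEFAULT_MATH; };

std::map<int, PerfResult> algo_cache;  // keyed by a hash of the conv params
int searches_run = 0;

// The expensive search runs once per key; later calls hit the cache.
PerfResult getAlgorithm(int key) {
  auto it = algo_cache.find(key);
  if (it != algo_cache.end()) return it->second;
  ++searches_run;
  PerfResult found{1, TENSOR_OP_MATH};  // pretend the search picked this pair
  algo_cache[key] = found;
  return found;
}

// The fix: re-apply the cached mathType to the descriptor on *every* call,
// not only on the call that actually ran the search.
void applyAlgorithm(int key, ConvDesc& desc) {
  desc.mathType = getAlgorithm(key).mathType;
}
```

If the math-type update were placed only on the search path, a freshly constructed descriptor on a cache hit would keep its default math type forever, which is the bug described above.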
Gotcha, I see what you are saying. I was trying to address @ezyang's request to remove the following code blocks; this can either stay like this, or I could make the setter take perf_t and do it there once.
if (dataType == CUDNN_DATA_HALF) {
algoPerf->mathType = CUDNN_TENSOR_OP_MATH;
} else {
algoPerf->mathType = CUDNN_DEFAULT_MATH;
}
Co-authored-by: Natalia Gimelshein <ngimelshein@nvidia.com>
Force-pushed from c621e9a to 436aea6
// When cudnnSetConvolutionMathType is called before cudnnGet/cudnnFind, it informs
// cudnnGet/cudnnFind to iterate/take into account both tensor core and non-tensor-core algos.
// If you don't call cudnnSetConvolutionMathType before calling cudnnGet/cudnnFind,
// cudnnGet/cudnnFind may not pick tensor core algos.
I'm planning to accept this patch as is, but one thing I don't quite get from the description here is whether or not the math type parameter you pass in matters.
It does matter, as it turns out. Should I word this differently (I don't think one can decipher this from the cuDNN docs either)? For instance, before this patch, since I didn't pass CUDNN_TENSOR_OP_MATH, cudnnFind/cudnnGet was not picking the algorithms that enable tensor cores.
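The effect described in the note can be modeled with a toy sketch (enumerateCandidates and the types are invented for illustration, not real cuDNN API): the math type set on the descriptor before the search controls which algorithm families the search enumerates at all.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins; no real cuDNN types or calls here.
enum MathType { DEFAULT_MATH, TENSOR_OP_MATH };
struct Candidate { int algo; MathType mathType; };

// Toy model of cudnnGet/cudnnFind: with DEFAULT_MATH pre-set on the
// descriptor, tensor-core candidates may never be enumerated at all.
std::vector<Candidate> enumerateCandidates(MathType presetMathType) {
  std::vector<Candidate> out = {{0, DEFAULT_MATH}};
  if (presetMathType == TENSOR_OP_MATH) {
    out.push_back({1, TENSOR_OP_MATH});  // only visible when pre-set
  }
  return out;
}
```

This is why the pre-search cudnnSetConvolutionMathType call is load-bearing: without it, the fastest tensor-core pair simply never appears in the results to be chosen.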
Here are the resnet50 numbers with top of tree. Will update with MaskRCNN numbers as soon as I get them:
I attempted to read the algo selection code a few times this weekend, and just gave up lol. I'm going to just go ahead and merge this and trust you guys ;)
@ezyang but you wrote it, or at least moved it from the previous incarnation of cudnn bindings :-)
I plead copy paste!
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This PR updates the logic for using cudnnGet* and cudnnFind*. The current version of cudnn find and get (v7) returns a pair of best algorithm and the convDesc mathType. While we were using the returned algorithm, we didn't update the mathType. As a result, we ended up with a slow choice of algorithm and math type. Without this patch, we are seeing a 10x regression in group convolutions.
Changelist:
- Changed the template arguments to be `perf_t` instead of `algo_t` to unify cudnnFind and cudnnGet. Both cudnnFind and cudnnGet have the same purpose and hence, it made sense to unify them and get rid of `getAlgorithm`.
- Used cudnnGet*_v7 everywhere cudnnGet* was being used.
- Removed all cudnn6 paths. (This PR depends on #15851)
Differential Revision: D13957944
Pulled By: ezyang
fbshipit-source-id: a88c39d80ae37f2d686665622302b62b50fab404
Summary:
This PR updates the logic for using cudnnGet* and cudnnFind*. Current version of cudnn find and get (v7) returns a pair of best algorithm and the convDesc mathType. While we were using the returned algorithm, we didn't update the mathType. As a result, we ended up with a slow choice of algorithm and math type. Without this patch, we are seeing a 10x regression in group convolutions.
Changelist:
- Changed the template arguments to be `perf_t` instead of `algo_t` to unify cudnnFind and cudnnGet. Both cudnnFind and cudnnGet have the same purpose and hence, it made sense to unify them and get rid of `getAlgorithm`.
CC: @ngimel @csarofeen for review
Nvprof:
Timing:
Code used in analysis: