fix: CUDA error 710 bugfix #1424
Conversation
@bowang007 Make sure to review this.
From my perspective, I see nothing wrong with sampling between
- Issue arising when compiling BERT models with 3+ inputs
- Added a temporary fix by decreasing the range of values allowed to the random number generator for creating input tensors to [0, 2), instead of [0, 5)
- Used random float inputs in the range [0, 2) instead of ints, then cast to the desired type. The ultimate effect of this change with regard to bug pytorch#1418 is that random floats are selected in the range [0, 2) and then cast to Int, effectively restricting the set of allowed ints to {0, 1}, as required by the model
- More robust fix to follow
8f41d83 to 595b9f4
// Make the value range for input tensor a uniform (float) distribution
// over [LoValIncl, HiValExcl), then cast to the desired dtype
auto in = ((HiValExcl - LoValIncl) * at::rand(shape, {at::kCUDA}) + LoValIncl).to(type);
Used float inputs in the range [0, 2), then cast to the desired dtype.
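As a sanity check, here is a minimal standalone sketch of the same idea (not the actual `shape_analysis` code) showing that uniform floats drawn from [0, 2) collapse to the integer set {0, 1} once cast. The bounds, tensor shape, and CPU device are placeholder assumptions chosen so the snippet runs on its own:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // Placeholder bounds mirroring LoValIncl/HiValExcl from the patch.
  const double LoValIncl = 0.0;
  const double HiValExcl = 2.0;

  // Uniform floats over [LoValIncl, HiValExcl), then cast to Long (int64).
  // The real code samples on kCUDA with the input's shape; a small CPU tensor
  // is used here purely for illustration.
  auto in = ((HiValExcl - LoValIncl) * at::rand({2, 8}) + LoValIncl).to(at::kLong);

  // Every element is now 0 or 1, i.e. a valid token_type_ids value.
  std::cout << in << std::endl;
  std::cout << "min: " << in.min().item<int64_t>()
            << ", max: " << in.max().item<int64_t>() << std::endl;
  return 0;
}
```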
This seems a bit hard-coded for this model only, but it will be resolved once the input range is opened up to users via RFC #1425.
LGTM
Description
Resolves a CUDA 710 error arising when compiling BERT models with 3+ inputs. The issue arises due to the role of the third input tensor in inference computations. Specifically, as specified in the BERT model code linked here, the third argument, `token_type_ids`, is of type `torch.LongTensor`, but can only take indices in $[0, 1]$. This means that when values outside of this set are used, the input is invalid. This becomes problematic when the inputs are, for example, indices into a dictionary or embedding, which seems to be the case here: `aten::embedding` is used with tensors which are the product of `token_type_ids`. The issue traces to one line in the `shape_analysis` code, previewed below, which initializes a random tensor with values in the range $[0, 4]$. This tensor is run through the `forward` function of the module to determine the shapes of the outputs, and it causes the compile-time error, as featured here in the shape analysis code.

I have added a temporary fix by decreasing the range of values allowed to the random number generator for creating input tensors to $[0, 1]$, instead of $[0, 4]$, and am working on a more robust fix.
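To make the failure mode concrete, below is an illustrative sketch in plain libtorch (not Torch-TensorRT code) of what happens when `token_type_ids` falls outside {0, 1}. The embedding sizes and index values are hypothetical; the point is that an out-of-range index into BERT's token-type embedding table trips a device-side assert on GPU, which is what surfaces as CUDA error 710 (`cudaErrorAssert`):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // BERT's token_type_embeddings table has only 2 rows (token type vocab size = 2).
  // The sizes here are hypothetical stand-ins for the real model's.
  auto token_type_embeddings =
      torch::nn::Embedding(/*num_embeddings=*/2, /*embedding_dim=*/4);

  // Valid token_type_ids: every value is in {0, 1}.
  auto ok_ids = torch::tensor({{0, 1, 1, 0}}, torch::kLong);
  std::cout << token_type_embeddings(ok_ids).sizes() << std::endl;  // [1, 4, 4]

  // Ids sampled from [0, 4] may contain 2, 3, or 4, which index past the table.
  // On CPU this throws an out-of-range error; on CUDA the same lookup trips a
  // device-side assert, surfacing as CUDA error 710 (cudaErrorAssert).
  auto bad_ids = torch::tensor({{0, 3, 4, 1}}, torch::kLong);
  try {
    token_type_embeddings(bad_ids);
  } catch (const c10::Error& e) {
    std::cout << "embedding lookup failed: index out of range" << std::endl;
  }
  return 0;
}
```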
Fixes #1418
Type of change
Please delete options that are not relevant and/or add your own.
Checklist: