feat: support aten._cdist_forward converter #2726
Conversation
x1: TRTTensor,
x2: TRTTensor,
p: float,
compute_mode: Optional[int] = None,
It seems that `compute_mode` is not used below. Is there any difficulty in supporting it? Ideally, we should mimic all of PyTorch's behaviors (i.e., receive the same args and return the same outputs as PyTorch, based on the schema). If `compute_mode` cannot be supported for some reason, we need to use a `capability_validator` to make sure such cases are not converted and instead fall back to PyTorch.
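For illustration, a rough sketch of what that could look like (the registry import path, decorator signature, and validator behavior below are my assumptions, not verified against the current code):

```python
# Sketch (assumed API): reject unsupported compute_mode values so those calls
# fall back to PyTorch instead of being converted.
import torch
from torch.fx.node import Node
from torch_tensorrt.dynamo.conversion._ConverterRegistry import (
    dynamo_tensorrt_converter,  # import path assumed
)


def cdist_compute_mode_validator(node: Node) -> bool:
    # compute_mode is the 4th positional arg in the _cdist_forward schema and
    # may be omitted; only accept the modes the converter actually supports.
    compute_mode = node.args[3] if len(node.args) > 3 else None
    return compute_mode in (None, 0)


@dynamo_tensorrt_converter(
    torch.ops.aten._cdist_forward.default,
    capability_validator=cdist_compute_mode_validator,
)
def aten_ops_cdist_forward(ctx, target, args, kwargs, name):
    ...  # conversion logic elided
```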
`compute_mode` does not influence the output values. It is designed to optimize computational efficiency in specific situations, namely when calculating the Euclidean distance (p=2) for large inputs.
I have made the following changes to clarify this (see the sketch after this list):
- A note has been added that `compute_mode` does not alter the computational path in our implementation.
- A default value of 0 is used for `compute_mode` when it is unspecified, keeping the behavior consistent with PyTorch.
- A warning is emitted for situations where an optimized path for p=2 might be expected but is not used, due to our unified element-wise approach.
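Roughly, the added handling looks like this (a simplified standalone sketch, not the exact converter code; the helper name is illustrative):

```python
import logging

logger = logging.getLogger(__name__)


def normalize_compute_mode(compute_mode, p):
    # Default to 0 when unspecified, matching PyTorch's schema default.
    if compute_mode is None:
        compute_mode = 0
    # Every mode takes the same element-wise path in this converter, so warn
    # when the caller explicitly asked for the matmul-optimized p=2 path.
    if p == 2 and compute_mode == 1:
        logger.warning(
            "compute_mode=1 requests a matmul-based path for p=2, but this "
            "converter computes the distance element-wise."
        )
    return compute_mode
```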
Thanks for the explanation! I'm still wondering whether using `matmul` would be faster for the two scenarios above. Considering that we have an out-of-the-box matmul converter, could you compare their time costs if possible?
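For context, the matmul trick uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b, replacing the large broadcasted difference tensor with a single matrix multiplication. A quick standalone PyTorch check of the identity (illustrative shapes, not converter code):

```python
import torch

x1 = torch.randn(10, 5)
x2 = torch.randn(20, 5)

# Element-wise path: broadcast to (10, 20, 5), then reduce over the feature axis.
elementwise = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1).sqrt()

# Matmul path: ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b, one matmul instead of a
# full (10, 20, 5) intermediate tensor.
sq = (x1 ** 2).sum(-1, keepdim=True) + (x2 ** 2).sum(-1) - 2 * x1 @ x2.T
matmul_based = sq.clamp_min(0).sqrt()

print(torch.allclose(elementwise, matmul_based, atol=1e-5))  # True up to fp error
```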
Thank you for your valuable comment. In this update, I've implemented a matrix multiplication-based approach to compute the distance for p=2. During my tests on my local PC with inputs of shape x1=(150, 100, 50, 50) and x2=(150, 100, 30, 50), I observed that the TRT run time improved by approximately 6.3x. Here are the relevant logs from those tests:
INFO:harness:FX graph= graph():
%x1 : [num_users=1] = placeholder[target=x1]
%x2 : [num_users=1] = placeholder[target=x2]
%_cdist_forward_default : [num_users=1] = call_function[target=torch.ops.aten._cdist_forward.default](args = (%x1, %x2, 2, 0), kwargs = {})
return _cdist_forward_default
===============based on matmul : ==================
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.007999
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py:255: DeprecationWarning: Use build_serialized_network instead.
engine = self.builder.build_engine(self.ctx.net, builder_config)
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:33.710717
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 244801024 bytes of Memory
INFO:harness:Interpreter run time(s): 33.719337898997765
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:63: DeprecationWarning: Use get_tensor_name instead.
self.engine.get_binding_index(name) for name in self.input_names
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:66: DeprecationWarning: Use get_tensor_name instead.
self.engine.get_binding_index(name) for name in self.output_names
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:88: DeprecationWarning: Use get_tensor_dtype instead.
self.engine.get_binding_dtype(idx), Frameworks.TORCH
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:93: DeprecationWarning: Use get_tensor_shape instead.
tuple(self.engine.get_binding_shape(idx))
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:98: DeprecationWarning: Use get_tensor_dtype instead.
self.engine.get_binding_dtype(idx), Frameworks.TORCH
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:247: DeprecationWarning: Use set_input_shape instead.
self.context.set_binding_shape(
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:262: DeprecationWarning: Use get_tensor_shape instead.
shape = tuple(self.context.get_binding_shape(idx))
INFO:harness:TRT run time(s)= 0.0026535038948059084
.INFO:harness:FX graph= graph():
%x1 : [num_users=1] = placeholder[target=x1]
%x2 : [num_users=1] = placeholder[target=x2]
%_cdist_forward_default : [num_users=1] = call_function[target=torch.ops.aten._cdist_forward.default](args = (%x1, %x2, 2, 1), kwargs = {})
return _cdist_forward_default
===============based on elementwise pow for diff : ==================
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.003120
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py:255: DeprecationWarning: Use build_serialized_network instead.
engine = self.builder.build_engine(self.ctx.net, builder_config)
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:43.902498
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 4590000640 bytes of Memory
INFO:harness:Interpreter run time(s): 43.90606521600421
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:63: DeprecationWarning: Use get_tensor_name instead.
self.engine.get_binding_index(name) for name in self.input_names
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:66: DeprecationWarning: Use get_tensor_name instead.
self.engine.get_binding_index(name) for name in self.output_names
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:88: DeprecationWarning: Use get_tensor_dtype instead.
self.engine.get_binding_dtype(idx), Frameworks.TORCH
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:93: DeprecationWarning: Use get_tensor_shape instead.
tuple(self.engine.get_binding_shape(idx))
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:98: DeprecationWarning: Use get_tensor_dtype instead.
self.engine.get_binding_dtype(idx), Frameworks.TORCH
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:247: DeprecationWarning: Use set_input_shape instead.
self.context.set_binding_shape(
/home/hoonkyungc/workspace/TorchTRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py:262: DeprecationWarning: Use get_tensor_shape instead.
shape = tuple(self.context.get_binding_shape(idx))
INFO:harness:TRT run time(s)= 0.016830976486206056
.
----------------------------------------------------------------------
It seems `matmul` runs faster in both the compilation stage and the running stage, and saves a lot of memory.
You're right. While always using matrix multiplication might be more efficient given TensorRT's capabilities, I've aligned the implementation with the `compute_mode` semantics of the ATen operation for consistency. Thank you for pointing this out!
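For reference, the `compute_mode` semantics of `torch.cdist` that the implementation now follows can be sketched like this (plain Python based on the documented behavior, not the converter itself):

```python
def use_matmul_path(p: float, compute_mode: int, P: int, R: int) -> bool:
    """Mirror torch.cdist's documented compute_mode choices.

    0: use matmul for p=2 only when P > 25 or R > 25 (the default)
    1: always use matmul for p=2
    2: never use matmul
    """
    if p != 2:
        return False
    if compute_mode == 0:
        return P > 25 or R > 25
    return compute_mode == 1
```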
print("x1 : ", x1) | ||
print("x2 : ", x2) | ||
|
||
return torch.ops.aten._cdist_forward.default(x1, x2, p, None) |
Different `compute_mode` values other than `None` should be tested.
I have made modifications so that different `compute_mode` values can be tested via the test case inputs.
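A simplified sketch of what the parameterized test looks like (assuming the repo's DispatchTestCase harness; the shapes and parameter sets here are illustrative):

```python
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests

from .harness import DispatchTestCase  # harness location assumed


class TestCdistConverter(DispatchTestCase):
    @parameterized.expand(
        [
            ("p_2_mode_none", 2.0, None),
            ("p_2_mode_use_mm", 2.0, 1),
            ("p_2_mode_no_mm", 2.0, 2),
            ("p_1_mode_none", 1.0, None),
        ]
    )
    def test_cdist_compute_modes(self, name, p, compute_mode):
        class Cdist(torch.nn.Module):
            def forward(self, x1, x2):
                return torch.ops.aten._cdist_forward.default(x1, x2, p, compute_mode)

        inputs = [torch.randn(4, 10, 5), torch.randn(4, 20, 5)]
        self.run_test(Cdist(), inputs)


if __name__ == "__main__":
    run_tests()
```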
Overall looks good - added a few suggestions
x1=args[0],
x2=args[1],
p=args[2],
compute_mode=args[3],
Based on the schema having `int?` as the type for `compute_mode`, this should be `compute_mode = args_bounds_check(args, 2, None)`.
Thank you for the review! I've made the requested changes (`compute_mode = args_bounds_check(args, 3, None)`).
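For readers unfamiliar with the helper: `args_bounds_check` simply returns the argument at a given index when it was supplied and a default otherwise, roughly:

```python
from typing import Any, Sequence


def args_bounds_check(args: Sequence[Any], i: int, replacement: Any = None) -> Any:
    # Return args[i] when the optional argument is present, else the default.
    return args[i] if len(args) > i else replacement


# In the converter, compute_mode is the 4th schema argument and may be omitted:
# compute_mode = args_bounds_check(args, 3, None)
```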
x1: TRTTensor,
x2: TRTTensor,
p: float,
compute_mode: int,
Switch to `Optional[int]`, since `None` is a valid choice.
Thank you for your comment! I've made the requested changes.
else:
    raise NotImplementedError(f"Currently, p={p} is not implemented.")
This can be removed, since it seems you have covered all of the valid intervals in Torch's definition of cdist.
Thank you for the review! I've made the requested changes.
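For reference, the p intervals the converter now covers can be summarized with this plain-PyTorch sketch (reference semantics only, not the actual TRT layer calls):

```python
import torch


def cdist_reference(x1: torch.Tensor, x2: torch.Tensor, p: float) -> torch.Tensor:
    # diff has shape (..., N1, N2, D) after broadcasting the row dimensions.
    diff = x1.unsqueeze(-2) - x2.unsqueeze(-3)
    if p == 0:
        # Hamming-style count of non-zero feature differences.
        return (diff != 0).sum(dim=-1).to(x1.dtype)
    if p == float("inf"):
        # Chebyshev distance: maximum absolute difference per pair.
        return diff.abs().amax(dim=-1)
    # Covers 0 < p < 1, p == 1, p == 2, and any other finite p > 0.
    return diff.abs().pow(p).sum(dim=-1).pow(1.0 / p)
```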
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/impl/normalization/ops.py 2024-04-29 14:43:08.889448+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/impl/normalization/ops.py 2024-04-29 14:44:58.684466+00:00
@@ -549,11 +549,11 @@
- If x1.shape = [2, 3, 10, 5] and x2.shape = [2, 3, 20, 5], both having the same batch dimensions [2, 3], the output shape will be [2, 3, 10, 20].
This represents computing distances in two batches of three groups, each comparing 10 vectors from x1 with 20 vectors from x2.
- For x1.shape = [10, 5] (10 vectors, each of 5 features) and x2.shape = [20, 5] (20 vectors, each of 5 features),
since there are no batch dimensions to match, the output shape is simply [10, 20], comparing all vectors from x1 against all vectors from x2.
- Note: The `compute_mode` parameter is designed to optimize the performance of the Euclidean distance calculation, especially useful when working with large datasets.
+ Note: The `compute_mode` parameter is designed to optimize the performance of the Euclidean distance calculation, especially useful when working with large datasets.
This parameter allows you to control how the distances are computed, with different modes available to leverage matrix multiplication for speed improvements.
"""
if compute_mode is None:
compute_mode = 0
@chohk88 The CI failed. Can you take a look?
1962cf8 to 9f901f6 (Compare)
@chohk88 It seems the tests use too large a dataset, which caused an OOM error in CI. My advice is to scale them down, but make sure every condition path is still covered.
@zewenli98 Thank you for your comments! The issue with CI/CD failures due to the dataset size has been resolved. However, while the Windows wheel build has completed successfully, the Linux wheel build is still failing.
It seems to work now. LGTM.
Description
New feature to support the aten._cdist_forward converter. I have added test cases ensuring compatibility with both matching and broadcasting input shapes.
Fixes # (issue)
Type of change
Checklist: