Support Data Dependent Shape (DDS) and NonZero op #3364
base: main
Conversation
Force-pushed from 2ee9299 to 3b60296
LGTM
@@ -524,12 +537,26 @@ def forward(self, *inputs: torch.Tensor) -> torch.Tensor | Tuple[torch.Tensor, ...]:

        self._caller_stream.wait_stream(self._engine_stream)

        if self.use_pre_allocated_outputs:
You might want to check with @keehyuna on how this change would affect the use_pre_allocated_outputs optimization that he implemented?
Yeah, I think we can directly remove the pre_allocated_outputs related arguments, because we don't have to manually allocate memory for the outputs. Instead, the reallocate_output_async function will take care of everything and automatically reuse pre_allocated_outputs if possible. @keehyuna Can you take a look?
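For illustration only, a minimal Python sketch of the reuse behavior described above; the names are hypothetical and this is not the actual reallocate_output_async implementation:

```python
import torch

def reuse_or_allocate(buffers: dict, name: str, nbytes: int, device: torch.device) -> torch.Tensor:
    """Hypothetical sketch: keep one byte buffer per output and only allocate
    a new one when the requested size no longer fits."""
    buf = buffers.get(name)
    if buf is None or buf.numel() < nbytes:
        buf = torch.empty(nbytes, dtype=torch.uint8, device=device)
        buffers[name] = buf
    return buf
```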
Yes, pre_allocated_outputs is not required if the allocation is properly reused. I tried to check the behavior, but there was a "Segmentation fault" during build_serialized_network().
I added nvtx events in PythonTorchTensorRTModule:TensorRTRuntime and reallocate_output_async. They run concurrently when I test the resnet model. It seems pre_allocated_outputs is not required and the async output path will be beneficial in most cases. I will check perf in more cases. I think this PR can go ahead since use_pre_allocated_outputs is disabled by default; I will revert the pre-allocated output feature after some more testing.

BTW, will the cpp runtime have async output?
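For context, a rough sketch of how such ranges can be emitted from Python with torch.cuda.nvtx (PyTorch's NVTX binding); the two callables and labels here are placeholders for the phases mentioned above:

```python
import torch

def profile_phases(enqueue_engine, reallocate_outputs):
    # Illustrative placeholders: wrap each phase in an NVTX range so an
    # Nsight Systems timeline shows whether the work they launch overlaps.
    torch.cuda.nvtx.range_push("TensorRTRuntime")
    enqueue_engine()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("reallocate_output_async")
    reallocate_outputs()
    torch.cuda.nvtx.range_pop()
```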
Yeah, I'm working on the cpp runtime, but it may take a couple of days.
Hi @keehyuna, I have updated the cpp runtime. Can you confirm that pre_allocated_outputs is not needed anymore?
Hi @zewenli98, I checked the cpp runtime; reallocateOutputAsync and the ExecutionContext are running concurrently as well. I will clean up the remaining pre_allocated_outputs code.
@keehyuna Thanks for confirming. By the way, I have cleaned up some of the pre_allocated_outputs related code. Can you take a look and check whether that's complete? If anything else should be removed, please let me know!
Force-pushed from 2846ef7 to 605bba0
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py 2025-02-04 08:16:20.457716+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py 2025-02-04 08:16:43.664304+00:00
@@ -546,11 +546,15 @@
for o, output_name in enumerate(self.output_names):
assert self.output_allocator is not None
shape = self.output_allocator.shapes.get(output_name, None)
self.output_shapes[o] = shape
dtype = self.output_dtypes[o]
- output = self.output_allocator.buffers.get(output_name, None).clone().detach()
+ output = (
+ self.output_allocator.buffers.get(output_name, None)
+ .clone()
+ .detach()
+ )
prod = int(torch.prod(torch.tensor(shape)))
output = output.reshape(-1).view(dtype)[:prod].reshape(shape)
outputs.append(output)
if len(outputs) == 1:
return outputs;
for (const auto& output_name : compiled_engine->out_binding_names) {
  if (!compiled_engine->exec_ctx->setOutputAllocator(output_name.c_str(), compiled_engine->output_allocator.get())) {
    throw std::runtime_error("Failed to set output allocator for " + output_name);
Use the error utilities we already have, like TORCHTRT_THROW_ERROR.
"Error while setting the output tensor address"); | ||
} | ||
} | ||
setup_output_allocator(compiled_engine); |
We need to do this every time the engine is going to run?
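One way to read this concern, sketched with hypothetical names: register the allocator once and remember that it was installed, rather than re-registering it on every execution.

```python
class EngineStateSketch:
    """Hypothetical: install the output allocator on the execution context
    once, not on every forward call."""

    def __init__(self) -> None:
        self._allocator_installed = False

    def ensure_output_allocator(self, install_fn) -> None:
        # install_fn stands in for the per-output setOutputAllocator calls.
        if not self._allocator_installed:
            install_fn()
            self._allocator_installed = True
```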
} else if (outputs.size() > 0) {
  current_device_id = outputs[0].device().index(); // Done this way to avoid a call to cudart
} else {
  current_device_id = c10::cuda::current_device();
This is going to call cudart, right? That will incur a significant cost.
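A rough Python analogue of the trade-off being discussed: reading the device index off a tensor already in hand avoids a runtime query, which is only needed as a last resort.

```python
import torch

def pick_device_index(outputs, inputs):
    # Assumes the tensors are CUDA tensors, so .device.index is an int.
    if outputs:
        return outputs[0].device.index   # no runtime query
    if inputs:
        return inputs[0].device.index    # no runtime query
    return torch.cuda.current_device()   # fallback: asks the CUDA runtime
```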
auto name = compiled_engine->out_binding_names[i];
auto dims = compiled_engine->output_allocator->getShapes().at(name);
auto dtype = util::TRTDataTypeToScalarType(compiled_engine->exec_ctx->getEngine().getTensorDataType(name.c_str()));
at::Tensor output = compiled_engine->output_allocator->getBuffers().at(name).clone().detach();
Can we not directly copy to the result buffers like before, instead of allocating new memory here? Or does it not make a difference?
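For reference, a minimal sketch of the two options being compared, with placeholder names; clone() allocates a fresh tensor, while copy_() writes into a destination the caller already owns.

```python
import torch
from typing import Optional

def fetch_output(allocator_buffer: torch.Tensor, result: Optional[torch.Tensor] = None) -> torch.Tensor:
    if result is None:
        # Approach in the snippet above: allocate new memory and copy.
        return allocator_buffer.clone().detach()
    # Alternative the reviewer asks about: copy into an existing buffer
    # (assumes `result` already has the right shape and dtype).
    result.copy_(allocator_buffer)
    return result
```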
}
output = output.reshape(-1).view(dtype).slice(0, 0, prod).reshape(dims_vec);
why do we need this?
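Presumably this mirrors the Python runtime change shown in the lint diff above: the allocator hands back an over-sized byte buffer, and the real (data-dependent) shape is only known after execution, so the buffer is reinterpreted as the output dtype, truncated to the element count, and reshaped. A small self-contained illustration:

```python
import torch

raw = torch.empty(64, dtype=torch.uint8)            # over-sized byte buffer from the allocator
shape = (3, 2)                                       # data-dependent shape reported after the run
prod = int(torch.prod(torch.tensor(shape)))          # number of real elements
out = raw.view(torch.float32)[:prod].reshape(shape)  # reinterpret, truncate, reshape (24 of 64 bytes)
```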
My feeling is that the output allocator needs to trigger re-recording of the CUDA graph when the buffers change for each runtime.
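A minimal sketch of that idea, with hypothetical names; the point is only that any buffer reallocation marks the captured graph as stale so it gets re-recorded:

```python
import torch

class OutputBuffersSketch:
    """Hypothetical: track whether a reallocation happened so the CUDA graph
    path knows it must re-record before the next replay."""

    def __init__(self) -> None:
        self.buffers = {}
        self.needs_graph_recapture = False

    def get_buffer(self, name: str, shape, dtype, device) -> torch.Tensor:
        buf = self.buffers.get(name)
        if buf is None or tuple(buf.shape) != tuple(shape):
            buf = torch.empty(shape, dtype=dtype, device=device)
            self.buffers[name] = buf
            self.needs_graph_recapture = True  # old addresses are no longer valid
        return buf
```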
if (it == buffers.end() || it->second.sizes() != shape) {
  buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
}
return buffers[tensorName].data_ptr();
You are paying the index cost twice here. Just return it->second in the else branch, and return a handle to the new empty tensor after you insert it in the first.
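In Python terms, the shape of the suggestion looks like this (generic sketch, not the project's code): keep the handle you already have instead of indexing the map a second time.

```python
def get_or_create(buffers: dict, name: str, make_buffer):
    buf = buffers.get(name)   # single lookup
    if buf is None:
        buf = make_buffer()   # create once
        buffers[name] = buf   # insert once
    return buf                # no second buffers[name] indexing
```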
std::vector<int64_t> shape = {static_cast<int64_t>(size)};
auto it = buffers.find(tensorName);
if (it == buffers.end() || it->second.sizes() != shape) {
  buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
How do you know what device to use?
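One plausible answer, sketched with hypothetical names: construct the allocator with an explicit device (taken from the engine or the input tensors) instead of defaulting to the current CUDA device.

```python
import torch

class DeviceAwareAllocatorSketch:
    """Hypothetical: the device is decided by the caller, not hard-coded."""

    def __init__(self, device: torch.device) -> None:
        self.device = device

    def allocate(self, nbytes: int) -> torch.Tensor:
        return torch.empty(nbytes, dtype=torch.uint8, device=self.device)
```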
std::vector<int64_t> shape = {static_cast<int64_t>(size)};
auto it = buffers.find(tensorName);
if (it == buffers.end() || it->second.sizes() != shape) {
  buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
As soon as this happens, the recapture flag needs to be set.
Some more changes should be removed:
core/runtime/TRTEngine.cpp
https://github.com/pytorch/TensorRT/pull/3276/files#diff-70fc185381a0909dbf7f1e62e5a8fb9704a105f2ffcdc8b35822ab28f758ce02R103
core/runtime/TRTEngine.h
pre_allocated states in struct TorchTRTRuntimeStates
https://github.com/pytorch/TensorRT/pull/3276/files#diff-3d6304b0f21f64bdf7867ceac346866d505619401e9c7b6a696cfe3bb567254eR37
docsrc/index.rst
https://github.com/pytorch/TensorRT/pull/3276/files#diff-8b1e4af58f12f81caed8b4870394f5f79f926a3a731eef74cedafda2f2ff2be3R89
) -> Tuple[bool, bool, bool]:
# Evaluates whether certain conditions are met to enable CUDA Graph recording or to use pre-allocated outputs
# based on the current and previous states, as well as input shape has changed
input_shape_changed: bool,
The same cleanup is required in TRTEngine.h.
Description
Revisit NonZero. This PR adds support for Data Dependent Shape (DDS) and the NonZero op.

Torch-TRT needs static output shapes to create output tensors in the runtime. However, for some ops with DDS, like NonZero, the output shape depends on the input data content, not just the input shape, so we need to propagate the shapes using ShapeProp (e.g. ShapeProp(fx_module).propagate(*torch_inputs)) and then pass the output shapes to the Torch-TRT runtime.

Fixes #2516
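For reference, a minimal, self-contained illustration of the ShapeProp call mentioned above, on a toy module (not this PR's code):

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) + 1.0

gm = symbolic_trace(TinyModel())
ShapeProp(gm).propagate(torch.randn(2, 3))  # run shape propagation with example inputs

# After propagation, each FX node carries a tensor_meta entry with its shape and dtype.
for node in gm.graph.nodes:
    meta = node.meta.get("tensor_meta")
    if meta is not None:
        print(node.name, tuple(meta.shape), meta.dtype)
```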
Type of change
Checklist: