Support Data Dependent Shape (DDS) and NonZero op #3364

Open
zewenli98 wants to merge 4 commits into main from support_dds_and_nonzero

Conversation

zewenli98 (Collaborator)

Description

Revisit NonZero. This PR adds support for Data Dependent Shape (DDS) and the NonZero op.

Torch-TRT needs static output shapes to create output tensors in the runtime. However, for ops with data-dependent shapes (DDS), such as NonZero, the output shape depends on the content of the input data, not just the input shape. We therefore propagate shapes with ShapeProp (e.g., ShapeProp(fx_module).propagate(*torch_inputs)) and then pass the resulting output shapes to the Torch-TRT runtime.
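
As a rough, self-contained illustration (not the PR's actual compiler code), shape propagation over an FX graph with concrete example inputs records a data-dependent output shape for a nonzero node:

import torch
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

def f(x: torch.Tensor) -> torch.Tensor:
    return torch.nonzero(x)

fx_module = symbolic_trace(f)
torch_inputs = (torch.tensor([0, 1, 0, 2]),)

# Run the graph with real inputs and attach tensor metadata to each node.
ShapeProp(fx_module).propagate(*torch_inputs)

for node in fx_module.graph.nodes:
    meta = node.meta.get("tensor_meta")
    if meta is not None:
        # The nonzero node's shape here is (2, 1): it depends on the input
        # values, not just on the input shape.
        print(node.name, tuple(meta.shape))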

Fixes #2516

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

zewenli98 self-assigned this on Jan 23, 2025
github-actions bot added labels on Jan 23, 2025: component: tests, component: conversion, component: converters, component: api [Python], component: runtime, component: dynamo
github-actions bot requested a review from apbose on January 23, 2025 18:30
zewenli98 force-pushed the support_dds_and_nonzero branch from 2ee9299 to 3b60296 on January 24, 2025 06:56
narendasan (Collaborator) left a comment:

LGTM

@@ -524,12 +537,26 @@ def forward(self, *inputs: torch.Tensor) -> torch.Tensor | Tuple[torch.Tensor, .

self._caller_stream.wait_stream(self._engine_stream)

if self.use_pre_allocated_outputs:
Collaborator:

You might want to check with @keehyuna on how this change would affect the use_pre_allocated_outputs optimization that he implemented?

Collaborator (Author):

Yeah, I think we can directly remove the pre_allocated_outputs-related arguments because we don't have to manually allocate memory for outputs. Instead, the reallocate_output_async function will take care of everything and automatically reuse pre-allocated outputs if possible. @keehyuna Can you take a look?
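
(For illustration only, here is a hedged sketch of that idea, mirroring the reallocation logic quoted later in this review rather than the PR's actual reallocate_output_async implementation: reuse a cached output buffer when the requested size is unchanged, otherwise allocate a new one.)

import torch

class OutputBufferCache:
    # Hypothetical helper, not the PR's class: keeps per-output GPU buffers
    # and reuses them across runs when the requested size matches.
    def __init__(self, dtypes):
        self.dtypes = dtypes   # assumed mapping: output name -> torch dtype
        self.buffers = {}

    def reallocate(self, tensor_name, size):
        shape = (int(size),)
        buf = self.buffers.get(tensor_name)
        if buf is None or tuple(buf.shape) != shape:
            # First call or size changed: allocate; otherwise the old buffer is reused.
            buf = torch.empty(shape, dtype=self.dtypes[tensor_name], device="cuda")
            self.buffers[tensor_name] = buf
        return buf.data_ptr()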

Collaborator:

Yes, pre_allocated_outputs is not required if the allocation is properly reused. I tried to check the behavior, but there was a "Segmentation fault" during build_serialized_network().

Collaborator:

I added nvtx events in PythonTorchTensorRTModule:TensorRTRuntime and reallocate_output_async.
They run concurrently when I test a ResNet model. It seems pre_allocated_outputs is not required and async output allocation will be beneficial in most cases. I will check perf in more cases. I think this PR can go ahead since use_pre_allocated_outputs is disabled by default; I will revert the pre-allocated output feature after some more testing.

BTW, will the cpp runtime also have async output allocation?
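
(For reference, a minimal sketch of how such nvtx ranges can be added for profiling in Nsight Systems; the range names are illustrative, not the exact instrumentation used here.)

import torch

torch.cuda.nvtx.range_push("TensorRTRuntime")
# ... enqueue the TensorRT execution context on the engine stream ...
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("reallocate_output_async")
# ... output allocator callback resizes or reuses the output buffers ...
torch.cuda.nvtx.range_pop()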

Collaborator (Author):

Yeah, I'm working on the cpp runtime, but it may take a couple of days.

Collaborator (Author):

Hi @keehyuna, I have updated the cpp runtime. Can you confirm that pre_allocated_outputs is no longer needed?

Collaborator:

Hi @zewenli98, I checked the cpp runtime; reallocateOutputAsync and the ExecutionContext run concurrently as well. I will clean up the remaining pre_allocated_outputs code.

Collaborator (Author):

@keehyuna Thanks for confirming. By the way, I have already cleaned up some of the pre_allocated_outputs-related code. Can you take a look and check whether that's complete? If anything else should be removed, please let me know!

zewenli98 force-pushed the support_dds_and_nonzero branch from 2846ef7 to 605bba0 on February 4, 2025 08:16
github-actions bot added the component: core label on Feb 4, 2025
github-actions bot left a comment:

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py	2025-02-04 08:16:20.457716+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py	2025-02-04 08:16:43.664304+00:00
@@ -546,11 +546,15 @@
                for o, output_name in enumerate(self.output_names):
                    assert self.output_allocator is not None
                    shape = self.output_allocator.shapes.get(output_name, None)
                    self.output_shapes[o] = shape
                    dtype = self.output_dtypes[o]
-                    output = self.output_allocator.buffers.get(output_name, None).clone().detach()
+                    output = (
+                        self.output_allocator.buffers.get(output_name, None)
+                        .clone()
+                        .detach()
+                    )
                    prod = int(torch.prod(torch.tensor(shape)))
                    output = output.reshape(-1).view(dtype)[:prod].reshape(shape)
                    outputs.append(output)

            if len(outputs) == 1:
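
(To illustrate what the snippet above does, here is a standalone sketch assuming the allocator hands back a flat, possibly over-sized raw buffer: the buffer is cloned, reinterpreted as the output dtype, trimmed to the number of valid elements, and reshaped to the data-dependent shape reported by the allocator.)

import torch

raw_buffer = torch.zeros(64, dtype=torch.uint8)   # assumed: 64-byte flat buffer, larger than needed
shape = (3, 2)                                    # shape reported by the allocator
dtype = torch.float32

prod = int(torch.prod(torch.tensor(shape)))       # 6 valid elements
output = raw_buffer.clone().detach()
output = output.reshape(-1).view(dtype)[:prod].reshape(shape)
print(output.shape)                               # torch.Size([3, 2])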

zewenli98 requested a review from keehyuna on February 4, 2025 21:04
return outputs;
for (const auto& output_name : compiled_engine->out_binding_names) {
if (!compiled_engine->exec_ctx->setOutputAllocator(output_name.c_str(), compiled_engine->output_allocator.get())) {
throw std::runtime_error("Failed to set output allocator for " + output_name);
Collaborator:

Use the error utilities we already have, like TORCHTRT_THROW_ERROR.

"Error while setting the output tensor address");
}
}
setup_output_allocator(compiled_engine);
Collaborator:

Do we need to do this every time the engine is going to run?

} else if (outputs.size() > 0) {
current_device_id = outputs[0].device().index(); // Done this way to avoid a call to cudart
} else {
current_device_id = c10::cuda::current_device();
Collaborator:

This is going to call cudart, right? That will incur a significant cost.

auto name = compiled_engine->out_binding_names[i];
auto dims = compiled_engine->output_allocator->getShapes().at(name);
auto dtype = util::TRTDataTypeToScalarType(compiled_engine->exec_ctx->getEngine().getTensorDataType(name.c_str()));
at::Tensor output = compiled_engine->output_allocator->getBuffers().at(name).clone().detach();
Collaborator:

Can we not directly copy into the result buffers like before, instead of allocating new memory here? Or does it not make a difference?

}
output = output.reshape(-1).view(dtype).slice(0, 0, prod).reshape(dims_vec);
Collaborator:

why do we need this?

narendasan (Collaborator) left a comment:

My feeling is that the output allocator needs to trigger re-recording of the CUDA graph when the buffers change, for each runtime.

if (it == buffers.end() || it->second.sizes() != shape) {
buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
}
return buffers[tensorName].data_ptr();
Collaborator:

You are paying the lookup cost twice here; just return the existing buffer in the else branch, and return a handle to the newly inserted empty tensor in the first.

std::vector<int64_t> shape = {static_cast<int64_t>(size)};
auto it = buffers.find(tensorName);
if (it == buffers.end() || it->second.sizes() != shape) {
buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
Collaborator:

How do you know what device to use?

std::vector<int64_t> shape = {static_cast<int64_t>(size)};
auto it = buffers.find(tensorName);
if (it == buffers.end() || it->second.sizes() != shape) {
buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
Collaborator:

As soon as this happens, the recapture flag needs to be set.

keehyuna (Collaborator) left a comment:

) -> Tuple[bool, bool, bool]:
# Evaluates whether certain conditions are met to enable CUDA Graph recording or to use pre-allocated outputs
# based on the current and previous states, as well as input shape has changed
input_shape_changed: bool,
Collaborator:

The same clean-up is required in TRTEngine.h.
