Support Data Dependent Shape (DDS) and NonZero op #3364

Open
zewenli98 wants to merge 4 commits into main from support_dds_and_nonzero

Conversation

zewenli98 (Collaborator)

Description

Revisit NonZero. This PR adds support for Data Dependent Shape (DDS) and the NonZero op.

Torch-TRT needs static output shapes to create output tensors in the runtime. However, for ops with data-dependent shapes (DDS), such as NonZero, the output shape depends on the content of the input data, not just the input shape. We therefore propagate shapes with ShapeProp (e.g., ShapeProp(fx_module).propagate(*torch_inputs)) and then pass the resulting output shapes to the Torch-TRT runtime.
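
As a rough, self-contained illustration (not the PR's actual compiler code), shape propagation over an FX graph with concrete example inputs records a data-dependent output shape for a nonzero node:

import torch
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

def f(x: torch.Tensor) -> torch.Tensor:
    return torch.nonzero(x)

fx_module = symbolic_trace(f)
torch_inputs = (torch.tensor([0, 1, 0, 2]),)

# Run the graph with real inputs and attach tensor metadata to each node.
ShapeProp(fx_module).propagate(*torch_inputs)

for node in fx_module.graph.nodes:
    meta = node.meta.get("tensor_meta")
    if meta is not None:
        # The nonzero node's shape here is (2, 1): it depends on the input
        # values, not just on the input shape.
        print(node.name, tuple(meta.shape))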

Fixes #2516

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

zewenli98 self-assigned this on Jan 23, 2025
github-actions bot added labels on Jan 23, 2025: component: tests, component: conversion, component: converters, component: api [Python], component: runtime, component: dynamo
github-actions bot requested a review from apbose on January 23, 2025 18:30
zewenli98 force-pushed the support_dds_and_nonzero branch from 2ee9299 to 3b60296 on January 24, 2025 06:56
narendasan (Collaborator) left a comment:

LGTM

@@ -524,12 +537,26 @@ def forward(self, *inputs: torch.Tensor) -> torch.Tensor | Tuple[torch.Tensor, .

self._caller_stream.wait_stream(self._engine_stream)

if self.use_pre_allocated_outputs:
Collaborator:

You might want to check with @keehyuna on how this change would affect the use_pre_allocated_outputs optimization that he implemented?

Collaborator (Author):

Yeah, I think we can directly remove the pre_allocated_outputs-related arguments because we don't have to manually allocate memory for outputs. Instead, the reallocate_output_async function will take care of everything and automatically reuse pre-allocated outputs if possible. @keehyuna Can you take a look?
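
(For illustration only, here is a hedged sketch of that idea, mirroring the reallocation logic quoted later in this review rather than the PR's actual reallocate_output_async implementation: reuse a cached output buffer when the requested size is unchanged, otherwise allocate a new one.)

import torch

class OutputBufferCache:
    # Hypothetical helper, not the PR's class: keeps per-output GPU buffers
    # and reuses them across runs when the requested size matches.
    def __init__(self, dtypes):
        self.dtypes = dtypes   # assumed mapping: output name -> torch dtype
        self.buffers = {}

    def reallocate(self, tensor_name, size):
        shape = (int(size),)
        buf = self.buffers.get(tensor_name)
        if buf is None or tuple(buf.shape) != shape:
            # First call or size changed: allocate; otherwise the old buffer is reused.
            buf = torch.empty(shape, dtype=self.dtypes[tensor_name], device="cuda")
            self.buffers[tensor_name] = buf
        return buf.data_ptr()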

Collaborator:

Yes, pre_allocated_outputs is not required if the allocation is properly reused. I tried to check the behavior, but there was a "Segmentation fault" during build_serialized_network().

Collaborator:

I added nvtx events in PythonTorchTensorRTModule:TensorRTRuntime and reallocate_output_async.
They run concurrently when I test a ResNet model. It seems pre_allocated_outputs is not required and async output allocation will be beneficial in most cases. I will check perf in more cases. I think this PR can go ahead since use_pre_allocated_outputs is disabled by default; I will revert the pre-allocated output feature after some more testing.

BTW, will the cpp runtime also have async output allocation?
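
(For reference, a minimal sketch of how such nvtx ranges can be added for profiling in Nsight Systems; the range names are illustrative, not the exact instrumentation used here.)

import torch

torch.cuda.nvtx.range_push("TensorRTRuntime")
# ... enqueue the TensorRT execution context on the engine stream ...
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("reallocate_output_async")
# ... output allocator callback resizes or reuses the output buffers ...
torch.cuda.nvtx.range_pop()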

Collaborator (Author):

Yeah, I'm working on the cpp runtime, but it may take a couple of days.

Collaborator (Author):

Hi @keehyuna, I have updated the cpp runtime. Can you confirm that pre_allocated_outputs is no longer needed?

Collaborator:

Hi @zewenli98, I checked the cpp runtime; reallocateOutputAsync and the ExecutionContext run concurrently as well. I will clean up the remaining pre_allocated_outputs code.

Collaborator (Author):

@keehyuna Thanks for confirming. By the way, I have already cleaned up some of the pre_allocated_outputs-related code. Can you take a look and check whether that's complete? If anything else should be removed, please let me know!

zewenli98 force-pushed the support_dds_and_nonzero branch from 2846ef7 to 605bba0 on February 4, 2025 08:16
github-actions bot added the component: core label on Feb 4, 2025
github-actions bot left a comment:

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py	2025-02-04 08:16:20.457716+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_PythonTorchTensorRTModule.py	2025-02-04 08:16:43.664304+00:00
@@ -546,11 +546,15 @@
                for o, output_name in enumerate(self.output_names):
                    assert self.output_allocator is not None
                    shape = self.output_allocator.shapes.get(output_name, None)
                    self.output_shapes[o] = shape
                    dtype = self.output_dtypes[o]
-                    output = self.output_allocator.buffers.get(output_name, None).clone().detach()
+                    output = (
+                        self.output_allocator.buffers.get(output_name, None)
+                        .clone()
+                        .detach()
+                    )
                    prod = int(torch.prod(torch.tensor(shape)))
                    output = output.reshape(-1).view(dtype)[:prod].reshape(shape)
                    outputs.append(output)

            if len(outputs) == 1:
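
(To illustrate what the snippet above does, here is a standalone sketch assuming the allocator hands back a flat, possibly over-sized raw buffer: the buffer is cloned, reinterpreted as the output dtype, trimmed to the number of valid elements, and reshaped to the data-dependent shape reported by the allocator.)

import torch

raw_buffer = torch.zeros(64, dtype=torch.uint8)   # assumed: 64-byte flat buffer, larger than needed
shape = (3, 2)                                    # shape reported by the allocator
dtype = torch.float32

prod = int(torch.prod(torch.tensor(shape)))       # 6 valid elements
output = raw_buffer.clone().detach()
output = output.reshape(-1).view(dtype)[:prod].reshape(shape)
print(output.shape)                               # torch.Size([3, 2])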

zewenli98 requested a review from keehyuna on February 4, 2025 21:04
return outputs;
for (const auto& output_name : compiled_engine->out_binding_names) {
if (!compiled_engine->exec_ctx->setOutputAllocator(output_name.c_str(), compiled_engine->output_allocator.get())) {
throw std::runtime_error("Failed to set output allocator for " + output_name);
Collaborator:

Use the error utilities we already have, like TORCHTRT_THROW_ERROR.

"Error while setting the output tensor address");
}
}
setup_output_allocator(compiled_engine);
Collaborator:

Do we need to do this every time the engine is going to run?

} else if (outputs.size() > 0) {
current_device_id = outputs[0].device().index(); // Done this way to avoid a call to cudart
} else {
current_device_id = c10::cuda::current_device();
Collaborator:

This is going to call cudart, right? That will incur a significant cost.

auto name = compiled_engine->out_binding_names[i];
auto dims = compiled_engine->output_allocator->getShapes().at(name);
auto dtype = util::TRTDataTypeToScalarType(compiled_engine->exec_ctx->getEngine().getTensorDataType(name.c_str()));
at::Tensor output = compiled_engine->output_allocator->getBuffers().at(name).clone().detach();
Collaborator:

Can we not directly copy into the result buffers like before, instead of allocating new memory here? Or does it not make a difference?

}
output = output.reshape(-1).view(dtype).slice(0, 0, prod).reshape(dims_vec);
Collaborator:

why do we need this?

narendasan (Collaborator) left a comment:

My feeling is that the output allocator needs to trigger re-recording of the CUDA graph when the buffers change, for each runtime.

if (it == buffers.end() || it->second.sizes() != shape) {
buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
}
return buffers[tensorName].data_ptr();
Collaborator:

You are paying the lookup cost twice here; just return the existing buffer in the else branch, and return a handle to the newly inserted empty tensor in the first.

std::vector<int64_t> shape = {static_cast<int64_t>(size)};
auto it = buffers.find(tensorName);
if (it == buffers.end() || it->second.sizes() != shape) {
buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
Collaborator:

How do you know what device to use?

std::vector<int64_t> shape = {static_cast<int64_t>(size)};
auto it = buffers.find(tensorName);
if (it == buffers.end() || it->second.sizes() != shape) {
buffers[tensorName] = at::empty(shape, at::TensorOptions().dtype(dtypes.at(tensorName)).device(c10::kCUDA));
Collaborator:

As soon as this happens, the recapture flag needs to be set.

keehyuna (Collaborator) left a comment:

) -> Tuple[bool, bool, bool]:
# Evaluates whether certain conditions are met to enable CUDA Graph recording or to use pre-allocated outputs
# based on the current and previous states, as well as input shape has changed
input_shape_changed: bool,
Collaborator:

The same clean-up is required in TRTEngine.h.
