Fixed batchnorm bug #3170

cehongwang · 2024-09-20T20:44:13Z

Description

Batch norm value trace fails when the initial weights are identical. This pull request fixes the batch norm fast refit case by using a name trace.
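
The failure mode can be illustrated with a small sketch. It does not use the Torch-TensorRT code: `match_by_value` and the weight names are hypothetical, and plain tuples stand in for tensors. When two state-dict entries hold identical initial values, a value-based lookup cannot tell them apart:

```python
# Hypothetical state dict: both batch norm layers start from the same initial weights.
state_dict = {
    "bn1.weight": (1.0, 1.0, 1.0),
    "bn2.weight": (1.0, 1.0, 1.0),
    "conv.weight": (0.3, -0.2, 0.5),
}


def match_by_value(engine_weight, state_dict):
    """Return every state-dict key whose values equal the engine weight."""
    return [name for name, values in state_dict.items() if values == engine_weight]


# A unique value maps back to exactly one name.
unique = match_by_value((0.3, -0.2, 0.5), state_dict)

# Identical initial weights map back to two candidates, so the trace is ambiguous.
ambiguous = match_by_value((1.0, 1.0, 1.0), state_dict)

print(unique)
print(ambiguous)
```

Matching on names instead of values avoids this ambiguity, which is presumably why the fix switches to a name trace for batch norm.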

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Checklist:

My code follows the style guidelines of this project (You can use the linters)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR so that relevant reviewers are notified

narendasan

@zewenli98 what are your thoughts here, seems like a simple enough change. @cehongwang after this is in would there be any other failure modes for fast refit?

zewenli98

Is this PR still a draft or ready for review (rebase to main is needed)?

zewenli98 · 2024-10-01T22:08:53Z

py/torch_tensorrt/dynamo/conversion/impl/normalization/ops.py

+        else get_trt_tensor(ctx, weight, f"{name}_weight")
+    )
+    bias = (
+        get_trt_tensor(ctx, 1.0, f"{name}_bias")


Should this be 0?

zewenli98 · 2024-10-01T22:09:37Z

py/torch_tensorrt/dynamo/conversion/impl/normalization/ops.py

+        else get_trt_tensor(ctx, bias, f"{name}_bias")
+    )
+    running_mean = (
+        get_trt_tensor(ctx, 1.0, f"{name}_running_mean")


Should this be 0?
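
For context on the two questions above: at inference, batch norm computes `(x - running_mean) / sqrt(running_var + eps) * weight + bias`, and PyTorch's `BatchNorm` modules initialize weight to 1, bias to 0, running mean to 0, and running variance to 1. A minimal pure-Python sketch, using the hypothetical helper `batch_norm_scalar` rather than the converter code, shows how a default of 1.0 for bias or running mean shifts the output:

```python
import math


def batch_norm_scalar(x, weight=1.0, bias=0.0, running_mean=0.0, running_var=1.0, eps=1e-5):
    """Inference-mode batch norm for a single value (illustrative helper, not library code)."""
    return (x - running_mean) / math.sqrt(running_var + eps) * weight + bias


x = 2.0
neutral = batch_norm_scalar(x)                         # stays close to x
shifted_bias = batch_norm_scalar(x, bias=1.0)          # output offset by +1
shifted_mean = batch_norm_scalar(x, running_mean=1.0)  # input shifted by -1 before scaling

print(neutral, shifted_bias, shifted_mean)
```

With 0 as the fallback for bias and running mean, a missing parameter leaves the input essentially unchanged; with 1.0 it does not.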

zewenli98 · 2024-10-01T22:15:03Z

> @zewenli98 what are your thoughts here, seems like a simple enough change. @cehongwang after this is in would there be any other failure modes for fast refit?

Since batchnorm was refactored previously, renaming may work for fast refit.

cehongwang · 2024-10-01T22:18:09Z

I am setting up the naming trace for fast refit by giving each TRT weight the same name it has in the state dict. This is still a draft because I am having trouble installing the code linter on my current computer. The code runs and passes the tests, though. I will convert it to a PR after fixing the lint.
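
As a rough sketch of the idea (not the actual `_save_weight_mapping` logic), suppose each engine weight carries the same name as its state-dict entry. The mapping then becomes a direct key lookup that stays unambiguous even when values coincide. The helper `map_by_name` and all names below are hypothetical:

```python
def map_by_name(engine_weight_names, state_dict):
    """Map each engine weight name to the state-dict entry with the same key.

    Returns (mapping, unmatched). Unmatched names would need another strategy,
    such as a value-based comparison.
    """
    mapping, unmatched = {}, []
    for name in engine_weight_names:
        if name in state_dict:
            mapping[name] = state_dict[name]
        else:
            unmatched.append(name)
    return mapping, unmatched


state_dict = {
    "bn1.weight": (1.0, 1.0),
    "bn2.weight": (1.0, 1.0),  # same values as bn1.weight
}

mapping, unmatched = map_by_name(
    ["bn1.weight", "bn2.weight", "fused.constant"], state_dict
)
print(sorted(mapping), unmatched)
```

Names that have no state-dict counterpart are reported separately so that a fallback can handle them.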

cehongwang · 2024-10-01T22:21:04Z

> @zewenli98 what are your thoughts here, seems like a simple enough change. @cehongwang after this is in would there be any other failure modes for fast refit?

I will make similar changes to other applicable converters to make the naming trace more robust. There should not be other failures after that, because we have the value trace to guarantee correctness.

zewenli98 · 2024-10-02T04:09:51Z

@cehongwang @narendasan Can you guys take a look at issue #3200, which may be related to this PR?

zewenli98

LGTM

narendasan

Rebase and lgtm

narendasan · 2024-10-11T22:42:16Z

examples/dynamo/refit_engine_example.py

@@ -69,7 +69,8 @@
    debug=debug,
    min_block_size=min_block_size,
    torch_executed_ops=torch_executed_ops,
-    make_refittable=True,
+    make_refitable=True,


Should be 2 t's now (`make_refittable`); might want to rebase this branch

narendasan · 2024-10-11T22:43:24Z

py/torch_tensorrt/dynamo/conversion/_TRTInterpreter.py

@@ -477,12 +477,18 @@ def _save_weight_mapping(self) -> None:
                    # Retrieve each weight name(s) in state_dict
                    if layer_type == "CONSTANT":
                        if "embedding" in suffix:
-                            sd_weight_name = f"{sd_weight_name}.{torch_attr[0]}"


@zewenli98 keep track of this. It seems like there could be a lot of possible names we need to have, so we might want to look at a generic solution later.

stale

narendasan

LGTM!

Fixed batchnorm bug

414d972

facebook-github-bot added the cla signed label Sep 20, 2024

github-actions bot added the component: conversion, component: converters, component: api [Python], and component: dynamo labels Sep 20, 2024

github-actions bot requested a review from apbose September 20, 2024 20:44

cehongwang marked this pull request as draft September 20, 2024 20:46

github-actions bot added the component: tests label Sep 20, 2024

cehongwang requested review from narendasan and zewenli98 September 20, 2024 21:01

cehongwang force-pushed the refit-name-trace branch from be7eeb6 to d3b2c04 Compare September 20, 2024 21:24

narendasan reviewed Oct 1, 2024

zewenli98 reviewed Oct 1, 2024

Changed test cases

2734ebc

cehongwang force-pushed the refit-name-trace branch from d3b2c04 to 2734ebc Compare October 11, 2024 21:39

Merge branch 'main' into refit-name-trace

d219072

lint

69c0fc1

zewenli98 marked this pull request as ready for review October 11, 2024 22:13

zewenli98 requested review from narendasan and zewenli98 October 11, 2024 22:26

zewenli98 approved these changes Oct 11, 2024

narendasan reviewed Oct 11, 2024

fix typos

c077dde

zewenli98 requested a review from narendasan October 11, 2024 23:18

narendasan approved these changes Oct 14, 2024

zewenli98 merged commit 6303068 into main Oct 14, 2024
69 checks passed
