-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loop Alignment support for Arm64 #60135
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsAdd support for loop alignment for Arm64. This is an extension of adaptive loop alignment work done for xarch in #44370 . The loop alignment is adaptive and the padding amount will be adjusted based on the number of blocks needed to fit the loop. Since the instruction encoding size for Arm64 is 4 bytes, 4
Benchmarks windows/arm64 diffs:
Detail diffs
Libraries windows/arm64 diffs:
Detail diffs
It is worth noting that because of fixed size encoding of arm64, very few methods benefit from loop alignment. For example, below is the diff for xarch. x64:
arm64:
Finally, as expected, the allocation size regression is same as code size regression we see above and there is no mismatch because there is we do not over-estimation the instruction sizes for arm64. I tried to measure performance for some of the benchmarks that asmdiff pointed out and I didn't notice any significant performance and stability difference (cc @TamarChristinaArm ), but the plan is to merge this in and monitor the perf lab to see if it impacts benchmarks over the time. If we see that it adversely affects the benchmarks, we will revert this feature for arm64.
|
025c2fe
to
1569b59
Compare
@dotnet/jit-contrib |
From the private perf run (thanks to @DrewScoggins), all the benchmarks flagged are noisy and there is not an obvious regression from this work. Once this PR is merged, we will continue monitoring the stability of benchmarks. |
INS_OPTS_D_TO_H // Double to Half | ||
|
||
#if FEATURE_LOOP_ALIGN | ||
, INS_OPTS_ALIGN // Align instruction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you can have an INS_align with INS_OPTS_NONE that indicates the align instruction is ignored? In which case, shouldn't the comment say, "Align instruction that will be used (not ignored)"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you just use INS_align without any special INS_OPTS to mean "align 4 bytes", and munge it to INS_nop if you decide not to use it for alignment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is there is a real INS_nop
in arm64 that we use it occasionally and it will be hard to distinguish at few places if the INS_nop
is from the alignment or something else. With INS_OPTS
it becomes easier because all the way through the instruction remain ins_align
.
Addressed all your feedback. Please take another look. |
Add support for loop alignment for Arm64. This is an extension of adaptive loop alignment work done for xarch in #44370 .
The loop alignment is adaptive and the padding amount will be adjusted based on the number of blocks needed to fit the loop. Since the instruction encoding size for Arm64 is 4 bytes, 4
NOP
will be added for a loop that can fit in single chunk of 32 bytes, and the padding amount will reduce as the loop size increases.Benchmarks windows/arm64 diffs:
Detail diffs
Libraries windows/arm64 diffs:
Detail diffs
It is worth noting that because of fixed size encoding of arm64, very few methods benefit from loop alignment. For example, below is the diff for xarch.
x64:
arm64:
Finally, as expected, the allocation size regression is same as code size regression we see above and there is no mismatch because there is we do not over-estimation the instruction sizes for arm64.
I tried to measure performance for some of the benchmarks that asmdiff pointed out and I didn't notice any significant performance and stability difference (cc @TamarChristinaArm ), but the plan is to merge this in and monitor the perf lab to see if it impacts benchmarks over the time. If we see that it adversely affects the benchmarks, we will revert this feature for arm64.