Change the default loop unroll limit to 4 #80353
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details

Today, the JIT defaults to 1 unless the upper limit is a Vector###<T>.Count property, in which case it will try to unroll provided the method body isn't "too large". This means that devs can write a slightly more obscure pattern to get loop unrolling if truly desired, but the more natural pattern will not light up.

The AMD and Intel software optimization manuals all recommend unrolling small loops. However, the limits and rules for when to unroll differ between them.

This PR updates the default for the JIT to 4. The reason 4 is chosen is because it gives some improvements to various functions and code paths without significantly increasing code size. Likewise, it has some level of significance to both the JIT and the common code users will write in perf-sensitive scenarios:
- a Vector4-like type (4x T) or an optimized Matrix4x4 (4x Vector4<T>), which are heavily used in graphics and UI stacks
- Vector128<T> (this is the common SIMD type and the natural size .NET uses for indexing and most other "sizes" in managed)

I ran perf numbers for the benchmarks that had PMI diffs and we have a few that are unchanged and a few that are faster. I expect this may require some tweaking based on what we see in the actual perf runs, but we are early in the cycle and have plenty of time to react or back out the change if necessary.
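As a rough illustration of the two loop shapes described above (a hedged sketch; the method names are placeholders, not code from this PR), the first loop is already an unroll candidate today because its bound is Vector128<int>.Count, while the second, more natural constant bound only becomes a candidate with the new default of 4:

```csharp
using System.Runtime.Intrinsics;

public static class UnrollShapes
{
    // Bounded by Vector128<int>.Count: today's JIT already considers unrolling this,
    // provided the resulting body isn't "too large".
    public static int SumLanesCountBound(Vector128<int> value)
    {
        int sum = 0;
        for (int i = 0; i < Vector128<int>.Count; i++)
        {
            sum += value.GetElement(i);
        }
        return sum;
    }

    // Bounded by a plain constant: the more natural way to write the same loop,
    // which only becomes an unroll candidate once the default limit is raised to 4.
    public static int SumLanesConstantBound(Vector128<int> value)
    {
        int sum = 0;
        for (int i = 0; i < 4; i++)
        {
            sum += value.GetElement(i);
        }
        return sum;
    }
}
```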
For reference, here are some diffs with other values of DEFAULT_UNROLL_LOOP_MAX_ITERATION_COUNT:
- DEFAULT_UNROLL_LOOP_MAX_ITERATION_COUNT: 2 (Frameworks, Benchmarks)
- DEFAULT_UNROLL_LOOP_MAX_ITERATION_COUNT: 4 (Frameworks, Benchmarks)
- DEFAULT_UNROLL_LOOP_MAX_ITERATION_COUNT: 6 (Frameworks, Benchmarks)
- DEFAULT_UNROLL_LOOP_MAX_ITERATION_COUNT: 8 (Frameworks, Benchmarks)
/azp list
The SuperPMI diffs in CI are a lot larger than what I got locally, particularly in tests, and there is a small TP impact as well. CC @BruceForstall: I would appreciate some input on whether you think this is something worth moving forward on given where all the loop code is today. It is worth noting that most of the tests are HWIntrinsic tests where the code is explicitly doing constant-count loops that emulate the actual instruction functionality for minimal correctness verification.
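For context, here is a hedged sketch of the shape those HWIntrinsic test loops take; the names and the specific intrinsic (AdvSimd.Add) are illustrative, not taken from the actual generated tests:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class HwIntrinsicTestSketch
{
    // Emulates the instruction lane by lane with a constant-count loop so the
    // intrinsic's result can be checked for minimal correctness. With the default
    // unroll limit raised to 4, loops of exactly this shape now get unrolled.
    public static bool ValidateAdd(Vector128<float> left, Vector128<float> right)
    {
        if (!AdvSimd.IsSupported)
        {
            return true; // nothing to validate on non-ARM64 hardware
        }

        Vector128<float> actual = AdvSimd.Add(left, right);

        for (int i = 0; i < 4; i++)
        {
            if (actual.GetElement(i) != (left.GetElement(i) + right.GetElement(i)))
            {
                return false;
            }
        }

        return true;
    }
}
```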
My summary: it only modestly kicks in on most collections, except for coreclr_tests where it is mostly in the templatized HW Intrinsics tests. E.g., it hits in […].

I'd like to see this change checked in, especially early in the cycle where we can get perf numbers early. Choosing […].

cc @dotnet/jit-contrib
Merging in latest main and rerunning CI since the general direction seems favorable/positive.

Definitely needs to run stress tests before the PR gets merged.
Are there examples where we know this extra unrolling will make a big impact on perf?
@AndyAyersMS not in our own code. Most of our own perf-critical code is explicitly using vectors over unknown-length inputs.
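For contrast, here is a hedged sketch (not actual framework code; the method is illustrative) of that pattern: an explicitly vectorized loop over an input of unknown length, where the trip count is not a JIT-time constant and so the new default unroll limit has no effect.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class VectorizedSumSketch
{
    // The trip count depends on span.Length, which is unknown at JIT time, so the
    // constant-count unroller never applies to the main loop.
    public static int Sum(ReadOnlySpan<int> span)
    {
        Vector128<int> acc = Vector128<int>.Zero;
        int i = 0;

        for (; i <= span.Length - Vector128<int>.Count; i += Vector128<int>.Count)
        {
            acc += Vector128.Create(span.Slice(i, Vector128<int>.Count));
        }

        int sum = Vector128.Sum(acc);

        // Scalar tail for the remaining elements.
        for (; i < span.Length; i++)
        {
            sum += span[i];
        }

        return sum;
    }
}
```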
… unrolling and large amounts of dead code elimination
Fixed the one failure, which was a change in the generated code that impacted the new "codegen verification" test feature. The particular tests were impacted since they are getting unrolled now, and that also leads to some large dead code elimination, which meant that verifying the codegen would have been very difficult.

That also calls out that there is an existing missing loop optimization around converting something like this to not loop and just assign:

for (var i = 0; i < 4; i++)
{
    result = AdvSimd.CompareEqual(left, asVar);
}

That is, recognizing when the loop is doing throwaway or non-side-effecting work if executed more than once.

@BruceForstall, do you know if we have an existing issue tracking that?
CC @dotnet/jit-contrib, this should be ready for review. If we can get it reviewed/merged before EOD tomorrow, it will make the first preview and give us the greatest amount of time for feedback, etc.
Not that I'm aware of. Looks like we should be able to hoist this loop-invariant code, leading to an empty loop.
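A hedged sketch, in C# source form, of the staged cleanup described above (the JIT would perform this on its own IR, not on source; the helper methods and types here are purely illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class HoistSketch
{
    // Original shape from the test: the body is loop-invariant and has no side
    // effects, so every iteration computes and stores the same value.
    public static Vector128<float> Original(Vector128<float> left, Vector128<float> asVar)
    {
        Vector128<float> result = default;
        for (var i = 0; i < 4; i++)
        {
            result = AdvSimd.CompareEqual(left, asVar);
        }
        return result;
    }

    // Step 1: hoist the invariant computation out of the loop; an empty loop remains.
    public static Vector128<float> AfterHoisting(Vector128<float> left, Vector128<float> asVar)
    {
        Vector128<float> result = AdvSimd.CompareEqual(left, asVar);
        for (var i = 0; i < 4; i++)
        {
            // empty body
        }
        return result;
    }

    // Step 2: remove the now-empty loop; only the single assignment is left.
    public static Vector128<float> AfterLoopRemoval(Vector128<float> left, Vector128<float> asVar)
    {
        return AdvSimd.CompareEqual(left, asVar);
    }
}
```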
// 'asVar' should be propagated.
// 'AdvSimd.CompareEqual' should be hoisted out of the loops.
// ARM64-FULL-LINE: fcmeq {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, #0.0
// ARM64: blt
// ARM64-NOT: fcmeq
This was actually a test added quite a bit before DisasmCheck existed, and it covers specific containment functionality: #33972. The disasm checks were added later since it represents a good candidate for validating that the containment functionality is actually working as expected.
In this particular case, we could probably cover it as something like:
// ARM64-FULL-LINE: fcmeq {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, #0.0
// ARM64-NOT: blt
// ARM64-NOT: fcmeq
But the combination of loop unrolling + dead code optimization makes it a case that's a bit difficult to test and to reason about.
I think we're fine not having the coverage for these two subcases in particular, given all the other disasm coverage of the scenario in the same file, but I'd be fine with adding the above lines if people would prefer they remain.
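For reference, here is a hedged guess at the containment scenario the original test (#33972) covers, based on the fcmeq-against-#0.0 check above: comparing against Vector128<float>.Zero should keep the zero operand contained so a single compare-against-zero instruction is emitted. The method below is illustrative, not copied from the test file.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class ContainmentSketch
{
    // The zero operand should be contained, so ARM64 codegen can emit
    //     fcmeq v0.4s, v1.4s, #0.0
    // instead of materializing a zero vector in a register and comparing against it.
    public static Vector128<float> CompareToZero(Vector128<float> value)
    {
        return AdvSimd.CompareEqual(value, Vector128<float>.Zero);
    }
}
```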
HasDisasmCheck adds disasm checking in addition to execution. (In fact, as it is, it requires execution because we get the disasm via normal execution plus DOTNET_JitDisasm.)

Loop logic is difficult to test in a line-based system. The existence of fcmeq is probably covered reasonably by the other tests, so this one is a question of whether the trouble/brittleness is worth it to verify that intended loop transformations have occurred (and it sounds like the answer is 'no').
HasDisasmCheck adds disasm checking in addition to execution

That's good to hear. However, there's an unfortunate wrinkle: it appears that all HasDisasmCheck tests also must set:

DOTNET_TieredCompilation=0
DOTNET_JITMinOpts=0

so those tests don't get alternative stress mode configurations run.

(This is a subject for another issue, not this PR, of course.)
@@ -9424,7 +9424,7 @@
#define DEFAULT_MAX_LOOPSIZE_FOR_ALIGN DEFAULT_ALIGN_LOOP_BOUNDARY * 3

// By default only single iteration loops will be unrolled
Does this comment need updating?
Yes, oops. Totally missed the comment. Will get a fix up.