JIT: Limit 3-opt to 1000 swaps per run #112259

amanasifkhalid · 2025-02-07T04:44:05Z

Fixes #111988.

amanasifkhalid · 2025-02-07T04:50:17Z

/azp run runtime-coreclr jitstress-random

azure-pipelines · 2025-02-07T04:50:27Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2025-02-07T17:11:59Z

Seems like the jitstress-random failures are unrelated, but can you dig in?

I'm ok with having a limit but am still curious exactly what was happening in the example, it looked like it had found a cyclic sequence of very profitable moves, which should not be possible. Was the profile inconsistent? Were we saturating/overflowing computations?

amanasifkhalid · 2025-02-07T17:56:57Z

Seems like the jitstress-random failures are unrelated, but can you dig in?

Sure. arm64 legs are hitting #112278 in SVE tests; Cobalt machines were just turned on in CI, so I expect we'll see more bugs come up. The remaining failure is tracked by #112281, and might be bad codegen. Neither look related to 3-opt.

Was the profile inconsistent? Were we saturating/overflowing computations?

Both are true in this case, but the latter is problematic (and I suppose we cannot enforce profile consistency when the latter is true). Some loop bodies in the method have weights exceeding BB_MAX_WEIGHT, and summing weights this large seems to be breaking floating-point comparisons such that we get stuck moving the same three ranges over and over:

Swapping partitions [BB965, BB965] and [BB966, BB964] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB966, BB966] and [BB967, BB965] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB967, BB964] and [BB965, BB966] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB965, BB965] and [BB966, BB964] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB966, BB966] and [BB967, BB965] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB967, BB964] and [BB965, BB966] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
etc.

Those block ranges in particular have flow exceeding BB_MAX_WEIGHT. I suppose we could skip 3-opt altogether if we encounter weights over a certain size under the assumption that they would break the cost model, but I think an iteration limit is more robust.

it looked like it had found a cyclic sequence of very profitable moves, which should not be possible.

FWIW, in Debug builds, 3-opt recomputes the overall layout cost after each swap, and asserts that the move improved the cost. For excessively large costs, we skip the assertion to avoid false alarms from floating-point imprecision, and for profiles of "normal" size, I've yet to see this assert fail, so I haven't seen 3-opt's greediness invariant fail yet.

AndyAyersMS · 2025-02-07T18:27:06Z

You could imagine tempering profile synthesis so that in deep loop nests we start to curtail $C_p$ more dramatically. Right now we allow $C_p$ to be as large as $0.999$ per loop, so each loop can amplify counts $1000$ times:

$$ A = 1 / (1-C_p) = 1 / (1 - 0.999) = 1000 $$

And this compounds in a nest; if say we have $N$ nested loops the maximum overall amplification would be

$$ A_{N} = (1 / (1-C_p)) ^ N = 1000^N = 10^{3N}$$

But I'm not sure of the best way to do this, likely we want to keep the inner loop $C_p$ high and have the outer loop $C_p$ drop off.... maybe we can limit overall $A_N$ to $10^{12}$ or something.

(not sure if in the example above we are using PGO and/or synthesis, but the long term plan is that we'll always have these).

amanasifkhalid · 2025-02-07T20:45:47Z

/ba-g blocked by test timeout

amanasifkhalid · 2025-02-07T20:56:49Z

(not sure if in the example above we are using PGO and/or synthesis, but the long term plan is that we'll always have these).

In this case, STRESS_BB_PROFILE exercises synthesis with random profile data, but even without profile stress, I imagine synthesis would produce similarly extreme results with the sheer amount of nesting in this method:

ccp: BB570 :: 1.0 (header)
ccp: BB05 :: 0.5
ccp: BB568 :: 1 (nested header)
ccp: BB06 :: 0.5
ccp: BB566 :: 5 (nested header)
ccp: BB07 :: 4.5
ccp: BB564 :: 45 (nested header)
...
ccp: BB198 :: 4.110983e+70 (nested header)
ccp: BB191 :: 2.055492e+70
ccp: BB196 :: 2.055492e+71 (nested header)
ccp: BB192 :: 1.849942e+71
ccp: BB194 :: 3.699885e+71 (nested header)

Your proposal to limit amplification sounds reasonable, at least from a layout perspective. Since the initial block layout keeps loop bodies compact, I don't think inner loop weights have to be amplified all that much to dissuade 3-opt from breaking them up.

* main: (41 commits) Automated bump of chrome version (dotnet#112309) Add `GetDeclaringType` to `PropertyDefinition` and `EventDefinition`. (dotnet#111646) Update the System.ComponentModel.Annotations solution to build in VS (dotnet#112313) JIT: initial support for stack allocating arrays of GC type (dotnet#112250) [main] Update dependencies from dotnet/roslyn (dotnet#112260) Update Xcode casing (dotnet#112307) update the location of assert for REG_ZR check (dotnet#112294) Enable `SA1206`: Keyword ordering (dotnet#112303) Address feedback on dense FrozenDictionary optimization (dotnet#112298) Start regular pri-1 tests runs with native AOT (dotnet#111391) Observe exceptions from _connectionCloseTcs (dotnet#112190) Test failure - SendAsync_RequestVersion20_ResponseVersion20 (dotnet#112232) Fix init race in mono_class_try_get_[shortname]_class. (dotnet#112282) Remove repeated call to DllMain (dotnet#112285) Replace bitvector.h/cpp with ptrArgTP type in gc_unwind_x86.h/inl (dotnet#112268) JIT: Limit 3-opt to 1000 swaps per run (dotnet#112259) [main] Update dependencies from dotnet/icu, dotnet/runtime-assets (dotnet#112120) Update dependencies from https://github.com/dotnet/emsdk build 20250205.3 (dotnet#112223) Fix EventPipe on Android CoreClr. (dotnet#112270) Fix exception handling in the prestub worker (dotnet#111937) ...

Limit 3-opt to 1000 swaps

98075b0

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 7, 2025

dotnet-policy-service bot assigned amanasifkhalid Feb 7, 2025

This was referenced Feb 7, 2025

JIT: Assertion failed 'isGeneralRegisterOrSP(reg2)' in SVE tests #112278

Closed

JIT: Test failed global::JitTest_castclass_ldloc_cs.TestClass.TestEntryPoint #112281

Closed

AndyAyersMS approved these changes Feb 7, 2025

View reviewed changes

amanasifkhalid merged commit 5f652c2 into dotnet:main Feb 7, 2025
124 of 131 checks passed

amanasifkhalid deleted the 3-opt-iter-limit branch February 7, 2025 20:46

build-analysis bot mentioned this pull request Feb 7, 2025

System.Numerics.Tensors.Tests.ConvertTests.ConvertChecked failing with System.OverflowException #112286

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Limit 3-opt to 1000 swaps per run #112259

JIT: Limit 3-opt to 1000 swaps per run #112259

amanasifkhalid commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025

azure-pipelines bot commented Feb 7, 2025

AndyAyersMS commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025

AndyAyersMS commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025

JIT: Limit 3-opt to 1000 swaps per run #112259

JIT: Limit 3-opt to 1000 swaps per run #112259

Conversation

amanasifkhalid commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025

azure-pipelines bot commented Feb 7, 2025

AndyAyersMS commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025

AndyAyersMS commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025

amanasifkhalid commented Feb 7, 2025