Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Limit 3-opt to 1000 swaps per run #112259

Merged
merged 1 commit into from
Feb 7, 2025

Conversation

amanasifkhalid
Copy link
Member

Fixes #111988.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 7, 2025
@amanasifkhalid
Copy link
Member Author

/azp run runtime-coreclr jitstress-random

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member

Seems like the jitstress-random failures are unrelated, but can you dig in?

I'm ok with having a limit but am still curious exactly what was happening in the example, it looked like it had found a cyclic sequence of very profitable moves, which should not be possible. Was the profile inconsistent? Were we saturating/overflowing computations?

@amanasifkhalid
Copy link
Member Author

Seems like the jitstress-random failures are unrelated, but can you dig in?

Sure. arm64 legs are hitting #112278 in SVE tests; Cobalt machines were just turned on in CI, so I expect we'll see more bugs come up. The remaining failure is tracked by #112281, and might be bad codegen. Neither look related to 3-opt.

Was the profile inconsistent? Were we saturating/overflowing computations?

Both are true in this case, but the latter is problematic (and I suppose we cannot enforce profile consistency when the latter is true). Some loop bodies in the method have weights exceeding BB_MAX_WEIGHT, and summing weights this large seems to be breaking floating-point comparisons such that we get stuck moving the same three ranges over and over:

Swapping partitions [BB965, BB965] and [BB966, BB964] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB966, BB966] and [BB967, BB965] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB967, BB964] and [BB965, BB966] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB965, BB965] and [BB966, BB964] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB966, BB966] and [BB967, BB965] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
Swapping partitions [BB967, BB964] and [BB965, BB966] (cost change = -12259964326927110866866776217202473468949912977468817408.000000)
etc.

Those block ranges in particular have flow exceeding BB_MAX_WEIGHT. I suppose we could skip 3-opt altogether if we encounter weights over a certain size under the assumption that they would break the cost model, but I think an iteration limit is more robust.

it looked like it had found a cyclic sequence of very profitable moves, which should not be possible.

FWIW, in Debug builds, 3-opt recomputes the overall layout cost after each swap, and asserts that the move improved the cost. For excessively large costs, we skip the assertion to avoid false alarms from floating-point imprecision, and for profiles of "normal" size, I've yet to see this assert fail, so I haven't seen 3-opt's greediness invariant fail yet.

@AndyAyersMS
Copy link
Member

You could imagine tempering profile synthesis so that in deep loop nests we start to curtail $C_p$ more dramatically. Right now we allow $C_p$ to be as large as $0.999$ per loop, so each loop can amplify counts $1000$ times:

$$ A = 1 / (1-C_p) = 1 / (1 - 0.999) = 1000 $$

And this compounds in a nest; if say we have $N$ nested loops the maximum overall amplification would be

$$ A_{N} = (1 / (1-C_p)) ^ N = 1000^N = 10^{3N}$$

But I'm not sure of the best way to do this, likely we want to keep the inner loop $C_p$ high and have the outer loop $C_p$ drop off.... maybe we can limit overall $A_N$ to $10^{12}$ or something.

(not sure if in the example above we are using PGO and/or synthesis, but the long term plan is that we'll always have these).

@amanasifkhalid
Copy link
Member Author

/ba-g blocked by test timeout

@amanasifkhalid amanasifkhalid merged commit 5f652c2 into dotnet:main Feb 7, 2025
124 of 131 checks passed
@amanasifkhalid amanasifkhalid deleted the 3-opt-iter-limit branch February 7, 2025 20:46
@amanasifkhalid
Copy link
Member Author

(not sure if in the example above we are using PGO and/or synthesis, but the long term plan is that we'll always have these).

In this case, STRESS_BB_PROFILE exercises synthesis with random profile data, but even without profile stress, I imagine synthesis would produce similarly extreme results with the sheer amount of nesting in this method:

ccp: BB570 :: 1.0 (header)
ccp: BB05 :: 0.5
ccp: BB568 :: 1 (nested header)
ccp: BB06 :: 0.5
ccp: BB566 :: 5 (nested header)
ccp: BB07 :: 4.5
ccp: BB564 :: 45 (nested header)
...
ccp: BB198 :: 4.110983e+70 (nested header)
ccp: BB191 :: 2.055492e+70
ccp: BB196 :: 2.055492e+71 (nested header)
ccp: BB192 :: 1.849942e+71
ccp: BB194 :: 3.699885e+71 (nested header)

Your proposal to limit amplification sounds reasonable, at least from a layout perspective. Since the initial block layout keeps loop bodies compact, I don't think inner loop weights have to be amplified all that much to dissuade 3-opt from breaking them up.

grendello added a commit to grendello/runtime that referenced this pull request Feb 10, 2025
* main: (41 commits)
  Automated bump of chrome version (dotnet#112309)
  Add `GetDeclaringType` to `PropertyDefinition` and `EventDefinition`. (dotnet#111646)
  Update the System.ComponentModel.Annotations solution to build in VS (dotnet#112313)
  JIT: initial support for stack allocating arrays of GC type (dotnet#112250)
  [main] Update dependencies from dotnet/roslyn (dotnet#112260)
  Update Xcode casing (dotnet#112307)
  update the location of assert for REG_ZR check (dotnet#112294)
  Enable `SA1206`: Keyword ordering (dotnet#112303)
  Address feedback on dense FrozenDictionary optimization (dotnet#112298)
  Start regular pri-1 tests runs with native AOT (dotnet#111391)
  Observe exceptions from _connectionCloseTcs (dotnet#112190)
  Test failure - SendAsync_RequestVersion20_ResponseVersion20 (dotnet#112232)
  Fix init race in mono_class_try_get_[shortname]_class. (dotnet#112282)
  Remove repeated call to DllMain (dotnet#112285)
  Replace bitvector.h/cpp with ptrArgTP type in gc_unwind_x86.h/inl (dotnet#112268)
  JIT: Limit 3-opt to 1000 swaps per run (dotnet#112259)
  [main] Update dependencies from dotnet/icu, dotnet/runtime-assets (dotnet#112120)
  Update dependencies from https://github.com/dotnet/emsdk build 20250205.3 (dotnet#112223)
  Fix EventPipe on Android CoreClr. (dotnet#112270)
  Fix exception handling in the prestub worker (dotnet#111937)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT/jit64/opt/rngchk/RngchkStress2_o/RngchkStress2_o.dll is timing out
2 participants