[Perf] Windows/x64: Decimal Regressions on 2/3/2025 6:32:46 PM +00:00 #112432

Open
performanceautofiler bot opened this issue Feb 11, 2025 · 5 comments
Labels
arch-x64, area-System.Numerics, os-windows, runtime-coreclr, tenet-performance, tenet-performance-benchmarks, untriaged

Comments

@performanceautofiler

Run Information

Name Value
Architecture x64
OS Windows 10.0.22631
Queue ViperWindows
Baseline ad382caba7c0279af5f2b233847371eadfb42f5c
Compare 05d94d94028b5622b19734e0e1b60d7aca4667b3
Diff Diff
Configs CompilationMode:tiered, RunKind:micro

Regressions in LinqBenchmarks

Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio
 | 81.55 ms | 87.49 ms | 1.07 | 0.03 | False | | |
 | 40.51 ms | 43.03 ms | 1.06 | 0.01 | False | | |
 | 40.51 ms | 43.10 ms | 1.06 | 0.00 | False | | |

Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'LinqBenchmarks*'

LinqBenchmarks.Order00ManualX

ETL Files

Histogram

JIT Disasms

LinqBenchmarks.Order00LinqQueryX

ETL Files

Histogram

JIT Disasms

LinqBenchmarks.Order00LinqMethodX

ETL Files

Histogram

JIT Disasms

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@LoopedBard3
Member

LoopedBard3 commented Feb 11, 2025

LoopedBard3 removed the PGO label Feb 11, 2025
LoopedBard3 transferred this issue from dotnet/perf-autofiling-issues Feb 11, 2025
dotnet-issue-labeler bot added the needs-area-label label Feb 11, 2025
LoopedBard3 changed the title from "[Perf] Windows/x64: 3 Regressions on 2/3/2025 6:32:46 PM +00:00" to "[Perf] Windows/x64: Decimal Regressions on 2/3/2025 6:32:46 PM +00:00" Feb 11, 2025
LoopedBard3 added the tenet-performance and tenet-performance-benchmarks labels Feb 11, 2025
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

tannergooding commented Feb 12, 2025

CC @Daniel-Svensson: given that the official benchmarks are showing regressions, we may need to look at reverting.

There are also some improvements (linked to from the bottom of #99212), so some further analysis might be desirable.

Is that something you have the capacity to look at?

vcsjones removed the needs-area-label label Feb 18, 2025
@Daniel-Svensson
Contributor

I can try to have a quick look, but it will take a week or two before I have the time and energy.

I had a quick look at the diff and nothing stood out for the compare (add/sub) case.
It might be related to the BigMul change, which now uses MulX. Depending on the distribution of number scaling, MulX does not generate code as good as the normal multiply.
If that is the case, only the change in Add/Sub needs to be reverted.

> There are also some improvements (linked to from the bottom of #99212), so some further analysis might be desirable.

With the improvements, do you refer to the switch statement or the list of performance improvements from the PR Summary under "Remarks for better performance" (add with carry, MultiplyNoFlags and ShiftLeft128)?

I do not think I have the capacity to fix any of the intrinsics at the moment, but if someone does, I will gladly help apply them to the decimal code. (I think add/sub with carry made the C++ compare/add/sub code 4-10% faster, with a lot less branches if I remember correctly.)

@tannergooding
Member

> With the improvements, do you refer to the switch statement or the list of performance improvements from the PR Summary under "Remarks for better performance" (add with carry, MultiplyNoFlags and ShiftLeft128)?

This list at the very bottom of the PR:

Image

> with a lot less branches if I remember correctly

Notably, fewer branches does not mean faster code. This depends on the work being done and on how predictable the branches are. The cost of a predicted branch is effectively 0-1 cycles, while the cost of a mispredicted branch is effectively 20 cycles.

Micro benchmarks tend to perform better with branches due to typically testing the same input over and over. What is best for real world code tends to be workload specific.
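
To make the trade-off concrete, here is a minimal sketch of the same three-way comparison written with and without explicit branches. The class and method names are hypothetical and are not taken from the decimal code. Whether the JIT actually emits branch-free instructions for the second form is an assumption that should be checked against the disassembly.

```csharp
static class CompareSketch
{
    // Branchy form: at most two conditional jumps. If the inputs follow a
    // stable pattern, the predictor makes these nearly free.
    public static int CompareBranchy(ulong a, ulong b)
    {
        if (a < b) return -1;
        if (a > b) return 1;
        return 0;
    }

    // Arithmetic form: both comparisons are always evaluated and combined.
    // It avoids misprediction only if the JIT lowers the conditionals to
    // flag-setting instructions, which is not guaranteed.
    public static int CompareBranchless(ulong a, ulong b)
    {
        return (a > b ? 1 : 0) - (a < b ? 1 : 0);
    }
}
```

Both methods return the same result for every input, so they can be swapped in a benchmark. Running that benchmark once with a fixed input and once with randomized inputs is one way to see how much of a measured difference comes from branch prediction.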


The actual perf regression here looks to be related to CompareTo, which ends up calling VarDecCmpSub, so this is most likely related to this switch, potentially due to the out parameter or similar:

-                        ulong tmpLow = Math.BigMul((uint)low64, power);
-                        ulong tmp = Math.BigMul((uint)(low64 >> 32), power) + (tmpLow >> 32);
-                        low64 = (uint)tmpLow + (tmp << 32);
-                        tmp >>= 32;
+                        ulong tmp = Math.BigMul(low64, power, out low64);
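
The two sides of the diff above can be isolated for side-by-side benchmarking or disassembly. The following is a minimal sketch: the class name, method names, and the `ref` parameter shape are hypothetical wrappers added for illustration, while the method bodies follow the removed and added lines. It assumes `power` is a `uint`, as the old `Math.BigMul((uint)low64, power)` call suggests.

```csharp
using System;

static class BigMulSketch
{
    // Removed lines: two 32x32->64 multiplies with the carry combined by hand.
    // Returns the bits above 64 of low64 * power; low64 receives the low 64 bits.
    public static ulong ScaleOld(ref ulong low64, uint power)
    {
        ulong tmpLow = Math.BigMul((uint)low64, power);
        ulong tmp = Math.BigMul((uint)(low64 >> 32), power) + (tmpLow >> 32);
        low64 = (uint)tmpLow + (tmp << 32);
        tmp >>= 32;
        return tmp;
    }

    // Added line: one 64x64->128 multiply, with the low half written
    // through an out parameter.
    public static ulong ScaleNew(ref ulong low64, uint power)
    {
        return Math.BigMul(low64, power, out low64);
    }
}
```

Both methods compute the same 96-bit product, so any timing difference between them comes from code generation rather than from the arithmetic. Comparing their JIT disassembly is one way to check whether the out parameter is the cause.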
