Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in readonly struct passed by ref #106679

Closed
idg10 opened this issue Aug 20, 2024 · 73 comments · Fixed by #106908
Closed
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Milestone

Comments

@idg10
Copy link
Contributor

idg10 commented Aug 20, 2024

Description

One of the benchmarks in our Corvus.JsonSchema library has wildly different performance characteristics on .NET 9.0 on certain hardware:

CPU .NET 8.0 .NET 9.0 (9.0.24.40507)
Intel Core i9-9900K CPU 3.60GHz (Coffee Lake) 22ms 19ms
Intel Core i7-13800H 15ms 42ms

On the (rather old) Coffee Lake CPU, we see what you'd hope: .NET 9.0 is significantly faster than .NET 8.0.

On the much newer CPU (in a Surface Laptop Studio 2), in .NET 8.0 the benchmark runs a lot faster than on the old CPU, as is typical with a CPU that much newer. But running the same code on .NET 9.0 on that newer CPU is almost 3 times slower than with .NET 8.0 on the same CPU. (It's significantly slower even than .NET 8.0 on the much older CPU.)

To reproduce this, clone the repo from commit 2621745, run the Corvus.Json.Benchmark project, and select the ValidateLargeDocument benchmark.

The figures in the table above are for the ValidateLargeArrayCorvusV3 benchmark, but we see similar regressions (again, only on the newer CPU) for the ValidateLargeArrayCorvusV4 and ValidateLargeArrayCorvusValidator benchmarks.

We haven't yet succeeded in isolating whatever it is about this benchmark that produces these effects. So far our attempts to profile the code outside of BenchmarkDotNet haven't reproduced the issue. The code in question makes heavy use of System.Text.Json, so that's where we suspect the issue lies, but we can't prove that.

Configuration

Which version of .NET is the code running on?

NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2 and .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

What OS version, and what distro if applicable?

On the Coffee Lake machine (which doesn't have this problem) we're running Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3).

The newer machine (which exhibits the problem) is running Windows 11 (10.0.22621.4037/22H2/2022Update/SunValley2).

What is the architecture (x64, x86, ARM, ARM64)?

x64, Intel Core.

If relevant, what are the specs of the machine?

On the machine that does not reproduce the problem, the CPU is as already described. The machine has 64GB of memory. (I can provide more details if required.)

The machine on which we see the problem is a Surface Laptop Studio 2 with 64GB of RAM.

Regression

We're seeing the regression from .NET 8.0.7 to .NET 9.0.0-preview.7.24405.7 (BenchmarkDotNet reports this as 9.0.24.40506.)

We first observed this with .NET 9.0.0-preview.6.

Data

The Throughput benchmark results on the older (Coffee Lake) CPU (on which we don't see the regression) are:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error     | StdDev    | Median    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio  |
|---------------------------------- |--------- |------------ |------------- |----------:|----------:|----------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|-------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  21.94 ms |  0.324 ms |  0.303 ms |  21.97 ms |  1.00 |    0.00 |          - |          - |         - |        23 B |         1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  21.39 ms |  0.246 ms |  0.218 ms |  21.36 ms |  0.97 |    0.01 |          - |          - |         - |        23 B |         1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  37.38 ms |  0.710 ms |  0.697 ms |  37.20 ms |  1.70 |    0.03 |          - |          - |         - |       961 B |        41.78 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 674.64 ms | 12.023 ms | 11.246 ms | 677.29 ms | 30.76 |    0.70 | 26000.0000 | 21000.0000 | 3000.0000 | 193961040 B | 8,433,088.70 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  18.58 ms |  0.271 ms |  0.241 ms |  18.50 ms |  0.85 |    0.01 |          - |          - |         - |        34 B |         1.48 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  17.97 ms |  0.232 ms |  0.193 ms |  17.96 ms |  0.82 |    0.01 |          - |          - |         - |        34 B |         1.48 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  38.59 ms |  2.037 ms |  5.975 ms |  37.90 ms |  1.63 |    0.24 |          - |          - |         - |      4936 B |       214.61 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 677.66 ms | 10.140 ms |  8.468 ms | 677.89 ms | 30.89 |    0.56 | 25000.0000 | 20000.0000 | 2000.0000 | 192861320 B | 8,385,274.78 |

As you can see, on that CPU .NET 9.0 does better than .NET 8.0.

Here are the same results for the 13th gen CPU in the Surface Laptop Studio 2:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error    | StdDev    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio   |
|---------------------------------- |--------- |------------ |------------- |----------:|---------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|--------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  14.63 ms | 0.263 ms |  0.205 ms |  1.00 |    0.00 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  13.39 ms | 0.075 ms |  0.070 ms |  0.92 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  22.67 ms | 0.060 ms |  0.054 ms |  1.55 |    0.02 |          - |          - |         - |       902 B |         75.17 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 390.16 ms | 7.662 ms | 15.651 ms | 27.00 |    1.44 | 17000.0000 | 14000.0000 | 2000.0000 | 193948488 B | 16,162,374.00 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  42.31 ms | 0.151 ms |  0.141 ms |  2.90 |    0.04 |          - |          - |         - |        61 B |          5.08 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  38.11 ms | 0.158 ms |  0.148 ms |  2.61 |    0.04 |          - |          - |         - |        56 B |          4.67 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  59.78 ms | 0.105 ms |  0.098 ms |  4.09 |    0.06 |          - |          - |         - |      1306 B |        108.83 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 397.82 ms | 7.837 ms | 13.931 ms | 27.55 |    0.93 | 17000.0000 | 14000.0000 | 2000.0000 | 192858296 B | 16,071,524.67 |

As you can see, the .NET 9.0 numbers here (same preview build of .NET 9.0 - preview.7) for the first 3 benchmarks are significantly slower than for .NET 8.0. (And they are significantly slower than the same benchmarks on the much older CPU for either .NET 8.0 or 9.0 preview 7.)

@idg10 idg10 added the tenet-performance Performance related issue label Aug 20, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Aug 20, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 20, 2024
@idg10 idg10 changed the title 3x perf regression on 13th Gen Intel Core (i7-13800H) 3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in System.Text.Json Aug 20, 2024
@huoyaoyuan
Copy link
Member

How about installing 9.0 package of System.Text.Json on 8.0 runtime? This helps identifying the problem in JIT or new code pattern used in System.Text.Json.

@idg10
Copy link
Contributor Author

idg10 commented Aug 20, 2024

Using the 9.0 System.Text.Json package on 8.0 doesn't seem to change anything:

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error    | StdDev    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio   |
|---------------------------------- |--------- |------------ |------------- |----------:|---------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|--------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  12.66 ms | 0.085 ms |  0.080 ms |  1.00 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  12.93 ms | 0.140 ms |  0.131 ms |  1.02 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  21.74 ms | 0.049 ms |  0.046 ms |  1.72 |    0.01 |          - |          - |         - |       920 B |         76.67 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 408.42 ms | 8.010 ms |  8.226 ms | 32.27 |    0.66 | 17000.0000 | 14000.0000 | 2000.0000 | 195061544 B | 16,255,128.67 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  42.09 ms | 0.093 ms |  0.073 ms |  3.33 |    0.02 |          - |          - |         - |        65 B |          5.42 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  38.18 ms | 0.144 ms |  0.135 ms |  3.02 |    0.02 |          - |          - |         - |        56 B |          4.67 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  60.64 ms | 0.253 ms |  0.237 ms |  4.79 |    0.03 |          - |          - |         - |      1338 B |        111.50 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 392.64 ms | 7.740 ms | 13.956 ms | 31.02 |    1.11 | 17000.0000 | 14000.0000 | 2000.0000 | 192693192 B | 16,057,766.00 |

So this looks likely to be a JIT issue of some kind. (It might still be something that System.Text.Json runs into, but it doesn't appear to be a change in that library that causes this.)

@huoyaoyuan huoyaoyuan added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Aug 20, 2024
@mwadams
Copy link
Contributor

mwadams commented Aug 20, 2024

Running the same assemblies compiled for net80 on both the .NET 8 and .NET 9 runtimes produces the same results, so it looks very likely that this is a JIT-related regression.

| Method                            | Runtime  | RunStrategy | UnrollFactor | Mean      | Error    | StdDev    | Ratio | RatioSD | Gen0       | Gen1       | Gen2      | Allocated   | Alloc Ratio   |
|---------------------------------- |--------- |------------ |------------- |----------:|---------:|----------:|------:|--------:|-----------:|-----------:|----------:|------------:|--------------:|
| ValidateLargeArrayCorvusV3        | .NET 8.0 | Throughput  | 16           |  12.64 ms | 0.096 ms |  0.085 ms |  1.00 |    0.00 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusV4        | .NET 8.0 | Throughput  | 16           |  13.01 ms | 0.200 ms |  0.187 ms |  1.03 |    0.01 |          - |          - |         - |        12 B |          1.00 |
| ValidateLargeArrayCorvusValidator | .NET 8.0 | Throughput  | 16           |  22.39 ms | 0.071 ms |  0.067 ms |  1.77 |    0.01 |          - |          - |         - |       920 B |         76.67 |
| ValidateLargeArrayJsonEverything  | .NET 8.0 | Throughput  | 16           | 389.74 ms | 7.713 ms | 15.582 ms | 29.71 |    0.83 | 17000.0000 | 14000.0000 | 2000.0000 | 193948480 B | 16,162,373.33 |
| ValidateLargeArrayCorvusV3        | .NET 9.0 | Throughput  | 16           |  42.43 ms | 0.261 ms |  0.244 ms |  3.36 |    0.03 |          - |          - |         - |        65 B |          5.42 |
| ValidateLargeArrayCorvusV4        | .NET 9.0 | Throughput  | 16           |  38.33 ms | 0.151 ms |  0.134 ms |  3.03 |    0.02 |          - |          - |         - |        77 B |          6.42 |
| ValidateLargeArrayCorvusValidator | .NET 9.0 | Throughput  | 16           |  60.24 ms | 0.253 ms |  0.224 ms |  4.76 |    0.04 |          - |          - |         - |      1338 B |        111.50 |
| ValidateLargeArrayJsonEverything  | .NET 9.0 | Throughput  | 16           | 415.50 ms | 8.054 ms |  7.533 ms | 32.83 |    0.69 | 17000.0000 | 14000.0000 | 2000.0000 | 192858880 B | 16,071,573.33 |

@Tornhoof
Copy link
Contributor

That CPU is one of those P/E Core thingies isn't it?
You could try with --affinity, as described here https://github.com/dotnet/BenchmarkDotNet/blob/master/docs/articles/guides/console-args.md
something like --affinity 1 and --affinity 65536 to see if there are any differences

@huoyaoyuan
Copy link
Member

Note the difference of Allocated column. It's probably caused by higher GC pressure.

You can try to use allocation profiler of VS to diagnose what's allocated more.

@EgorBo
Copy link
Member

EgorBo commented Aug 20, 2024

You may want to disable DATAS as well to see if the regression coming from that

<GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>

or

DOTNET_GCDynamicAdaptationMode=0

envvar.

@mwadams
Copy link
Contributor

mwadams commented Aug 20, 2024

Note the difference of Allocated column. It's probably caused by higher GC pressure.

You can try to use allocation profiler of VS to diagnose what's allocated more.

You can disregard the JsonEverything rows. There is essentially zero allocation and actual no GCs in the Corvus code (which is the code under test). JsonEverything is Greg Dennis's JSON schema code which is our "competitive comparison".

@mwadams
Copy link
Contributor

mwadams commented Aug 20, 2024

That CPU is one of those P/E Core thingies isn't it?
You could try with --affinity, as described here https://github.com/dotnet/BenchmarkDotNet/blob/master/docs/articles/guides/console-args.md
something like --affinity 1 and --affinity 65536 to see if there are any differences

We saw no difference when setting a particular processor affinity.

We've also correlated the benchmark runs with the processor ETW data and determined that it is not being throttled in either case.

@mwadams
Copy link
Contributor

mwadams commented Aug 20, 2024

Where we are now is that we've added the Microsoft Performance CPU usage Diagnoser to the benchmark, so we can use the perf viewer to figure out what is going on.

The code emitted is quite different; notably, the System.Text.Json.TextEquals method seems much slower on .NET 9 and yet that is just a bit of scaffolding around RoS<byte>.SequenceEqual

My current suspicion is that it is a regression related to memory alignment or cacheability of the buffers STJ is allocating (which is also why we see the exact same behaviour with code compiled for net80 or net90 under the net90 runtime).

@EgorBo
Copy link
Member

EgorBo commented Aug 20, 2024

Please attach the complete set of steps to build & run the repro (or, better, a self-contained one)

@EgorBo EgorBo added the needs-author-action An issue or pull request that requires more info or actions from the author. label Aug 20, 2024
@EgorBo
Copy link
Member

EgorBo commented Aug 20, 2024

Although, I managed to run it on the given commit and can't reproduce your numbers on Ryzen 7950x (the only x64 CPU I have). Perhaps, someone in @dotnet/jit-contrib has a similar to Intel Core i7-13800H cpu?

@mwadams
Copy link
Contributor

mwadams commented Aug 21, 2024

Attached is a standalone project with the single relevant benchmark.

Repro106679.zip

The machine on which this reproduces is:
Surface Laptop Studio 2
Processor 13th Gen Intel(R) Core(TM) i7-13800H 2.90 GHz
Installed RAM 64.0 GB (63.8 GB usable)
System type 64-bit operating system, x64-based processor

@huoyaoyuan
Copy link
Member

I can also reproduce this on i9-13900K, which is also Rapter Lake:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-XCKTZZ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-QRHGPG : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2

OutlierMode=RemoveAll  RunStrategy=Throughput
Method Runtime Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ValidateLargeArrayCorvusV4 .NET 8.0 12.30 ms 0.057 ms 0.053 ms 1.00 0.01 12 B 1.00
ValidateLargeArrayCorvusV4 .NET 9.0 35.54 ms 0.195 ms 0.183 ms 2.89 0.02 71 B 5.92

@huoyaoyuan
Copy link
Member

I'm seeing the same bahavior when using System.Text.Json 9.0 with 8.0 runtime.

JsonDocument.TextEquals contains calls to array pool, which doesn't belong to S.T.Json, can cause allocation and may behave differently by CPU architectures. It looks suspicious.

@huoyaoyuan
Copy link
Member

huoyaoyuan commented Aug 21, 2024

Note that this reproduction is still not good enough. It contains non-trivial amount of third party code at Corvus.Json.ExtendedTypes. Ideally all the interesting path should be available as source.

And also a kindly reminder that ImmutableList<T> is not a performant type. You probably want ImmutableArray<T>.

@mwadams
Copy link
Contributor

mwadams commented Aug 21, 2024

Note that this reproduction is still not good enough. It contains non-trivial amount of third party code at Corvus.Json.ExtendedTypes. Ideally all the interesting path should be available as source.

And also a kindly reminder that ImmutableList<T> is not a performant type. You probably want ImmutableArray<T>.

Corvus.Json.ExtendedTypes is our code under test in this benchmark, so it is not really reducible. The source is available in the commit at the link provided in the OP.

Thanks for the observation about ImmutableArray<T>. Note that the ImmutableList<T> is just a transient in the setup code - it isn't the bit that is under test. And although this is not relevant to the issue at hand, you might be interested in the reason for the use of ImmutableList<T>.

Our readonly structs are essentially unions of a JsonElement and a native dotnet type - e.g. string for JSON strings. In the usual (perf-sensitive) case, we are simply a wrapper around the JsonElement so we benefit from the low-allocation behaviour of that type, and the other half of the union is "default/empty". However, if you mutate a part of the data structure, we use the other half of the union, and create an instance of the non-JsonElement type to hold the mutated part (with the JsonElement now set to default). The unchanged parts continue to wrap the JsonElement backing (this is a little like JsonNode, but as a struct; it also predates JsonNode).

For the non-JsonElement backing for objects and arrays, we require an array-like collection. We have to use ImmutableList<T> not ImmutableArray<T> because the data structure is recursive (i.e. the items/properties of arrays/objects can themselves be arrays or objects) and you cannot create an ImmutableArray<T> where instances of T themselves contain an ImmutableArray<T>.

@huoyaoyuan
Copy link
Member

I think CPU profiling is misleading in this case.

In detailed statement-level view, it shows significant difference in return statements of two methods:

image
image

They can potentially be GC-hijacked returns and actually representing the time occupied by GC.

@mwadams
Copy link
Contributor

mwadams commented Aug 21, 2024

I think CPU profiling is misleading in this case.

In detailed statement-level view, it shows significant difference in return statements of two methods:

image image

They can potentially be GC-hijacked returns and actually representing the time occupied by GC.

I agree that the mapping of the CPU profiling to methods is a bit misleading. If you look at the JITted code it jumps around all over the place, and looks quite challenging to tie down precisely! One thing we notice is that the jumps in the hot path are now slightly longer in the net9.0 as the code has been slightly reordered, but I don't know if that is significant.

Also, if you correlate the benchmarks with ETW data, I don't believe that there are any GCs in the benchmark period as there is so little GC pressure. [I haven't confirmed that today, though.]

@mwadams
Copy link
Contributor

mwadams commented Aug 21, 2024

They can potentially be GC-hijacked returns and actually representing the time occupied by GC.

I did wonder if this case might be caused by unnecessary copying - that is a static readonly instance of a readonly struct and it should be possible to optimize it all away up and down the stack to a reference to the instance; but I haven't looked at the JITted code to see what is actually happening.

@huoyaoyuan
Copy link
Member

Meanwhile, I can't find anything allocated on the heap in the core loop by doing memory profiling or heap snapshots. It's also strange why BDN reports heap allocation difference.

@huoyaoyuan
Copy link
Member

huoyaoyuan commented Aug 21, 2024

I have really no more time to help on this. Here are some conclusions so far.

Corvus.Json.ExtendedTypes is our code under test in this benchmark, so it is not really reducible. The source is available in the commit at the link provided in the OP.

Yes, but it's distributed as NuGet package instead of source code in your reproduction. We need source code to ease more AdHoc investigations.


Commenting out ArrayValidationHandler and following part in PersonArray.Validate will result in a more dramatic regression:

Method Runtime Mean Error StdDev Ratio RatioSD Code Size Allocated Alloc Ratio
ValidateLargeArrayCorvusV4 .NET 8.0 9.814 ns 0.0312 ns 0.0260 ns 1.00 0.00 1,064 B - NA
ValidateLargeArrayCorvusV4 .NET 9.0 88.357 ns 0.3182 ns 0.2976 ns 9.00 0.04 1,057 B - NA

Looking at the disassembly with [DisassemblyDiagnoser(maxDepth: 20)], the beginning part of Corvus.Json.ValidateWithoutCoreType.TypeArray shows a significant difference:

.NET 9:

       push      rdi
       push      rsi
       push      rbx
       sub       rsp,50
       xor       eax,eax
       mov       [rsp+28],rax
       vxorps    xmm4,xmm4,xmm4
       vmovdqu   ymmword ptr [rsp+30],ymm4
       mov       rbx,rcx
       mov       edi,edx
       mov       rsi,r8
       cmp       dil,2
       jne       short M02_L02
       cmp       r9d,3
       je        short M02_L01
       mov       rcx,rbx
       cmp       [rcx],cl
       mov       rdx,rsi
       cmp       [rdx],dl
       mov       r8d,38
       call      qword ptr [7FF8C1225620]; System.Buffer.BulkMoveWithWriteBarrier(Byte ByRef, Byte ByRef, UIntPtr)
M02_L00:
       mov       rax,rbx
       add       rsp,50
       pop       rbx
       pop       rsi
       pop       rdi
       ret

.NET 8:

       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,48
       vzeroupper
       vxorps    xmm4,xmm4,xmm4
       vmovdqa   xmmword ptr [rsp+20],xmm4
       vmovdqa   xmmword ptr [rsp+30],xmm4
       xor       eax,eax
       mov       [rsp+40],rax
       mov       rbx,rcx
       mov       ebp,edx
       mov       rsi,r8
       cmp       bpl,2
       jne       short M02_L01
       cmp       r9d,3
       je        near ptr M02_L10
       mov       rdi,rbx
       call      CORINFO_HELP_ASSIGN_BYREF
       call      CORINFO_HELP_ASSIGN_BYREF
       movsq
       call      CORINFO_HELP_ASSIGN_BYREF
       call      CORINFO_HELP_ASSIGN_BYREF
       call      CORINFO_HELP_ASSIGN_BYREF
       call      CORINFO_HELP_ASSIGN_BYREF
M02_L00:
       mov       rax,rbx
       add       rsp,48
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret

But I'm really not sure whether it's the culprit.
It also helps if the methods are directly editable as source.


My further suggestion is to split the reproduction into smaller pieces and see the performance and disassembly of them. It needs to be narrowed down to become more actionable.

You can also try each preview of .NET 9 to help identify which commit range causes the difference.

@mwadams
Copy link
Contributor

mwadams commented Aug 21, 2024

I have pulled in all the essential pieces as source in a single project.

Repro106679-2.zip

I have then added several additional benchmarks.

Doing a simple string validation is 2x faster on .NET 9 v. .NET 8

Method Runtime Mean Error StdDev Ratio Allocated Alloc Ratio
ValidateStringCorvusV4 .NET 8.0 10.749 ns 0.0332 ns 0.0294 ns 1.00 - NA
ValidateStringCorvusV4 .NET 9.0 5.164 ns 0.0263 ns 0.0246 ns 0.48 - NA

However, calling Validate on PersonNameElement is 3x slower on .NET 9 v. .NET 8

Method Runtime Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ValidatePersonNameElementCorvusV4 .NET 8.0 126.273 ns 0.6925 ns 0.6477 ns 1.00 0.01 - NA
ValidatePersonNameElementCorvusV4 .NET 9.0 392.391 ns 0.9471 ns 0.7909 ns 3.11 0.02 - NA

However! If we comment out the code that gets the actual raw JSON text we see that the .NET 9 code is back to being 2x faster.

Method Runtime Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ValidatePersonNameElement2CorvusV4 .NET 8.0 11.072 ns 0.0411 ns 0.0384 ns 0.09 0.00 - NA
ValidatePersonNameElement2CorvusV4 .NET 9.0 4.132 ns 0.0286 ns 0.0267 ns 0.03 0.00 - NA

Driling into this, I then tried benchmarking the code that gets the raw JSON text. Again that is faster under .NET 9 than .NET 8. (GetString)

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 70.16 ns 0.263 ns 0.233 ns 1.00 0.0050 64 B 1.00
ProcessString .NET 9.0 66.86 ns 0.488 ns 0.407 ns 0.95 0.0050 64 B 1.00

Similarly, it is still faster if I create and pass in the same context as I use in the real validation code. (GetStringWithValidationContext)

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 180.8 ns 1.31 ns 1.23 ns 1.00 0.0050 64 B 1.00
ProcessString .NET 9.0 157.5 ns 0.92 ns 0.77 ns 0.87 0.0050 64 B 1.00

Howver, if I update the validation context and return it, the result is 33% slower in .NET 9 than .NET 8.

This suggests to me that there is a regression with readonly structs passed by ref.

Method Runtime Mean Error StdDev Ratio Gen0 Allocated Alloc Ratio
ProcessString .NET 8.0 195.3 ns 0.93 ns 0.87 ns 1.00 0.0050 64 B 1.00
ProcessString .NET 9.0 265.7 ns 1.21 ns 1.13 ns 1.36 0.0048 64 B 1.00

@idg10 idg10 changed the title 3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in System.Text.Json 3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in readonly struct passed by ref Aug 22, 2024
@AndyAyersMS
Copy link
Member

@EgorBo not sure if can we use vzeroupper in the runtime itself -- is it a no-op on SSE2 HW?

@EgorBo
Copy link
Member

EgorBo commented Aug 23, 2024

@EgorBo not sure if can we use vzeroupper in the runtime itself -- is it a no-op on SSE2 HW?

ah good point, that is jit then

@tannergooding
Copy link
Member

tannergooding commented Aug 23, 2024

It is not, it will fault on hardware without AVX support.

We could add a separate code path that checks for AVX support and then uses VEX encoded instructions (either a separate VEX loop or using vzeroupper) but that's more work and risk, and in the case of it being called from a method not using upper state may add hidden cost.

The NeedsVzeroupper API and its use-sites are setup to follow the official optimization manual guidance on where/when to insert vzeroupper, so if we're annotating the helpers correctly things should just work

@AndyAyersMS
Copy link
Member

Thanks @AndyAyersMS and @EgorBo for persisting with this. It's a big deal for us, and it looks like this is actually going to have a positive impact rather than just eliminating a hit!

And thanks to you and @idg10 for trying preview 7 and reporting the problem. Likely this might have gone unnoticed until the official release came out.

@AndyAyersMS
Copy link
Member

@EgorBo are you going to work up a fix or do you want me to do it?

@EgorBo
Copy link
Member

EgorBo commented Aug 23, 2024

@EgorBo are you going to work up a fix or do you want me to do it?

I can if you want me to do so 🙂 but if Tanner is right, it is supposed to be a single-line like change in NeedsVzeroupper

@mwadams
Copy link
Contributor

mwadams commented Aug 23, 2024

And thanks to you and @idg10 for trying preview 7 and reporting the problem. Likely this might have gone unnoticed until the official release came out.

We just want these graphs to keep going in the right direction :)

https://endjin.com/blog/2023/12/how-dotnet-8-boosted-json-schema-performance-by-20-percent-for-free
https://endjin.com/blog/2023/11/how-dotnet-8-boosted-ais-dotnet-performance-by-27-percent-for-free

@AndyAyersMS
Copy link
Member

@EgorBo are you going to work up a fix or do you want me to do it?

I can if you want me to do so 🙂 but if Tanner is right, it is supposed to be a single-line like change in NeedsVzeroupper

Ok, I can put up a fix then.

@mwadams
Copy link
Contributor

mwadams commented Aug 23, 2024

Let me know if there is a build you want me to help verify.

@AndyAyersMS
Copy link
Member

I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).

@EgorBo
Copy link
Member

EgorBo commented Aug 23, 2024

Let me change my benchmark to do some AVX heavy-lifting first:

@EgorBot -intel --runtimes net8.0 net9.0

using System;
using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<MyBench>(args: args);

public class MyBench
{

    [Benchmark]
    public void NotInHeap()
    {
        Struct1 s1 = new();
        Struct1 s2 = new();
        ByrefCopy(ref s1, s2);
    }

    Struct1 dst1;

    [Benchmark]
    public void InHeap_Empty()
    {
        ByrefCopy(ref dst1, default);
    }

    [Benchmark]
    public void InHeap_Ephemeral()
    {
        ByrefCopy(ref dst1, new Struct1(1, null, 3, new Struct2(4, null, 6, null)));
    }

    private byte[] array = new byte[1000];

    [MethodImpl(MethodImplOptions.NoInlining)]
    public bool ByrefCopy(ref Struct1 dst, Struct1 src)
    {
        // Do some inline AVX:
        array.AsSpan(0, 128).Clear();

        dst = src;
        return true;
    }
}

public record struct Struct1(
    object a1, object a2,
    long a3, Struct2 g);

public record struct Struct2(
    object a1, object a2, object a3,
    object a4);

@EgorBot
Copy link

EgorBot commented Aug 23, 2024

Benchmark results on Intel
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-LLCCCZ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-YEZQNO : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method Runtime Mean Error Ratio
NotInHeap .NET 8.0 21.78 ns 0.039 ns 1.00
NotInHeap .NET 9.0 12.33 ns 0.002 ns 0.57
InHeap_Empty .NET 8.0 26.23 ns 0.055 ns 1.00
InHeap_Empty .NET 9.0 12.04 ns 0.012 ns 0.46
InHeap_Ephemeral .NET 8.0 44.00 ns 0.450 ns 1.00
InHeap_Ephemeral .NET 9.0 29.37 ns 0.459 ns 0.67

BDN_Artifacts.zip

@mwadams
Copy link
Contributor

mwadams commented Aug 23, 2024

I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).

Quite happy to monkey patch (with instructions!)

@mwadams
Copy link
Contributor

mwadams commented Aug 23, 2024

@EgorBo Running your benchmark: those numbers are quite definitive

Method Runtime Mean Error StdDev Ratio RatioSD
NotInHeap .NET 8.0 10.88 ns 0.095 ns 0.084 ns 1.00 0.01
NotInHeap .NET 9.0 84.15 ns 0.583 ns 0.545 ns 7.74 0.08
InHeap_Empty .NET 8.0 12.00 ns 0.105 ns 0.093 ns 1.00 0.01
InHeap_Empty .NET 9.0 94.68 ns 1.862 ns 1.992 ns 7.89 0.17
InHeap_Ephemeral .NET 8.0 21.63 ns 0.299 ns 0.265 ns 1.00 0.02
InHeap_Ephemeral .NET 9.0 100.32 ns 0.900 ns 0.702 ns 4.64 0.06

@EgorBo
Copy link
Member

EgorBo commented Aug 23, 2024

@EgorBo Running your benchmark: those numbers are quite definitive

Thanks! that definitely confirms the issue @AndyAyersMS found.
It also likely means that Linux is not affected

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Aug 23, 2024

I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october).

Quite happy to monkey patch (with instructions!)

Save and unzip the attached. As admin, copy this DLL to the preview 7 folder (you may want to copy off the existing jit first so you can put things back later). For me this is at

C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.0-preview.7.24405.7

clrjit.zip

This is a checked build of the jit so it will be 5MB+ when unpacked. If you enable disassembly with this in place you will see more verbose outputs.

@EgorBo
Copy link
Member

EgorBo commented Aug 23, 2024

For me this is at

Can be obtained via

dotnet --list-runtimes | Select-String -Pattern "Microsoft.NETCore.App 9.0"

(powershell)

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Aug 23, 2024
The helper uses SSE2, so we need to take care to avoid AVX-SSE
transition penalties.

Fixes dotnet#106679.
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Aug 23, 2024
@EgorBo EgorBo removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Aug 23, 2024
@mwadams
Copy link
Contributor

mwadams commented Aug 24, 2024

With the monkey-patched DLL

Method Runtime Mean Error StdDev Ratio
ProcessString .NET 8.0 212.71 ns 2.491 ns 1.648 ns 1.00
ProcessString .NET 9.0 95.75 ns 0.827 ns 0.733 ns 0.45

And the original 3x perf hit benchmark:

Method Runtime Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ValidateLargeArrayCorvusV4 .NET 8.0 13.48 ms 0.187 ms 0.175 ms 1.00 0.02 12 B 1.00
ValidateLargeArrayCorvusV4 .NET 9.0 11.23 ms 0.049 ms 0.045 ms 0.83 0.01 17 B 1.42

We're now running 17% faster on .NET 9.0 (which puts the graph back on track!)

github-actions bot pushed a commit that referenced this issue Aug 25, 2024
The helper uses SSE2, so we need to take care to avoid AVX-SSE
transition penalties.

Fixes #106679.
jeffschwMSFT added a commit that referenced this issue Aug 26, 2024
…rier helper (#106937)

* JIT: emit vzeroupper before calls to the bulk write barrier helper

The helper uses SSE2, so we need to take care to avoid AVX-SSE
transition penalties.

Fixes #106679.

* review feedback

---------

Co-authored-by: Andy Ayers <andya@microsoft.com>
Co-authored-by: Jeff Schwartz <jeffschw@microsoft.com>
@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2024
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue Dec 10, 2024
…otnet#106908)

The helper uses SSE2, so we need to take care to avoid AVX-SSE
transition penalties.

Fixes dotnet#106679.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants