3x perf regression on 13th Gen Intel Core (i7-13800H) maybe in readonly struct passed by ref #106679
Comments
How about installing the 9.0 package of System.Text.Json on the 8.0 runtime? This helps identify whether the problem is in the JIT or in a new code pattern used in System.Text.Json. |
Using the 9.0
So this looks likely to be a JIT issue of some kind. (It might still be something that |
Running the same assemblies compiled for net80 on both the .NET 8 and .NET 9 runtimes produces the same results, so it looks very likely that this is a JIT-related regression.
|
That CPU is one of those P/E Core thingies isn't it? |
Note the difference of You can try to use allocation profiler of VS to diagnose what's allocated more. |
You may want to disable DATAS as well to see if the regression is coming from that
or
envvar. |
You can disregard the JsonEverything rows. There is essentially zero allocation and actually no GCs in the Corvus code (which is the code under test). JsonEverything is Greg Dennis's JSON schema code which is our "competitive comparison". |
We saw no difference when setting a particular processor affinity. We've also correlated the benchmark runs with the processor ETW data and determined that it is not being throttled in either case. |
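For anyone repeating the affinity experiment, here is a minimal sketch of pinning the current process to a single logical processor. The `AffinityHelper` class and its method names are hypothetical and not part of the benchmark project; `Process.ProcessorAffinity` is the real `System.Diagnostics` API, and it is only supported on Windows and Linux.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, for illustration only.
public static class AffinityHelper
{
    // Returns a mask that keeps only the lowest set bit of the input mask.
    public static long LowestSetBit(long mask) => mask & -mask;

    // Restricts the current process to the first logical processor that its
    // existing affinity mask allows, and returns the mask that was applied.
    public static long PinToFirstAllowedProcessor()
    {
        using Process current = Process.GetCurrentProcess();
        long allowed = (long)current.ProcessorAffinity;
        long target = LowestSetBit(allowed);
        current.ProcessorAffinity = (IntPtr)target;
        return target;
    }
}
```

On a hybrid P/E-core CPU, which bit corresponds to which core type is machine-specific, so the mask would need to be chosen per machine rather than assumed.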
Where we are now is that we've added the Microsoft Performance CPU usage Diagnoser to the benchmark, so we can use the perf viewer to figure out what is going on. The code emitted is quite different; notably, the System.Text.Json.TextEquals method seems much slower on .NET 9 and yet that is just a bit of scaffolding around My current suspicion is that it is a regression related to memory alignment or cacheability of the buffers STJ is allocating (which is also why we see the exact same behaviour with code compiled for net80 or net90 under the net90 runtime). |
Please attach the complete set of steps to build & run the repro (or, better, a self-contained one) |
Although, I managed to run it on the given commit and can't reproduce your numbers on a Ryzen 7950X (the only x64 CPU I have). Perhaps someone in @dotnet/jit-contrib has a CPU similar to the Intel Core i7-13800H? |
Attached is a standalone project with the single relevant benchmark. The machine on which this reproduces is: |
I can also reproduce this on i9-13900K, which is also Raptor Lake:
|
I'm seeing the same behavior when using System.Text.Json 9.0 with the 8.0 runtime.
|
Note that this reproduction is still not good enough. It contains non-trivial amount of third party code at And also a kindly reminder that |
Corvus.Json.ExtendedTypes is our code under test in this benchmark, so it is not really reducible. The source is available in the commit at the link provided in the OP. Thanks for the observation about Our readonly structs are essentially unions of a For the non- |
I agree that the mapping of the CPU profiling to methods is a bit misleading. If you look at the JITted code it jumps around all over the place, and looks quite challenging to tie down precisely! One thing we notice is that the jumps in the hot path are now slightly longer in the net9.0 as the code has been slightly reordered, but I don't know if that is significant. Also, if you correlate the benchmarks with ETW data, I don't believe that there are any GCs in the benchmark period as there is so little GC pressure. [I haven't confirmed that today, though.] |
I did wonder if this case might be caused by unnecessary copying - that is a |
Meanwhile, I can't find anything allocated on the heap in the core loop by doing memory profiling or heap snapshots. It's also strange why BDN reports heap allocation difference. |
I have really no more time to help on this. Here are some conclusions so far.
Yes, but it's distributed as NuGet package instead of source code in your reproduction. We need source code to ease more AdHoc investigations. Commenting out
Looking at the disassembly with .NET 9:
push rdi
push rsi
push rbx
sub rsp,50
xor eax,eax
mov [rsp+28],rax
vxorps xmm4,xmm4,xmm4
vmovdqu ymmword ptr [rsp+30],ymm4
mov rbx,rcx
mov edi,edx
mov rsi,r8
cmp dil,2
jne short M02_L02
cmp r9d,3
je short M02_L01
mov rcx,rbx
cmp [rcx],cl
mov rdx,rsi
cmp [rdx],dl
mov r8d,38
call qword ptr [7FF8C1225620]; System.Buffer.BulkMoveWithWriteBarrier(Byte ByRef, Byte ByRef, UIntPtr)
M02_L00:
mov rax,rbx
add rsp,50
pop rbx
pop rsi
pop rdi
ret
.NET 8:
push rdi
push rsi
push rbp
push rbx
sub rsp,48
vzeroupper
vxorps xmm4,xmm4,xmm4
vmovdqa xmmword ptr [rsp+20],xmm4
vmovdqa xmmword ptr [rsp+30],xmm4
xor eax,eax
mov [rsp+40],rax
mov rbx,rcx
mov ebp,edx
mov rsi,r8
cmp bpl,2
jne short M02_L01
cmp r9d,3
je near ptr M02_L10
mov rdi,rbx
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
movsq
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
M02_L00:
mov rax,rbx
add rsp,48
pop rbx
pop rbp
pop rsi
pop rdi
ret
But I'm really not sure whether it's the culprit. My further suggestion is to split the reproduction into smaller pieces and see the performance and disassembly of them. It needs to be narrowed down to become more actionable. You can also try each preview of .NET 9 to help identify which commit range causes the difference. |
I have pulled in all the essential pieces as source in a single project. I have then added several additional benchmarks. Doing a simple string validation is 2x faster on .NET 9 v. .NET 8
However, calling Validate on PersonNameElement is 3x slower on .NET 9 v. .NET 8
However! If we comment out the code that gets the actual raw JSON text we see that the .NET 9 code is back to being 2x faster.
Drilling into this, I then tried benchmarking the code that gets the raw JSON text. Again that is faster under .NET 9 than .NET 8. (
Similarly, it is still faster if I create and pass in the same context as I use in the real validation code. (
However, if I update the validation context and return it, the result is 33% slower in .NET 9 than .NET 8. This suggests to me that there is a regression with readonly structs passed by ref.
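The pattern described here, reading a readonly struct by reference and returning an updated copy, can be sketched roughly as follows. `DemoContext` and `DemoValidator` are hypothetical stand-ins, not the Corvus types; the only assumption carried over from the thread is that the struct mixes object references with plain values, which is the kind of layout whose copies can require GC write barriers.

```csharp
using System;

// Hypothetical stand-in for a validation context: a readonly struct that
// holds both object references and plain values.
public readonly struct DemoContext
{
    public DemoContext(object owner, object state, long evaluatedCount, bool isValid)
    {
        Owner = owner;
        State = state;
        EvaluatedCount = evaluatedCount;
        IsValid = isValid;
    }

    public object Owner { get; }
    public object State { get; }
    public long EvaluatedCount { get; }
    public bool IsValid { get; }

    // Produces an updated copy; the original value is left untouched.
    public DemoContext WithResult(bool isValid) =>
        new DemoContext(Owner, State, EvaluatedCount + 1, IsValid && isValid);
}

public static class DemoValidator
{
    // Takes the context by readonly reference and returns the updated copy.
    // Storing that result into a caller-supplied location is the struct copy
    // that the disassembly earlier in the thread shows changing between runtimes.
    public static DemoContext Validate(in DemoContext context, int value) =>
        context.WithResult(value >= 0);
}
```

A benchmark built around something like `ctx = DemoValidator.Validate(in ctx, value)` in a loop would isolate the copy from the JSON parsing, which is the direction the narrowing-down above was heading.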
|
@EgorBo not sure if we can use |
ah good point, that is jit then |
It is not, it will fault on hardware without AVX support. We could add a separate code path that checks for AVX support and then uses VEX encoded instructions (either a separate VEX loop or using The |
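As a managed-code analogy only (the helper under discussion is native code, and this is not how the runtime implements it), the idea of checking for AVX support before choosing a code path looks roughly like this. `CopyDispatch` and its members are hypothetical names; `Avx.IsSupported`, `Sse2.IsSupported`, `Vector256.Create<T>(ReadOnlySpan<T>)` and `CopyTo` are existing .NET APIs (.NET 7 or later).

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical illustration of runtime ISA dispatch.
public static class CopyDispatch
{
    // Reports which path the current hardware would take.
    public static string SelectedPath =>
        Avx.IsSupported ? "avx" : Sse2.IsSupported ? "sse2" : "scalar";

    // Copies exactly 32 bytes using the widest path the hardware supports.
    public static void Copy32(ReadOnlySpan<byte> source, Span<byte> destination)
    {
        if (source.Length < 32 || destination.Length < 32)
        {
            throw new ArgumentException("Both spans must hold at least 32 bytes.");
        }

        if (Avx.IsSupported)
        {
            // One 256-bit load and store.
            Vector256.Create<byte>(source).CopyTo(destination);
        }
        else if (Sse2.IsSupported)
        {
            // Two 128-bit loads and stores.
            Vector128.Create<byte>(source).CopyTo(destination);
            Vector128.Create<byte>(source.Slice(16)).CopyTo(destination.Slice(16));
        }
        else
        {
            // Fallback for hardware without either instruction set.
            source.Slice(0, 32).CopyTo(destination);
        }
    }
}
```

The point of the guard is the one made above: executing an AVX instruction on hardware without AVX support faults, so the wide path can only be taken after the check.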
And thanks to you and @idg10 for trying preview 7 and reporting the problem. Likely this might have gone unnoticed until the official release came out. |
@EgorBo are you going to work up a fix or do you want me to do it? |
I can if you want me to do so 🙂 but if Tanner is right, it is supposed to be a single-line like change in |
We just want these graphs to keep going in the right direction :) https://endjin.com/blog/2023/12/how-dotnet-8-boosted-json-schema-performance-by-20-percent-for-free |
Ok, I can put up a fix then. |
Let me know if there is a build you want me to help verify. |
I can get you a 9p7 compatible jit with a fix, if you're comfortable monkey-patching it onto your existing runtime install. Otherwise there won't be a fixed 9.0 version until RC2 (mid october). |
Let me change my benchmark to do some AVX heavy-lifting first:
@EgorBot -intel --runtimes net8.0 net9.0
using System;
using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<MyBench>(args: args);

public class MyBench
{
    [Benchmark]
    public void NotInHeap()
    {
        Struct1 s1 = new();
        Struct1 s2 = new();
        ByrefCopy(ref s1, s2);
    }

    Struct1 dst1;

    [Benchmark]
    public void InHeap_Empty()
    {
        ByrefCopy(ref dst1, default);
    }

    [Benchmark]
    public void InHeap_Ephemeral()
    {
        ByrefCopy(ref dst1, new Struct1(1, null, 3, new Struct2(4, null, 6, null)));
    }

    private byte[] array = new byte[1000];

    [MethodImpl(MethodImplOptions.NoInlining)]
    public bool ByrefCopy(ref Struct1 dst, Struct1 src)
    {
        // Do some inline AVX:
        array.AsSpan(0, 128).Clear();
        dst = src;
        return true;
    }
}

public record struct Struct1(
    object a1, object a2,
    long a3, Struct2 g);

public record struct Struct2(
    object a1, object a2, object a3,
    object a4); |
Benchmark results on Intel
|
Quite happy to monkey patch (with instructions!) |
@EgorBo Running your benchmark: those numbers are quite definitive
|
Thanks! that definitely confirms the issue @AndyAyersMS found. |
Save and unzip the attached. As admin, copy this DLL to the preview 7 folder (you may want to copy off the existing jit first so you can put things back later). For me this is at
This is a checked build of the jit so it will be 5MB+ when unpacked. If you enable disassembly with this in place you will see more verbose outputs. |
Can be obtained via dotnet --list-runtimes | Select-String -Pattern "Microsoft.NETCore.App 9.0" (powershell) |
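As an alternative to parsing the `dotnet --list-runtimes` output, a small program run on the target runtime can report its own install folder. This is a sketch: `RuntimeLocator` is a hypothetical name, and the JIT file names (`clrjit.dll`, `libclrjit.so`, `libclrjit.dylib`) are my assumption about which file is being replaced, since the thread does not name it. `RuntimeEnvironment.GetRuntimeDirectory()` is an existing API in `System.Runtime.InteropServices`.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
public static class RuntimeLocator
{
    // Directory of the shared runtime that is executing this process.
    public static string GetRuntimeDirectory() =>
        RuntimeEnvironment.GetRuntimeDirectory();

    // Expected path of the JIT library inside that directory
    // (file names are an assumption; check the folder before copying anything).
    public static string GetJitPath()
    {
        string fileName =
            OperatingSystem.IsWindows() ? "clrjit.dll" :
            OperatingSystem.IsMacOS() ? "libclrjit.dylib" :
            "libclrjit.so";
        return Path.Combine(GetRuntimeDirectory(), fileName);
    }
}
```

Printing `RuntimeLocator.GetJitPath()` from a project targeting net9.0 should point at the preview 7 folder, provided the app is framework-dependent rather than self-contained or single-file.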
The helper uses SSE2, so we need to take care to avoid AVX-SSE transition penalties. Fixes dotnet#106679.
With the monkey-patched DLL
And the original 3x perf hit benchmark:
We're now running 17% faster on .NET 9.0 (which puts the graph back on track!) |
…rier helper (#106937) * JIT: emit vzeroupper before calls to the bulk write barrier helper The helper uses SSE2, so we need to take care to avoid AVX-SSE transition penalties. Fixes #106679. * review feedback --------- Co-authored-by: Andy Ayers <andya@microsoft.com> Co-authored-by: Jeff Schwartz <jeffschw@microsoft.com>
…otnet#106908) The helper uses SSE2, so we need to take care to avoid AVX-SSE transition penalties. Fixes dotnet#106679.
Description
One of the benchmarks in our Corvus.JsonSchema library has wildly different performance characteristics on .NET 9.0 on certain hardware:
On the (rather old) Coffee Lake CPU, we see what you'd hope: .NET 9.0 is significantly faster than .NET 8.0.
On the much newer CPU (in a Surface Laptop Studio 2), in .NET 8.0 the benchmark runs a lot faster than on the old CPU, as is typical with a CPU that much newer. But running the same code on .NET 9.0 on that newer CPU is almost 3 times slower than with .NET 8.0 on the same CPU. (It's significantly slower even than .NET 8.0 on the much older CPU.)
To reproduce this, clone the repo from commit 2621745, run the Corvus.Json.Benchmark project, and select the ValidateLargeDocument benchmark.
The figures in the table above are for the ValidateLargeArrayCorvusV3 benchmark, but we see similar regressions (again, only on the newer CPU) for the ValidateLargeArrayCorvusV4 and ValidateLargeArrayCorvusValidator benchmarks.
We haven't yet succeeded in isolating whatever it is about this benchmark that produces these effects. So far our attempts to profile the code outside of BenchmarkDotNet haven't reproduced the issue. The code in question makes heavy use of System.Text.Json, so that's where we suspect the issue lies, but we can't prove that.
Configuration
.NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2 and .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
On the Coffee Lake machine (which doesn't have this problem) we're running Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3).
The newer machine (which exhibits the problem) is running Windows 11 (10.0.22621.4037/22H2/2022Update/SunValley2).
x64, Intel Core.
On the machine that does not reproduce the problem, the CPU is as already described. The machine has 64GB of memory. (I can provide more details if required.)
The machine on which we see the problem is a Surface Laptop Studio 2 with 64GB of RAM.
Regression
We're seeing the regression from .NET 8.0.7 to .NET 9.0.0-preview.7.24405.7 (BenchmarkDotNet reports this as 9.0.24.40506.)
We first observed this with .NET 9.0.0-preview.6.
Data
The Throughput benchmark results on the older (Coffee Lake) CPU (on which we don't see the regression) are:
As you can see, on that CPU .NET 9.0 does better than .NET 8.0.
Here are the same results for the 13th gen CPU in the Surface Laptop Studio 2:
As you can see, the .NET 9.0 numbers here (same preview build of .NET 9.0 - preview.7) for the first 3 benchmarks are significantly slower than for .NET 8.0. (And they are significantly slower than the same benchmarks on the much older CPU for either .NET 8.0 or 9.0 preview 7.)