Improve JSON (de)serialization perf of longer strings on mono #39733
Conversation
Diff under review:
- private static unsafe int IndexOfOrLessThan(ref byte searchSpace, byte value0, byte value1, byte lessThan, int length)
+ private static unsafe int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
What length is the break-even point where this is faster than the vectorized code below?
I am surprised this provides wins over the loop-unrolled sequential scan. Where is the overhead coming from?
This code was inspired by the general-purpose IndexOf method on Span<T>, here:
public static unsafe int IndexOf(ref byte searchSpace, byte value, int length)
Ideally, we should fix-up and make that method close to fastest for all input lengths to avoid having two separate code-paths, since that is used in a lot more places. I am not sure it is worth it to make a change like this here, just specific to JSON reading.
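For readers following along, the scalar variant under discussion is essentially a byte scan with three exit conditions. A minimal sketch (hypothetical names, not the actual PR code):

```csharp
using System;

static class Scanner
{
    // Hypothetical scalar sketch of IndexOfOrLessThan: returns the index of the
    // first byte equal to value0 or value1, or strictly below lessThan; -1 if none.
    public static int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
    {
        for (int i = 0; i < span.Length; i++)
        {
            byte b = span[i];
            if (b == value0 || b == value1 || b < lessThan)
                return i;
        }
        return -1;
    }
}
```

In the JSON reader, value0/value1 are presumably the quote and backslash bytes, with lessThan at the control-character threshold (0x20).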
What length is the break-even point where this is faster than the vectorized code below?
The manual loop is always faster when running under the Mono interpreter (I tried from 10 characters to 10,240 characters as per the table above).
It would be good to know when else Vector.IsHardwareAccelerated is false so I could run some benchmarks other than the Mono interpreter.
What about less than 10 characters? For example, it is common for property names to be between 3 and 6 chars.
The manual loop is always faster when running under the Mono interpreter
What about the regular JIT/compiled environments? I am concerned this will hurt those environments, particularly for larger strings which could benefit from loop unrolling.
I believe there is a way to turn off IsHardwareAccelerated through some JIT flag or environment variable to simulate it. @AndyAyersMS might know. You could also just assume IsHardwareAccelerated is false in your local testing and remove that branch to see the impact.
I want to echo what @jkotas mentioned here: #39323 (comment)
We should try to avoid trade-offs that try to improve the interpreter environment at the cost of JIT environments and code maintainability/duplication.
However, if all cases improve, then I can see an argument to be made for such a change.
Run with COMPlus_EnableHWIntrinsic=0 to disable hardware intrinsics.
The existing IndexOfOrLessThan method should work just fine on platforms without SIMD/Vectors. Vector.IsHardwareAccelerated will be constant, and then it will go into the manually unrolled non-SIMD loop.
I believe that this change will make the performance measurably worse on platforms without SIMD/Vector acceleration (e.g. CoreCLR arm) because we will take this version that does not have the loop unrolled.
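The reason the Vector.IsHardwareAccelerated check is free on the JIT is that it is evaluated as a compile-time constant, so the untaken branch is eliminated entirely; only an interpreter pays for the check at run time. A minimal illustration (hypothetical method name):

```csharp
using System;
using System.Numerics;

static class Demo
{
    // Vector.IsHardwareAccelerated is treated as a constant by the JIT, so one
    // of these branches is dead code after compilation; an interpreter instead
    // evaluates the property on every call.
    public static string PickPath()
        => Vector.IsHardwareAccelerated ? "vectorized" : "scalar (unrolled)";
}
```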
There may also be SIMD support for WASM eventually. My understanding is that Google already supports it in V8 and there is an entire repo + proposals/extensions attempting to officially add SIMD as an available extension: WebAssembly/simd#31 and https://webassembly.org/roadmap/
What about less than 10 characters
The perf is not affected when <= 10 chars under the interpreter (within the margin of error).
I increased the iterations to 100_000 with string length of 3:
Before:
serialize: Serialization took 1536 ms
deserialize: Deserialization took 2141 ms

After:
serialize: Serialization took 1550 ms
deserialize: Deserialization took 2072 ms
These optimizations are focused on improving the WASM interpreter which doesn't have SIMD/Vectors in the MVP platform. The numbers appear to show that the JIT doesn't suffer measurably?
This PR was intended to help the Mono interpreter, but it helps even more with the Mono JIT. However, the original code path remains for Core JIT unless Vector.IsHardwareAccelerated==false, which means there is no hardware-accelerated SIMD support.
The benchmarks were performed against the raw Mono interpreter (not running under WASM), meaning the interpreter today doesn't support SIMD acceleration regardless of whether it runs under WASM.
What about the regular JIT/compiled environments? I am concerned this will hurt those environments, particularly for larger strings which could benefit from loop unrolling.
Regular JIT/compiled environments use the existing code path unless Vector.IsHardwareAccelerated==false.
I believe that this change will make the performance measurably worse on platforms without SIMD/Vector acceleration (e.g. CoreCLR arm) because we will take this version that does not have the loop unrolled.
Thanks for the CoreCLR arm reference. As I mentioned previously, I wasn't sure on what other platforms Vector.IsHardwareAccelerated would be false.
For a manual test, I commented out the first branch in IndexOfOrLessThan(ReadOnlySpan<byte> span, ...):
if (length >= Vector<byte>.Count * 2)
{
int unaligned = (int)Unsafe.AsPointer(ref searchSpace) & (Vector<byte>.Count - 1);
nLength = (IntPtr)((Vector<byte>.Count - unaligned) & (Vector<byte>.Count - 1));
}
to avoid the Vector optimizations. The large 10K string test ran ~1.75x slower, so yes, we should do something. I believe the unsafe code in the loop unrolling is causing the interpreter to be slow. There is currently no option to detect the interpreter, and that is probably not something we want to expose anyway.
However, there is RuntimeFeature.IsDynamicCodeCompiled, which will be false under the interpreter and could be used in conjunction with Vector.IsHardwareAccelerated to ensure the manual loop that was added in this PR only runs when both are false. Thoughts?
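The combined guard proposed here could look like the following (a hypothetical sketch; UseSimpleManualLoop is an invented name):

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

static class GuardSketch
{
    // Hypothetical guard: take the simple manual loop only when the code is
    // interpreted (no JIT) AND there is no SIMD acceleration. This keeps
    // CoreCLR arm (JIT present, no SIMD) on the existing unrolled path.
    public static bool UseSimpleManualLoop()
        => !RuntimeFeature.IsDynamicCodeCompiled && !Vector.IsHardwareAccelerated;
}
```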
The browser test failure is #39473
Another option is to have the interpreter intrinsify this method.
For the serialization perf gain scenarios, the changes here are still necessary unless we have the interpreter intrinsify it instead. If we decide to intrinsify, then I will remove the changes to STJ here and create a Mono issue to intrinsify it. @BrzVlad, what do you think about intrinsifying it?
What are the "two separate code-paths"? This method could also be intrinsified by the interpreter (as well as the other one mentioned).
@steveharter Those checks in libraries code are not the prettiest, and intrinsifying seems very easy and would also lead to much faster code. How much time do you estimate is being spent in this method? I don't remember noticing it in my original benchmark.
@BrzVlad you'll need a test with larger strings (say 1K) to effectively measure. I'll remove the changes to S.T.J.dll in this PR and assume the intrinsifying will be done separately, where it can be tracked and measured. Note that only deserialization is affected - the other changes to the S.T.E.W.dll library that affect serialization in this PR can't be intrinsified, and I'll keep those in this PR.
Commits e54c40b to 8a44188 (compare)
Failure is unrelated.
Speeds up JSON deserialization by 5-10%. Following discussion from dotnet/runtime#39733
Improves the serialization and deserialization performance of the Mono interpreter + POCO with a 1K string property by ~1.4x
Improves the serialization performance of Mono JIT + POCO with a 1K string property by ~2.5x and deserialization by 1.1x.
The same tests but with the property at only 10 characters (instead of 1,024) were too small to measure effectively (without increasing the loop count).
No regressions noted with Core JIT.
The changes to S.T.E.W.dll are responsible for the serialization gains and the S.T.J.dll changes are responsible for the deserialization gains.
The test POCO is the existing SimpleStructWithProperties, which was used since it shows the largest difference across the various benchmarks measuring Mono JIT vs. the Mono interpreter.

The serialization gains came from modifying System.Text.Encodings.Web to avoid accessing a static ReadOnlySpan<byte> table for every character; the assumption is that Core JIT's tiered compilation avoids the overhead of checking whether the table was initialized (after a few runs). There are also some interpreter gains from avoiding the overhead of calling a method for every character while escaping.

The deserialization gains came from modifying System.Text.Json to optimize the check for the first character to unescape (typically the trailing quote) by avoiding unsafe pointers.
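The table-hoisting idea described for the serialization path can be sketched as follows (hypothetical names and table contents, not the actual S.T.E.W. code):

```csharp
using System;

static class EscapeSketch
{
    // Hypothetical per-character escape table: control chars, quote, backslash.
    private static readonly byte[] s_needsEscaping = BuildTable();

    private static byte[] BuildTable()
    {
        var t = new byte[128];
        for (int i = 0; i < 0x20; i++) t[i] = 1; // control characters
        t['"'] = 1;
        t['\\'] = 1;
        return t;
    }

    // Read the static table into a local span once, before the loop, so the
    // per-character work avoids the repeated static-field (and class
    // initialization) check that is costly on the interpreter.
    public static int FirstIndexToEscape(ReadOnlySpan<byte> text)
    {
        ReadOnlySpan<byte> table = s_needsEscaping; // hoisted out of the loop
        for (int i = 0; i < text.Length; i++)
        {
            byte b = text[i];
            if (b >= 128 || table[b] != 0)
                return i;
        }
        return -1;
    }
}
```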
The tests were measured by running the benchmark 5 times and picking the fastest. The loop executed 10,000 times.
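The measurement method described (best of 5 runs, fixed inner loop count) can be sketched as follows (a hypothetical helper, not the actual benchmark harness):

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Runs the whole benchmark body `runs` times, each run executing the inner
    // loop `iterations` times, and reports the fastest run in milliseconds.
    public static long MeasureFastestMs(Action body, int runs = 5, int iterations = 10_000)
    {
        long best = long.MaxValue;
        for (int r = 0; r < runs; r++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++) body();
            sw.Stop();
            if (sw.ElapsedMilliseconds < best) best = sw.ElapsedMilliseconds;
        }
        return best;
    }
}
```

Taking the fastest run rather than the mean is a common way to reduce noise from warm-up and background activity in simple loop-based benchmarks.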