Optimize System.Buffers for arm64 using cross-platform intrinsics #35033
Comments
Tagging subscribers to this area: @tannergooding |
I can give this a shot |
These ARM64 intrinsics are missing to make the code "worthwhile":
Tracked by #33398:
These helper methods could be accelerated to help it and are tracked in #33496:
These intrinsics could be introduced to maximise perf:
|
Draft commit, most that can be done until necessary intrinsics are added |
Thanks @john-h-k, as per our conversation on Discord, I'll ping you once the necessary intrinsics are added so you can make further progress. |
@tannergooding @john-h-k The shift intrinsics went in with #36830. There have been many improvements to System.Runtime.Intrinsics optimization in #33496. Is this work now unblocked? |
Yes, this should be unblocked now. |
@john-h-k Are you planning to try and finish this for .NET 5, now that it's unblocked? |
If not, we'll have 3-4 people free to help tackle the remaining issues from #33308 on Monday. CC. @jeffhandley |
Yes, I should be. I just need to benchmark the 2 possible impls I've got locally, either on my old Pi2b or by getting a friend to do it on a newer device |
@tannergooding @jeffhandley What's the plan for this for .NET 5? |
@tannergooding is planning to carry this one over the finish line ahead of RC1. |
We could not get this completed in time for 5.0.0; moved to 6.0.0. |
@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this. Edit: For some reason, I am not able to assign this. |
Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue. |
I'm actively working on the support locally. Hoping to get it done soon but needing to balance that work against the Generic Math work. |
|
I can't assign this either. It should go to @SwapnilGaikwad instead of me. |
IMO it's better to wait for #63331. I'm also going to update https://github.com/gfoidl/Base64, where fuzz tests run on a scheduled CI trigger, so any fault should get caught early enough for .NET 7 RTM. |
Noting that only the The |
We are looking at this. With a recent HEAD, I took the SSE2 C# code and (mostly) updated it to use Vector128. Annotated IR dump of part of the function:
|
The above dump was using |
Can you share the managed code that was being used here, preferably including the |
This code was made by taking Ssse3Decode(), replacing Sse2 with Vector128, then fixing up the function calls. Quite possible there's an obvious mistake on my end.
|
Although, the lack of an Arm64 version of shuffle isn't surprising given there is no obvious match for the instruction |
From the top of Base64Decoder.cs: // AVX2 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/avx2 The plan is to base the Arm64 version on the Neon version from the same repo. Because that version is based on table lookups instead of shuffles, there is likely to be no sharing between the x86 and Arm64 versions. |
And that should be fine. |
In order to reuse the aklomp algorithm, there are still missing API functions:
These all require sequential registers - which AIUI is quite a big task to implement. Can we still do the routine without them?
|
@a74nh - Just FYI, we have hardware intrinsics APIs exposed under https://source.dot.net/#System.Private.CoreLib/AdvSimd.PlatformNotSupported.cs,f54d7472d19d4e2b |
can the byte-versions be used instead here? I see the x64 code uses byte indices |
Those are the single register version - which we can use but we'll need to use 4 of them. So, doable but not optimal. |
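To illustrate the "use 4 of them" idea, here is a scalar Python model of building a 64-byte table lookup out of four single-register lookups. This is only a sketch, not the C# from any PR: it assumes the usual Arm64 TBL behaviour where an out-of-range index yields 0, and the helper names `tbl1` and `tbl4_via_tbl1` are made up for this example.

```python
def tbl1(table16, indices):
    """Model of a single-register TBL: out-of-range indices give 0."""
    assert len(table16) == 16
    return [table16[i] if i < 16 else 0 for i in indices]


def tbl4_via_tbl1(table64, indices):
    """Emulate a 64-byte table lookup with four 16-byte lookups."""
    assert len(table64) == 64
    result = [0] * len(indices)
    for k in range(4):
        chunk = table64[16 * k:16 * (k + 1)]
        # Rebase the indices onto this chunk. The subtraction wraps like an
        # unsigned byte, so indices outside the chunk end up >= 16 and the
        # lookup returns 0 for them.
        rebased = [(i - 16 * k) & 0xFF for i in indices]
        partial = tbl1(chunk, rebased)
        # At most one chunk contributes a non-zero value per lane.
        result = [r | p for r, p in zip(result, partial)]
    return result
```

Each of the four passes costs a subtract, a lookup and an OR, which is the kind of extra work that makes this approach "doable but not optimal" compared with a single multi-register lookup.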
An implementation of DecodeFromUtf8 without using LD4/ST3/TBX4 is here: #70336. Currently getting a 3x slowdown using it. If we can't make it faster, then it's probably best to wait for LD4/ST3/TBX4 |
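For context on why LD4/ST3 matter here, the following is a scalar Python model of a de-interleaving load and an interleaving store wrapped around a Base64 decode. It is a sketch under assumptions: `ld4`, `st3` and `decode_block` are hypothetical names, real hardware works on fixed-size registers rather than arbitrary-length lists, and the character-to-value step is a plain dictionary here instead of a vector table lookup.

```python
import base64

ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
DECODE = {c: v for v, c in enumerate(ALPHABET)}


def ld4(data):
    """Model of LD4: de-interleave a byte stream into four 'registers'."""
    return [list(data[i::4]) for i in range(4)]


def st3(regs):
    """Model of ST3: interleave three 'registers' back into a byte stream."""
    return bytes(b for triple in zip(*regs) for b in triple)


def decode_block(encoded):
    """Decode unpadded Base64 whose length is a multiple of 4."""
    assert len(encoded) % 4 == 0
    # After the de-interleaving load, register n holds the n-th character of
    # every 4-character group, so the bit packing below is purely lane-wise.
    a, b, c, d = ([DECODE[ch] for ch in reg] for reg in ld4(encoded))
    out0 = [((x << 2) | (y >> 4)) & 0xFF for x, y in zip(a, b)]
    out1 = [((y << 4) | (z >> 2)) & 0xFF for y, z in zip(b, c)]
    out2 = [((z << 6) | w) & 0xFF for z, w in zip(c, d)]
    return st3([out0, out1, out2])


sample = bytes(range(48))  # 48 bytes encode to 64 characters, no padding
assert decode_block(base64.b64encode(sample)) == sample
```

Without the interleaving load and store, the same regrouping of 4 input characters into 3 output bytes has to be done with extra permutes inside the registers, which is one plausible source of the slowdown mentioned above.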
Closed by #70654 |
This item tracks the conversion of the System.Buffers class to use arm64 intrinsics.
Update for .NET 7
Now that we have cross-platform intrinsics APIs (#49397), these optimizations should be completed using those APIs instead of adding ARM64-specific intrinsics code paths. The effort could optionally include first measuring performance of these methods with the ARM64-specific intrinsics in place, and then measuring the performance of these methods with the cross-platform intrinsics.
Example use of the new cross-platform intrinsics: #63722
Related: #33308