Optimize the fletcher4 neon implementation #14219
base: master
Conversation
Force-pushed from 2507c0e to 10c0591
On which platform are those results? I worry that this might be better on some platforms and worse on others.
Hi @rincebrain, thanks for the comment. The implementation just uses two add pipelines concurrently, so that in each cycle the Arm64 CPU can run the add operations out of order to accelerate the loop. I got these results on an Arm Neoverse N1 (160C, 512G, NVMe) Ampere Altra. Besides, I have tested on two other platforms and got the same improvement. Arm v8.2 8C16G VM (Huawei Cloud, should be Kunpeng 920):
On the Alibaba Cloud, Yitian 710, Arm V9 2C 8G NVME:
It's a little bit surprising that the superscalar implementations are winning on that many armv8 implementations. Does the NEON pipeline on these things only have 1 execution port or something? On x86 I think it's universally faster to use SIMD for the checksums.
@@ -88,7 +88,7 @@ typedef struct zfs_fletcher_avx512 {
} zfs_fletcher_avx512_t;

typedef struct zfs_fletcher_aarch64_neon {
uint64_t v[2] __attribute__((aligned(16)));
Why do we drop the aligned attribute?
I have a patch doing that pending anyway, because we're lying when we claim that.
Good point.
I'm not quite sure here. Since I rewrote the NEON fletcher4 based on superscalar4, I needed to match the superscalar4 data structure.
One possibility is that the SIMD hardware in these ARM processors is emulating SIMD via scalar operations rather than actually doing SIMD in hardware. Is there any public documentation on how the NEON instructions in those processor cores work?
Thinking about this some more, it would make sense that custom processors used by hyperscalers would emulate NEON instructions in scalar operations. Emulation would allow the die area to be reused for other things like more cores or more cache, or simply reduced for better yields. If I were designing a processor for a hyperscaler, emulating the NEON instructions seems like it would be a fairly straightforward decision, especially since most of their customers are moving SIMD workloads to GPGPU code these days.
Usually the NEON implementation is better than the superscalar ones; according to the tests, on most of these CPUs the NEON one beats the superscalar one. But on some specific CPUs the NEON instruction latency is much larger than that of the ordinary CPU instructions; that is a hardware problem and will be fixed in future versions. So generally the NEON one is better.
Yes, another reason may be what @ryao referred to: the tests on the hyperscalers were run in VMs, while the better NEON result I got is from a bare-metal machine.
If it is not too much trouble, it would be interesting to see benchmark numbers from an Apple Silicon ARM machine. Their hardware definitely has a performant NEON implementation, since Rosetta 2 translates SSE2 instructions into NEON instructions: https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/ You would want to use Asahi Linux for that:
At least from my experience with it on zlib-ng with the adler checksum, native neon is for sure faster with M1 and many SIPs based on the Cortex family. A possible suggestion based on work @dougallj has done on the zlib side of things: has anyone implemented Fletcher4 with umla2? It'd require 2 instructions for both halves but it would both widen and accumulate back to a, b, c, & d. You'd have to do runtime detection to see if it's supported but I suspect it might be faster. Edit: err, perhaps this was a stupid suggestion given the vectorized Fletcher implementations don't actually multiply the mixing matrix until the finalization of the checksum and can handle things instead with a straight addition, without the need to handle potential overflow at each sum with a rebase like adler does.
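For reference, the Fletcher-4 inner loop really is just four 64-bit running sums per 32-bit input word, which is why no per-step rebase is needed. A minimal scalar sketch (function and variable names here are purely illustrative, not from this PR):

#include <stdint.h>
#include <stddef.h>

/* Minimal scalar Fletcher-4 inner loop, for illustration only.
 * The 64-bit accumulators absorb 32-bit words without any per-step
 * modulo/rebase, which is the contrast with Adler-32 noted above. */
static void
fletcher4_scalar_sketch(const uint32_t *ip, size_t nwords,
    uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d)
{
	for (size_t i = 0; i < nwords; i++) {
		*a += ip[i];	/* running sum of the data words */
		*b += *a;	/* running sum of a */
		*c += *b;	/* running sum of b */
		*d += *c;	/* running sum of c */
	}
}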
asm("ld1 { %[AA2].4s, %[A3A4].4s }, %[CTX0]\n" \ | ||
"ld1 { %[BB2].4s, %[B3B4].4s }, %[CTX1]\n" \ | ||
"ld1 { %[CC2].4s, %[C3C4].4s }, %[CTX2]\n" \ | ||
"ld1 { %[DD2].2d, %[D3D4].2d }, %[CTX3]\n" \ |
It is a pre-existing issue, but this likely is broken on big endian:
https://gist.github.com/lirenlin/8647388
ldr might be a better choice here. The two probably perform the same.
@ryao Sorry for the late response here. The ld1-ld4 instructions are dedicated to the NEON registers, and it is easy to use ld1-ld4 to operate on up to four registers at once, whereas ldr can only operate on one register [1]. So ld1-ld4 can actually save some CPU cycles when loading the data from memory.
In my opinion, we don't need to worry about the big-endian issue. The link you referred to shows the issue for ld1 on big endian. Arm64 supports both little endian and big endian, but the endianness choice is IMPLEMENTATION DEFINED [2], and in practice there is no real Arm64 big-endian hardware. The lack of use cases and the gaps in upper-layer software support hold Arm64BE back; e.g., there is no Golang support for Arm64BE [3].
for (; ip < ipend; ip += 2) {
NEON_MAIN_LOOP(NEON_DONT_REVERSE);
for (; ip < ipend; ip += 4) {
NEON_MAIN_LOOP_LOAD(NEON_DO_NOT_REVERSE);
It might be helpful if we had the assembly/disassembly from this loop to study. For example, Clang generates the following for a version of the fletcher4 code that I will probably put into a PR soon:
.LBB0_2: // =>This Inner Loop Header: Depth=1
ldr q16, [x1], #16
uaddw v4.2d, v4.2d, v16.2s
uaddw2 v0.2d, v0.2d, v16.4s
add v1.2d, v0.2d, v1.2d
add v5.2d, v4.2d, v5.2d
add v6.2d, v5.2d, v6.2d
add v2.2d, v1.2d, v2.2d
add v3.2d, v2.2d, v3.2d
add v7.2d, v6.2d, v7.2d
cmp x1, x8
b.lo .LBB0_2
I am not an expert on ARM assembly, but this looks as tight as possible. It is unfortunate that GCC generates much worse assembly from the same code that Clang compiled to generate this. The NEON_MAIN_LOOP_LOAD code looks similar to this, although I wonder what GCC generated around it.
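For comparison, generic GNU C along the following lines compiles to that kind of loop with Clang. This is only a sketch of the idea (identifiers are illustrative and it is not the exact #14234 source):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t v2du __attribute__((vector_size(16)));

/* Illustrative only: four words per iteration feed two independent 2-lane
 * accumulator sets (low/high halves), so the dependent add chains of the
 * two halves can issue in parallel, much like the uaddw/uaddw2 loop above. */
static void
fletcher4_vec_sketch(const uint32_t *ip, size_t nwords,
    v2du lo[4], v2du hi[4])
{
	for (size_t i = 0; i < nwords; i += 4) {
		v2du wl = { ip[i], ip[i + 1] };		/* widen low pair  */
		v2du wh = { ip[i + 2], ip[i + 3] };	/* widen high pair */
		lo[0] += wl;		hi[0] += wh;
		lo[1] += lo[0];		hi[1] += hi[0];
		lo[2] += lo[1];		hi[2] += hi[1];
		lo[3] += lo[2];		hi[3] += hi[2];
	}
}

Each += statement maps to one NEON add, and the two halves have no dependency on each other, which is what lets the additions issue on separate SIMD pipes.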
Nice point, I will rewrite the asm here to follow this pattern.
Here is a sandbox you can use to see the assembly generated by clang:
https://gcc.godbolt.org/z/aP9ff3jPh
I realize that the x86-64 “version” of clang was used, but the beauty of LLVM is that one binary can support multiple architectures by setting the target. Since that site does not have a clang explicitly for aarch64/arm, I set a target as a compiler flag instead.
You can grab the byteswap version from #14234 and put it there to see what is generated for the byteswap case. Make sure to drop the static keyword when you do or else you will not get any assembly since the compiler will think it is dead code.
If you are compiling the code for arm/aarch64 locally, you can use objdump -d /path/to/zfs_fletcher_aarch64_neon.o to see the assembly that gcc generated. I suggest building with --enable-debuginfo if you do, so that the static symbols will be marked properly in the disassembly. That makes it fairly simple to identify the main loop. You just need to look for the basic block that can jump to its own start inside the function’s disassembly.
NEON_MAIN_LOOP(NEON_DONT_REVERSE);
for (; ip < ipend; ip += 4) {
NEON_MAIN_LOOP_LOAD(NEON_DO_NOT_REVERSE);
PREF1KL1(ip, 1024)
Have we confirmed that prefetch helps here? It is possible for prefetch instructions to be harmful. In this case, on the last loop iteration, this instruction will have prefetched 1008 bytes into L1 cache that are unlikely to ever be used, which means that we are filling the L1 cache with garbage that we are unlikely to use. This might not show a performance issue in the benchmark, but it could slow down the system elsewhere.
In addition, Linux mainline has had various less than flattering things to say about explicit prefetch instructions over the years:
https://lwn.net/Articles/404103/
https://lwn.net/Articles/444344/
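For anyone unfamiliar with the macro, an explicit software prefetch in GNU C looks roughly like the sketch below. The macro name and distance here are illustrative; PREF1KL1 presumably wraps something equivalent, either this builtin or a prfm instruction:

/* Illustrative sketch only. __builtin_prefetch(addr, rw, locality):
 * rw = 0 requests a prefetch for read, locality = 3 asks to keep the
 * line in all cache levels (effectively L1). */
#define PREFETCH_1K_L1(ptr) \
	__builtin_prefetch((const char *)(ptr) + 1024, 0, 3)

/*
 * Used inside the main loop, this prefetches 1 KiB ahead of the current
 * input pointer. Note the concern above: during the final iterations it
 * pulls in lines past the end of the buffer that will never be used.
 */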
@ryao Thanks for pointing that out. I did some tests on the following platforms and got a performance increase on the Kunpeng 920 VM. On Ampere Altra and AWS Graviton 3, the L1 prefetch does not give any improvement. So I agree with you that we can remove the prefetch code here.
Ampere Altra:
With L1 Prefetch
0 0 0x01 -1 0 20253112240840 20293116392520
implementation native byteswap
scalar 4465636046 3986526251
superscalar 6288311844 5471735330
superscalar4 7267898111 5856494177
aarch64_neon 9549582587 7824610103
fastest aarch64_neon aarch64_neon
Without L1 Prefetch
0 0 0x01 -1 0 20361322942120 20362454024320
implementation native byteswap
scalar 4441989324 4047774560
superscalar 6273451195 5428114404
superscalar4 7273948180 5848654377
aarch64_neon 9448333032 7895309088
fastest aarch64_neon aarch64_neon
Kunpeng 920 VM
With L1 Prefetch
0 0 0x01 -1 0 28699859899380 28700387328790
implementation native byteswap
scalar 3505946470 3087293256
superscalar 4915305602 4253383496
superscalar4 5473091929 4556896233
aarch64_neon 5199690074 4318504180
fastest superscalar4 superscalar4
Without L1 Prefetch
0 0 0x01 -1 0 28592205443970 28592735589200
implementation native byteswap
scalar 3485493950 3145971812
superscalar 4925580275 4170573437
superscalar4 5476808168 4574764270
aarch64_neon 4804224295 4147109889
fastest superscalar4 superscalar4
It would also be interesting to check on AWS Graviton 3, with both the original and updated NEON versions, as the added parallelism should shine on the Neoverse V1 core - it has 4 full (128-bit) NEON pipelines. At some point, someone will have to tackle the SVE version of the Fletcher & RAID-Z code, if only to see if that's better than NEON on V1 and A64FX. It's on my TODO list but I haven't found the time yet.
Hi @rincebrain, I'm also interested in SVE support for RAID-Z and Fletcher4. We have A64FX hardware in the lab, and I plan to work on SVE support after this NEON one is done. Below are the test results for AWS Graviton 3, 4C 8G:
Beware that A64FX has 512-bit SVE, while Graviton 3 (Neoverse V1) has 256-bit. The code needs to be scalable, or to adapt to the machine - I haven't looked at Fletcher recently, but IIRC the algorithm is width-dependent (well, dependent on the number of parallel computations). Also, I don't know about integer, but for FP the SVE latencies are quite high on A64FX; it's a throughput machine, not a latency one.
Thanks! As expected/hoped, NEON is quite good (vs. superscalar4) and the unrolled version really shines with >30% improvement on native. V1 does benefit from those 4 128-bit pipelines on this code.
It certainly is. I also just discovered that Zen 3 is capable of doing multiple vector additions in parallel, which allowed widening the parallel additions to be a performance win: This effect likely also exists on ARM processors, if any of them are capable of doing multiple vector additions in parallel, i.e. superscalar SIMD.
Yeah, generally this is possible to varying extents on all but the smallest ARM cores. For example, the widest cores like the Cortex-X2 and Apple M1 can run four SIMD operations per cycle. Your code from above is pretty much perfect for them:
The add/uaddw[2] operations give a critical-path latency of two cycles, so you'd want at least eight SIMD operations to take full advantage of the core, and that's exactly how many you have :)
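To spell out the arithmetic behind that claim (a back-of-the-envelope estimate, not a measurement): with four SIMD pipes and a two-cycle dependent-add latency, keeping every pipe busy needs at least

	4 pipes x 2 cycles of latency = 8 independent SIMD operations in flight,

which matches the eight add/uaddw[2] instructions in the loop shown earlier.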
In theory, #14234 should now support aarch64. It does 8 parallel accumulator streams. GCC 12.2 generates the following for the native case:
For the byteswap case, GCC 12.2 generates:
It has 1 more instruction than Clang's output, but it is close to optimal for 8 parallel accumulator streams. That might be overkill for the hardware, but I guess it would be good for future hardware. You can see the full disassembly for the important things from various compilers at godbolt: https://gcc.godbolt.org/z/Kfja8Tzxq In particular, GCC versions older than GCC 12 generate very suboptimal assembly.
@ryao, thanks a lot! It looks like the GCC 12 produced asm is quite nice for boosting the NEON performance. I will follow this to rewrite the code.
I would suggest staying with the 4 accumulator version here, but optimizing it based on the clang output that I provided earlier. If both PRs are merged and you switch to the 8 accumulator version, we will end up with two implementations that are basically the same (when GCC 12 or later, or Clang is used), which would mean one should be dropped in favor of another. Using different accumulator counts means that they can co-exist.
Also, more accumulator streams is not necessarily better. When the key loop is already optimal, adding more accumulator streams will not speed it up, but it will slow down the recombination needed to be done at the end, which becomes increasingly complex as the accumulator stream count increases. Having implementations for both the 4 stream and 8 stream levels is likely ideal.
I am hopeful that we could use generic GNU C to move away from hard coding assembly. #14234 serves as a proof of concept for that. The current results are good enough that it should be able to be merged based on them. At the moment, that PR needs people to test it to verify it has good performance on both aarch64 and ppc64, plus that it works correctly. I expect it does based on theory, but theory and practice are not always in agreement.
That being said, the review process of this PR was what inspired me to do #14234. I have been very surprised by what I have found as a result of that. I remember thinking about doing vector operations in generic C years ago to support multiple architectures, but compiler technology was not good enough for it at the time. I had forgotten about the idea until @rincebrain suggested it to me independently when I mentioned that I was looking into this. Things are not only different now, but on modern hardware, the result beats the work by Intel in 2015 that I had considered to be the best possible from AVX2, which has amazed me. :)
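To make the recombination cost concrete: Fletcher-4 keeps four running sums over the 32-bit words f_1 ... f_n,

	a_n = a_{n-1} + f_n
	b_n = b_{n-1} + a_n
	c_n = c_{n-1} + b_n
	d_n = d_{n-1} + c_n

With k parallel streams, each stream only sees every k-th word. The final A is simply the sum of the per-stream a values, but B, C, and D are position-weighted sums, so the per-stream results must be recombined with weights that depend on k and each stream's offset (the mixing matrix mentioned elsewhere in this thread), and that finalization work grows with the stream count.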
@kevinzs2048 The optimal assembly for the 4 accumulator version for aarch64 is likely this: https://gcc.godbolt.org/z/Gh71o5vPb Note that I changed the for loop to a do-while loop. This eliminates an unnecessary check to see if the loop will iterate, but it always will unless someone gave us a 0 byte buffer, so we can just assume that. In theory, you could just save that output as
That should give us maximum performance on typical aarch64 processors. We would want to make sure that this is disabled on big endian builds.
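As a small illustration of the for-to-do-while change described above (the loop body is elided and the identifiers follow the diff hunks earlier in this PR, so treat this as a sketch rather than the exact code):

/* Before: the loop condition is tested once before the first iteration. */
for (; ip < ipend; ip += 4) {
	/* ... accumulate 4 words ... */
}

/* After: the first test is skipped, which is safe as long as the caller
 * never passes a zero-length buffer, so at least one iteration runs. */
do {
	/* ... accumulate 4 words ... */
	ip += 4;
} while (ip < ipend);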
Hi @ryao:
There was a small mistake in the byteswap case for the prefetch code I provided where tend was not initialized to ip: https://gcc.godbolt.org/z/5YM7K3n6d That said, you seem to have already realized it: I thought that I had made the same changes to both, but missed that. Sorry for the trouble that caused.
@ryao I agree with you, and will submit another patch for aarch64_neon_prefetch to address this. For the performance, three implementations:
The original commit:
No worries, after the core dump I found that the register was not right. :-)
That means that the core design does not work the way I thought it did. I guess the previous version is the best we can do there for the byteswap case.
Force-pushed from 1b530b6 to c10b5f3
@kevinzs2048 The buildbot is sometimes flaky. That is not the result of anything you did here. That being said, there are a few problems with the current version that will cause it to generate incorrect checksums. This is easy to verify by creating a pool with an unpatched driver and then trying to use it with the patched driver:
The definition of …
We avoided writing beyond our heap allocation only because …
The 2 stream mixing matrix in …
In hindsight, I should have explained where to find the 4 stream mixing matrix. Also, after you change the mixing matrix, the indentation will need to be fixed according to what …
Force-pushed from a60a7f3 to 634f7df
Optimize the neon implementation using code generated by GCC 12 or Clang from the commit openzfs#14234. Due to a compiler limitation, GCC versions before 12 produce unoptimized code for Aarch64, so I use the asm function to accelerate the performance here.

Performance data for Arm64 N1 server, Ampere Altra, 160C, 512G

The original neon implementation:
0 0 0x01 -1 0 62475043711240 62475677462040
implementation native byteswap
scalar 4465160644 4037798914
superscalar 6233546354 5449837582
superscalar4 7310461184 5828498374
aarch64_neon 8042460500 7427753772
fastest aarch64_neon aarch64_neon

The optimized neon:
0 0 0x01 -1 0 582868704627960 582869571278240
implementation native byteswap
scalar 4466111548 4006174065
superscalar 6217652465 5435852773
superscalar4 7305877024 5884760221
aarch64_neon 11783959542 10005496183
fastest aarch64_neon aarch64_neon

Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
Co-authored-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Force-pushed from 634f7df to cd43be5
This looks good to me.
@behlendorf Could you help to review? Thanks! @ryao already left comments here:
He just got back from a break and has a backlog of PRs to review. Marking this PR with the code review needed flag put it on a list of PRs that should have his attention soon. The last couple weeks are a slow time of year for getting things merged since many project members are on vacation.
@ryao Thanks for the info, no hurry, I will wait.
void
fletcher_4_aarch64_neon_native(
	fletcher_4_ctx_t *ctx, const void *buf, uint64_t size)
nit: extra tab
Suggested change:
fletcher_4_ctx_t *ctx, const void *buf, uint64_t size)
@@ -0,0 +1,121 @@
/*
* GNU C source file that we manually compile with Clang for aarch64.
* The compiled assembly goes into:
Can you extend this comment to show the exact steps to generate module/zcommon/zfs_fletcher_aarch64_neon_impl.S.
clang -O2 --target=aarch64 -S -o module/zcommon/zfs_fletcher_aarch64_neon_impl.S contrib/zcommon/zfs_fletcher_aarch64_neon_impl.c
should do it.
Thanks for the contribution! This LGTM based on your test results; however, I wasn't able to test it myself. It would have been nice to be able to test this on a wider range of hardware, but I think we can still move forward with it.
This makes a few fundamental improvements over the existing code.
The result should perform better than the existing code on all aarch64 CPUs where we are not already memory bandwidth limited.
I've gone ahead and pulled in #14228 to master. If you could rebase this PR, make any needed updates, and address those two minor outstanding comments this should be good to go. Thanks!
@behlendorf It occurs to me that we might be able to treat this in a special way considering that this file is machine generated by LLVM/Clang and Windows support is being developed exclusively around LLVM/Clang. What I am thinking is:
In theory, this should avoid the need for us to manually fix the output of the compiler for the same assembly to be usable by both. @lundman What do you think? If people are not happy with that idea, another possibility would be to tell LLVM/Clang to output a Windows version for Windows systems.
I've not read the PR fully; initial thoughts are generally: At the moment, NT produced assembly files are useless to me, since I do everything with clang. The only thing we needed to change was the calling convention (registers used). If I had a vote, I'd prefer a solution like aarch64/asm_linkage.h, then the assembly files using ENTRY(), SETSIZE etc. If I have no vote, just do what's good for you guys, and I'll port around it later. .... But I have no Windows/arm port at all yet, and M1/ARM64 has no neon.h so I've been unable to do anything neon there.
In an ideal world, we would not be using assembly at all here and would just compile the .c file directly. However, GCC generates really bad code, so we need to cache the .s file from Clang to avoid bad code generation when using GCC. Unfortunately, implementing that workaround now means that we have a .s file that is not compliant with the coding standard used for the existing .s files in the tree to make them portable. However, since this .s file is actually compiled from C code, rather than editing it by hand to make it compliant with that coding standard, we could just compile the .c file with Clang directly on the platforms that cannot use the cached .s file. My question was whether you are okay with making that exception to the rule. I feel like this should not cause any work for you. The only thing you would need to do differently is have cmake grab the .c file instead of the .s file.
Ah, is that what you are doing? Yes, if the .S is just the clang-produced output, then using the C file on clang platforms makes perfect sense.
The initial plan is to just reuse the cached .s file even with clang to keep things simple and let cmake use the .c file on Windows. I suspect that MacOS will need the autotools based build system changed to use the .c file directly when using any compiler, but I would prefer to leave that for a follow-up patch (that I am willing to volunteer to write). If that is alright with you, then we can merge this.
Optimize the neon implementation with stride 4 and use more NEON registers. The optimization follows the superscalar4 algorithm, so that in each cycle there are two pipelines concurrently doing add operations.
Motivation and Context
The original neon fletcher4 algorithm does one add operation after another, with the adds running one by one, which cannot achieve the best performance. Using more NEON registers and optimizing the algorithm can increase the fletcher4 performance.
Description
Optimize the fletcher4 neon implementation
How Has This Been Tested?
The new optimized neon implementation achieves an 18% improvement for native and 12% for byteswap.
The original neon:
Types of changes
Checklist:
Signed-off-by.