
Optimize the fletcher4 neon implementation #14219

Open
wants to merge 1 commit into base: master

Conversation

@kevinzs2048 commented Nov 23, 2022

Optimize the NEON implementation with a stride of 4 and use more NEON registers. The optimization follows the superscalar4 algorithm, so that in each cycle two pipelines are concurrently executing add operations.

Motivation and Context

The original NEON fletcher4 implementation executes its add operations one after another in each cycle, which cannot reach the best performance. Using more NEON registers and restructuring the algorithm increases fletcher4 throughput, as sketched below.
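For illustration, here is a minimal plain-C sketch of the stride-4 accumulation scheme borrowed from superscalar4 (the function name and signature are hypothetical, not taken from the patch). Each of the four (a, b, c, d) streams handles every fourth 32-bit input word, so the additions of different streams have no dependency on each other and can issue on separate pipelines:

#include <stddef.h>
#include <stdint.h>

static void
fletcher4_4stream_sketch(const uint32_t *ip, size_t nwords,
    uint64_t a[4], uint64_t b[4], uint64_t c[4], uint64_t d[4])
{
        size_t i;
        int s;

        /* Consume four 32-bit words per iteration, one per stream. */
        for (i = 0; i + 4 <= nwords; i += 4) {
                for (s = 0; s < 4; s++) {
                        a[s] += ip[i + s];
                        b[s] += a[s];
                        c[s] += b[s];
                        d[s] += c[s];
                }
        }
        /*
         * The four partial (a, b, c, d) streams are later recombined into
         * the final fletcher4 sums with a fixed mixing matrix in the
         * _fini step, which is omitted here.
         */
}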

Description

Optimize the fletcher4 neon implementation

How Has This Been Tested?

The new NEON optimization achieves an 18% improvement for native and 12% for byteswap.

    The original neon:

    0 0 0x01 -1 0 62475043711240 62475677462040
    implementation   native         byteswap
    scalar           4465160644     4037798914
    superscalar      6233546354     5449837582
    superscalar4     7310461184     5828498374
    aarch64_neon     8042460500     7427753772
    fastest          aarch64_neon   aarch64_neon

    The optimized neon:
    0 0 0x01 -1 0 69056393017400 69057006899240
    implementation   native         byteswap
    scalar           4464210145     4048399675
    superscalar      6263333631     5455224618
    superscalar4     7309441985     5799164892
    aarch64_neon     9550742326     8365852880
    fastest          aarch64_neon   aarch64_neon

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@kevinzs2048 force-pushed the fletcher4-neon branch 4 times, most recently from 2507c0e to 10c0591, on November 23, 2022 08:48
@rincebrain

On which platform are those results?

I worry that this might be better on some platforms and worse on others.

@kevinzs2048 commented Nov 25, 2022

Hi @rincebrain, thanks for the comment.

The implementation just uses two add pipelines concurrently, so that in each cycle the Arm64 CPU can execute the add operations out of order and accelerate the loop.

I got the result above on an Arm Neoverse N1 machine (Ampere Altra, 160 cores, 512 GB RAM, NVMe).

Besides, I have tested on two other platforms and seen the same kind of improvement:

Armv8.2 8C/16G VM (Huawei Cloud, should be Kunpeng 920):

The Original Neon:
0 0 0x01 -1 0 2111075643267260 2111076173754880
implementation   native         byteswap
scalar           3512582071     3112101740
superscalar      4908144542     4314861222
superscalar4     5508926730     4566048868
aarch64_neon     4642335831     4078792593
fastest          superscalar4   superscalar4

The Optimized Neon:
0 0 0x01 -1 0 2111006840726770 2111007375065550
implementation   native         byteswap
scalar           3478960203     3146632656
superscalar      4967788700     4179133744
superscalar4     5419802683     4595137877
aarch64_neon     5204141670     4372505460
fastest          superscalar4   superscalar4

On Alibaba Cloud, Yitian 710 (Armv9), 2C/8G, NVMe:

The original Neon:
0 0 0x01 -1 0 1090109187580 1090527040620
implementation   native         byteswap
scalar           4750437747     4257269008
superscalar      6172451141     5765366323
superscalar4     8332116251     6786899676
aarch64_neon     7559075098     7612305123
fastest          superscalar4   aarch64_neon

The optimized Neon:
0 0 0x01 -1 0 954157419780 954575079600
implementation   native         byteswap
scalar           4725923088     4239282790
superscalar      6125396500     5788200874
superscalar4     8318235725     6720994776
aarch64_neon     8767235685     7813240937
fastest          aarch64_neon   aarch64_neon

@KungFuJesus

It's a little bit surprising that the superscalar implementations are winning on that many armv8 implementations. Does the NEON pipeline on these things only have 1 execution port or something?

On x86 I think it's universally faster to use SIMD for the checksums.

@@ -88,7 +88,7 @@ typedef struct zfs_fletcher_avx512 {
} zfs_fletcher_avx512_t;

typedef struct zfs_fletcher_aarch64_neon {
uint64_t v[2] __attribute__((aligned(16)));
Contributor:

Why do we drop the aligned attribute?

Contributor:

I have a patch doing that pending anyway, because we're lying when we claim that.

Contributor:

Good point.

Author:

Not quite sure here; since I rewrote the NEON fletcher4 following superscalar4, I needed to align with the superscalar4 data structure.

@ryao commented Nov 25, 2022

It's a little bit surprising that the superscalar implementations are winning on that many armv8 implementations. Does the NEON pipeline on these things only have 1 execution port or something?

One possibility is that the SIMD hardware in these ARM processors is emulating SIMD via scalar operations rather than actually doing SIMD in hardware. Is there any public documentation on how the NEON instructions in those processor cores work?

@ryao commented Nov 25, 2022

One possibility is that the SIMD hardware in these ARM processors is emulating SIMD via scalar operations rather than actually doing SIMD in hardware. Is there any public documentation on how the NEON instructions in those processor cores work?

Thinking about this some more, it would make sense that custom processors used by hyperscalers would emulate NEON instructions in scalar operations. Emulation would allow the die area to be reused for other things like more cores or more cache, or simply reduced for better yields. If I were designing a processor for a hyperscaler, emulating the NEON instructions seems like it would be a fairly straightforward decision, especially since most of their customers are moving SIMD workloads to GPGPU code these days.

@kevinzs2048

It's a little bit surprising that the superscalar implementations are winning on that many armv8 implementations. Does the NEON pipeline on these things only have 1 execution port or something?

One possibility is that the SIMD hardware in these ARM processors is emulating SIMD via scalar operations rather than actually doing SIMD in hardware. Is there any public documentation on how the NEON instructions in those processor cores work?

Usually the NEON implementation is better than the superscalar ones; according to the tests, on most CPUs the NEON version wins. But on some particular CPUs the NEON instruction latency is much higher than that of ordinary CPU instructions; that is a hardware problem and should be fixed in future silicon revisions.

So generally the NEON one is better.

@kevinzs2048

Yes, another possible reason is the one @ryao referred to: the tests on the hyperscalers were run in VMs, while the better NEON results I got are from a bare-metal machine.
I will also do some tests on new bare-metal machines and post the results here.

@ryao commented Nov 26, 2022

Yes, another possible reason is the one @ryao referred to: the tests on the hyperscalers were run in VMs, while the better NEON results I got are from a bare-metal machine. I will also do some tests on new bare-metal machines and post the results here.

If it is not too much trouble, it would be interesting to see benchmark numbers from an Apple Silicon ARM machine. Their hardware definitely has a performant NEON implementation, since Rosetta 2 translates SSE2 instructions into NEON instructions:

https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/

You would want to use Asahi Linux for that:

https://asahilinux.org/

@KungFuJesus commented Nov 26, 2022

At least from my experience with it on zlib-ng with the adler checksum, native NEON is for sure faster on the M1 and many SIPs based on the Cortex family.

A possible suggestion based on work @dougallj has done on the zlib side of things: has anyone implemented Fletcher4 with umlal2? It'd require 2 instructions for both halves, but it would both widen and accumulate back to a, b, c, & d.

You'd have to do runtime detection to see if it's supported but I suspect it might be faster.

Edit: err, perhaps this was a poor suggestion, given that the vectorized Fletcher implementations don't actually apply the mixing matrix until the finalization of the checksum, and can instead handle things with straight additions, without needing to handle potential overflow at each sum with a rebase the way adler does.

asm("ld1 { %[AA2].4s, %[A3A4].4s }, %[CTX0]\n" \
"ld1 { %[BB2].4s, %[B3B4].4s }, %[CTX1]\n" \
"ld1 { %[CC2].4s, %[C3C4].4s }, %[CTX2]\n" \
"ld1 { %[DD2].2d, %[D3D4].2d }, %[CTX3]\n" \
Contributor:

It is a pre-existing issue, but this likely is broken on big endian:

https://gist.github.com/lirenlin/8647388

ldr might be a better choice here. The two probably perform the same.

Author:

@ryao Sorry for the late response here. The ld1-ld4 instructions are dedicated to NEON registers, and it is easy to use ld1-ld4 to load up to four registers at once. ldr, by contrast, is a simpler instruction that operates on just one register [1]. So ld1-ld4 can actually save some CPU cycles when loading the data from memory.

In my opinion, we don't need to worry about the big-endian issue. The link you referred to shows the problem with ld1 on big endian. Arm64 supports both little endian and big endian, but the endianness choice is IMPLEMENTATION DEFINED [2], and in practice there is no real big-endian Arm64 hardware. The lack of use cases and the gaps in upper-layer software support hold back Arm64BE; for example, there is no Go support for Arm64BE [3].

  1. https://developer.arm.com/documentation/102159/0400/Load-and-store---data-structures
  2. https://developer.arm.com/documentation/102376/0100/Alignment-and-endianness
  3. https://stackoverflow.com/questions/69174919/can-go-build-support-armbe-arm64be

for (; ip < ipend; ip += 2) {
NEON_MAIN_LOOP(NEON_DONT_REVERSE);
for (; ip < ipend; ip += 4) {
NEON_MAIN_LOOP_LOAD(NEON_DO_NOT_REVERSE);
@ryao commented Nov 27, 2022

It might be helpful if we had the assembly/disassembly from this loop to study. For example, Clang generates the following for a version of the fletcher4 code that I will probably put into a PR soon:

.LBB0_2:                                // =>This Inner Loop Header: Depth=1
        ldr     q16, [x1], #16
        uaddw   v4.2d, v4.2d, v16.2s
        uaddw2  v0.2d, v0.2d, v16.4s
        add     v1.2d, v0.2d, v1.2d
        add     v5.2d, v4.2d, v5.2d
        add     v6.2d, v5.2d, v6.2d
        add     v2.2d, v1.2d, v2.2d
        add     v3.2d, v2.2d, v3.2d
        add     v7.2d, v6.2d, v7.2d
        cmp     x1, x8
        b.lo    .LBB0_2

I am not an expert on ARM assembly, but this looks as tight as possible. It is unfortunate that GCC generates much worse assembly from the same code that Clang compiled to generate this. The NEON_MAIN_LOOP_LOAD code looks similar to this, although I wonder what GCC generated around it.

Author:

Nice point, I will rewrite the asm here to follow this pattern.

@ryao commented Nov 29, 2022

Here is a sandbox you can use to see the assembly generated by clang:

https://gcc.godbolt.org/z/aP9ff3jPh

I realize that the x86-64 “version” of clang was used, but the beauty of LLVM is that one binary can support multiple architectures by setting the target. Since that site does not have a clang explicitly for aarch64/arm, I set a target as a compiler flag instead.

You can grab the byteswap version from #14234 and put it there to see what is generated for the byteswap case. Make sure to drop the static keyword when you do or else you will not get any assembly since the compiler will think it is dead code.

If you are compiling the code for arm/aarch64 locally, you can use objdump -d /path/to/zfs_fletcher_aarch64_neon.o to see the assembly that gcc generated. I suggest building with --enable-debuginfo if you do, so that the static symbols will be marked properly in the disassembly. That makes it fairly simple to identify the main loop. You just need to look for the basic block that can jump to its own start inside the function’s disassembly.

NEON_MAIN_LOOP(NEON_DONT_REVERSE);
for (; ip < ipend; ip += 4) {
NEON_MAIN_LOOP_LOAD(NEON_DO_NOT_REVERSE);
PREF1KL1(ip, 1024)
Contributor:

Have we confirmed that prefetch helps here? It is possible for prefetch instructions to be harmful. In this case, on the last loop iteration, this instruction will have prefetched 1008 bytes into L1 cache that are unlikely to ever be used, which means that we are filling the L1 cache with garbage that we are unlikely to use. This might not show a performance issue in the benchmark, but it could slow down the system elsewhere.

In addition, Linux mainline has had various less than flattering things to say about explicit prefetch instructions over the years:

https://lwn.net/Articles/404103/
https://lwn.net/Articles/444344/
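If the prefetch is kept, a bounds check would at least avoid the stale tail prefetches described above. A minimal sketch using the GCC/Clang __builtin_prefetch builtin (the helper name and the 1 KiB distance are illustrative, not the PR's PREF1KL1 macro):

#include <stdint.h>

/*
 * Prefetch 1 KiB ahead of the current position, but skip the prefetch
 * once the target would fall past the end of the buffer, so the tail of
 * the loop does not pull soon-to-be-unused lines into L1.
 */
static inline void
prefetch_ahead_sketch(const uint32_t *ip, const uint32_t *ipend)
{
        if ((const char *)ip + 1024 < (const char *)ipend)
                __builtin_prefetch((const char *)ip + 1024, 0, 3);
}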

Author:

@ryao Thanks for pointing this out. I ran some tests on the following platforms and only get a performance increase from the prefetch on the Kunpeng 920 VM. For the Ampere Altra and AWS Graviton 3, the L1 prefetch does not give any improvement. So I agree with you that we can remove the prefetch code here.

Ampere Altra:
With L1 Prefetch
0 0 0x01 -1 0 20253112240840 20293116392520
implementation   native         byteswap
scalar           4465636046     3986526251
superscalar      6288311844     5471735330
superscalar4     7267898111     5856494177
aarch64_neon     9549582587     7824610103
fastest          aarch64_neon   aarch64_neon

Without
0 0 0x01 -1 0 20361322942120 20362454024320
implementation   native         byteswap
scalar           4441989324     4047774560
superscalar      6273451195     5428114404
superscalar4     7273948180     5848654377
aarch64_neon     9448333032     7895309088
fastest          aarch64_neon   aarch64_neon


Kunpeng 920 VM
With Pre-L1
0 0 0x01 -1 0 28699859899380 28700387328790
implementation   native         byteswap
scalar           3505946470     3087293256
superscalar      4915305602     4253383496
superscalar4     5473091929     4556896233
aarch64_neon     5199690074     4318504180
fastest          superscalar4   superscalar4

without Prefetch L1
0 0 0x01 -1 0 28592205443970 28592735589200
implementation   native         byteswap
scalar           3485493950     3145971812
superscalar      4925580275     4170573437
superscalar4     5476808168     4574764270
aarch64_neon     4804224295     4147109889
fastest          superscalar4   superscalar4

@rdolbeau

If it is not too much trouble, it would be interesting to see benchmark numbers from an Apple Silicon ARM machine.

It would also be interesting to check on AWS Graviton 3, with both the original and updated NEON versions, as the added parallelism should shine on the Neoverse V1 core: it has 4 full (128-bit) NEON pipelines. At some point, someone will have to tackle the SVE version of the Fletcher & RAID-Z code, if only to see whether that's better than NEON on V1 and A64FX. It's on my TODO list but I haven't found the time yet.

@kevinzs2048

Hi @rincebrain,

I'm also interested in SVE support for RAID-Z and Fletcher4. We have A64FX hardware in the lab, and I plan to work on SVE support after finishing this NEON one.

And below are the test results for AWS Graviton 3, 4C/8G:

Original Neon:
0 0 0x01 -1 0 2606383072134 2606800227344
implementation   native         byteswap
scalar           5391857009     4261642208
superscalar      6389451010     5835082525
superscalar4     7949629461     6393385748
aarch64_neon     8476800633     8288433986
fastest          aarch64_neon   aarch64_neon

The Optimized Neon:
0 0 0x01 -1 0 2665497513653 2665914434828
implementation   native         byteswap
scalar           5429952028     4254408151
superscalar      6367383868     5858813389
superscalar4     7946715017     6400527387
aarch64_neon     11161079730    10121251559
fastest          aarch64_neon   aarch64_neon

@rdolbeau

I'm also interested in SVE support for RAID-Z and Fletcher4. We have A64FX hardware in the lab, and I plan to work on SVE support after finishing this NEON one.

Beware that A64FX has 512-bit SVE while Graviton 3 (Neoverse V1) has 256-bit. The code needs to be scalable, or to adapt to the machine. I haven't looked at Fletcher recently, but IIRC the algorithm is width-dependent (well, dependent on the number of parallel computations). Also, I don't know about integer, but for FP the SVE latencies are quite high on A64FX; it's a throughput machine, not a latency one.
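A width-agnostic SVE version would have to discover the vector length at run time. A minimal sketch assuming the ACLE intrinsics from <arm_sve.h> (the helper name is hypothetical):

#include <arm_sve.h>
#include <stdint.h>

/*
 * Number of 64-bit accumulator lanes one SVE vector provides on the
 * running machine: 2 on a 128-bit implementation, 4 on the 256-bit
 * Neoverse V1, 8 on the 512-bit A64FX.  A scalable fletcher4 would have
 * to derive its stream count (and its recombination step) from this
 * value instead of hard-coding it.
 */
static inline uint64_t
fletcher4_sve_lanes(void)
{
        return (svcntd());
}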

And below are the test results for AWS Graviton 3, 4C/8G:

Thanks! As expected/hoped, NEON is quite good (vs. superscalar4) and the unrolled version really shines, with a >30% improvement on native. The V1 does benefit from those four 128-bit pipelines on this code.

@ryao commented Nov 29, 2022

I haven't looked a Fletcher recently, but IIRC the algorithm is width-dependent (well, number of parallel computations-dependent).

It certainly is. I also just discovered that Zen 3 is capable of doing multiple vector additions in parallel, which made widening the number of parallel additions a performance win:

#14234 (comment)

This effect likely also exists on ARM processors, if any of them are capable of doing multiple vector additions in parallel, i.e., superscalar SIMD.

@behlendorf added the "Status: Code Review Needed (Ready for review and testing)" label on Nov 29, 2022
@ryao added the "Status: Work in Progress (Not yet ready for general review)" label and removed the "Status: Code Review Needed (Ready for review and testing)" label on Nov 29, 2022
@dougallj

This effect likely also exists on ARM processors, if any of them are capable of doing multiple vector additions in parallel, i.e., superscalar SIMD.

Yeah, generally this is possible to varying extents on all but the smallest ARM cores. For example, the widest cores like the Cortex-X2 and Apple M1 can run four SIMD operations per cycle. Your code from above is pretty much perfect for them:

.LBB0_2:
        ldr     q16, [x1], #16
        uaddw   v4.2d, v4.2d, v16.2s
        uaddw2  v0.2d, v0.2d, v16.4s
        add     v1.2d, v0.2d, v1.2d
        add     v5.2d, v4.2d, v5.2d
        add     v6.2d, v5.2d, v6.2d
        add     v2.2d, v1.2d, v2.2d
        add     v3.2d, v2.2d, v3.2d
        add     v7.2d, v6.2d, v7.2d
        cmp     x1, x8
        b.lo    .LBB0_2

The add/uaddw[2] operations give a critical-path latency of two cycles, so you'd want at least eight SIMD operations to take full advantage of the core, and that's exactly how many you have :)

@ryao commented Dec 1, 2022

In theory, #14234 should now support aarch64. It does 8 parallel accumulator streams. GCC 12.2 generates the following for the native case:

        ldp     q21, q20, [x1]
        add     x1, x1, 32
        uaddw   v19.2d, v19.2d, v21.2s
        uaddw2  v18.2d, v18.2d, v21.4s
        uaddw   v17.2d, v17.2d, v20.2s
        uaddw2  v16.2d, v16.2d, v20.4s
        add     v7.2d, v7.2d, v19.2d
        add     v6.2d, v6.2d, v18.2d
        add     v5.2d, v5.2d, v17.2d
        add     v4.2d, v4.2d, v16.2d
        add     v3.2d, v3.2d, v7.2d
        add     v2.2d, v2.2d, v6.2d
        add     v1.2d, v1.2d, v5.2d
        add     v0.2d, v0.2d, v4.2d
        add     v25.2d, v25.2d, v3.2d
        add     v24.2d, v24.2d, v2.2d
        add     v23.2d, v23.2d, v1.2d
        add     v22.2d, v22.2d, v0.2d
        cmp     x2, x1
        bhi     .L3

For the byteswap case, GCC 12.2 generates:

.L8:
        ldp     q1, q0, [x1]
        add     x1, x1, 32
        rev32   v1.16b, v1.16b
        rev32   v0.16b, v0.16b
        uaddw   v21.2d, v21.2d, v1.2s
        uaddw2  v20.2d, v20.2d, v1.4s
        uaddw   v19.2d, v19.2d, v0.2s
        uaddw2  v18.2d, v18.2d, v0.4s
        add     v17.2d, v17.2d, v21.2d
        add     v16.2d, v16.2d, v20.2d
        add     v7.2d, v7.2d, v19.2d
        add     v6.2d, v6.2d, v18.2d
        add     v5.2d, v5.2d, v17.2d
        add     v4.2d, v4.2d, v16.2d
        add     v3.2d, v3.2d, v7.2d
        add     v2.2d, v2.2d, v6.2d
        add     v25.2d, v25.2d, v5.2d
        add     v24.2d, v24.2d, v4.2d
        add     v23.2d, v23.2d, v3.2d
        add     v22.2d, v22.2d, v2.2d
        cmp     x2, x1
        bhi     .L8

It has 1 more instruction than Clang's output, but it is close to optimal for 8 parallel accumulator streams. That might be overkill for the hardware, but I guess it would be good for future hardware. You can see the full disassembly for the important things from various compilers at godbolt:

https://gcc.godbolt.org/z/Kfja8Tzxq

In particular, GCC versions older than GCC 12 generate very suboptimal assembly.

@kevinzs2048

@ryao, thanks a lot! The GCC 12-generated assembly looks quite good for boosting NEON performance. I will follow this approach to rewrite the code.

@ryao commented Dec 1, 2022

@ryao, thanks a lot! The GCC 12-generated assembly looks quite good for boosting NEON performance. I will follow this approach to rewrite the code.

I would suggest staying with the 4 accumulator version here, but optimizing it based on the clang output that I provided earlier. If both PRs are merged and you switch to the 8 accumulator version, we will end up with two implementations that are basically the same (when GCC 12 or later, or Clang is used), which would mean one should be dropped in favor of another. Using different accumulator counts means that they can co-exist.

Also, more accumulator streams is not necessarily better. When the key loop is already optimal, adding more accumulator streams will not speed it up, but it will slow down the recombination needed to be done at the end, which becomes increasingly complex as the accumulator stream count increases.

Having implementations for both the 4 stream and 8 stream levels is likely ideal. I am hopeful that we could use generic GNU C to move away from hard coding assembly. #14234 serves as a proof of concept for that. The current results are good enough that it should be able to be merged based on them. At the moment, that PR needs people to test it to verify it has good performance on both aarch64 and ppc64, plus that it works correctly. I expect it does based on theory, but theory and practice are not always in agreement.

That being said, the review process of this PR was what inspired me to do #14234. I have been very surprised by what I have found as a result of that. I remember thinking about doing vector operations in generic C years ago to support multiple architectures, but compiler technology was not good enough for it at the time. I had forgotten about the idea until @rincebrain suggested it to me independently when I mentioned that I was looking into this. Things are not only different now, but on modern hardware, the result beats the work by Intel in 2015 that I had considered to be the best possible from AVX2, which has amazed me. :)

@ryao commented Dec 1, 2022

@kevinzs2048 The optimal assembly for the 4 accumulator version for aarch64 is likely this:

https://gcc.godbolt.org/z/Gh71o5vPb

Note that I changed the for loop to a do-while loop. This eliminates an unnecessary check to see if the loop will iterate, but it always will unless someone gave us a 0 byte buffer, so we can just assume that.

In theory, you could just save that output as module/zcommon/zfs_fletcher_aarch64_neon_impl.s, conditionally add it to the make file for aarch64 (so that it is assembled and linked) and then call into it from C in module/zcommon/zfs_fletcher_aarch64_neon.c like this:

extern void
fletcher_4_aarch64_neon_native_impl(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size);

extern void
fletcher_4_aarch64_neon_byteswap_impl(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size);

ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_aarch64_neon_native(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size)
{
        kfpu_begin();
        fletcher_4_aarch64_neon_native_impl(ctx, buf, size);
        kfpu_end();
}

ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_aarch64_neon_byteswap(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size)
{
        kfpu_begin();
        fletcher_4_aarch64_neon_byteswap_impl(ctx, buf, size);
        kfpu_end();
}

That should give us maximum performance on typical aarch64 processors.

We would want to make sure that this is disabled on big endian builds.
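A sketch of the kind of guard that implies (the macro name is assumed, not taken from the PR): only advertise the implementation on little-endian aarch64, since the cached assembly is generated for little endian.

#if defined(__aarch64__) && defined(__BYTE_ORDER__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define HAVE_FLETCHER_4_AARCH64_NEON_ASM 1
#endif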

@kevinzs2048 commented Dec 2, 2022


Hi @ryao:
Thanks for the detailed advice, really appreciated! I'm working on it now and will update the patch.

@ryao commented Dec 11, 2022

There was a small mistake in the byteswap case for the prefetch code I provided where tend was not initialized to ip:

https://gcc.godbolt.org/z/5YM7K3n6d

That said, you seem to have already realized it:

kevinzs2048@2986024

I thought that I had made the same changes to both, but missed that. Sorry for the trouble that caused.

@kevinzs2048 commented Dec 12, 2022

I think we could split this into two implementations. The current no prefetch version will be renamed to neon from aarch64_neon and given the upgraded no prefetch assembly. This is more consistent with how things are on other platforms. The prefetch version would then be named neon_prefetch and be a second patch. The difference on the Kunpeng 920 should meet the threshold for another implementation being worthwhile.

That said, would you check if this version improves the byteswap case on the altra?

https://gcc.godbolt.org/z/Pefexjj48

If the cores work the way I think they work, it should make the byteswap case perform identically to the native case.

@ryao I agree with you, will submit another patch for aarch64_neon_prefetch to address this.

For the performance, here are results for the three variants:

The above link's code
0 0 0x01 -1 0 1444940378440 1445635521280
implementation   native         byteswap
scalar           4464590296     4042936458
superscalar      6250453028     5416338231
superscalar4     7284307051     5885751171
aarch64_neon     11673759602    8098362681
fastest          aarch64_neon   aarch64_neon

Prefetch L1:
0 0 0x01 -1 0 1531826195440 1532457802000
implementation   native         byteswap
scalar           4466492023     4002504007
superscalar      6208816650     5434444156
superscalar4     7304350249     5882614305
aarch64_neon     11162584720    9337275155
fastest          aarch64_neon   aarch64_neon

The original commit:

0 0 0x01 -1 0 1567011165720 1567651130720
implementation   native         byteswap
scalar           4464115117     4065035859
superscalar      6230953442     5433036269
superscalar4     7304859103     5830929210
aarch64_neon     11757093735    9937539093
fastest          aarch64_neon   aarch64_neon

@kevinzs2048

There was a small mistake in the byteswap case for the prefetch code I provided where tend was not initialized to ip:

https://gcc.godbolt.org/z/5YM7K3n6d

That said, you seem to have already realized it:

kevinzs2048@2986024

I thought that I had made the same changes to both, but missed that. Sorry for the trouble that caused.

No worries, after the core dump I found that the register is not right. :-)

@ryao commented Dec 12, 2022

For the performance, here are results for the three variants:

The above link's code
0 0 0x01 -1 0 1444940378440 1445635521280
implementation   native         byteswap
scalar           4464590296     4042936458
superscalar      6250453028     5416338231
superscalar4     7284307051     5885751171
aarch64_neon     11673759602    8098362681
fastest          aarch64_neon   aarch64_neon

The original commit:

0 0 0x01 -1 0 1567011165720 1567651130720
implementation   native         byteswap
scalar           4464115117     4065035859
superscalar      6230953442     5433036269
superscalar4     7304859103     5830929210
aarch64_neon     11757093735    9937539093
fastest          aarch64_neon   aarch64_neon

That means that the core design does not work the way I thought it did. I guess the previous version is the best we can do there for the byteswap case.

@kevinzs2048

@ryao Please help review again. There are some jobs failing; it looks like the CI jobs for the master commit also failed: #14287

@ryao commented Dec 14, 2022

@kevinzs2048 The buildbot is sometimes flaky. That is not the result of anything you did here.

That being said, there are a few problems with the current version that will cause it to generate incorrect checksums. This is easy to verify by creating a pool with an unpatched driver and then trying to use it with the patched driver:

# Make test pool
truncate -s 1G /tmp/zfs
sudo zpool create tank /tmp/zfs

# This loads the new modules
sudo ./scripts/zfs.sh -v -r

# Import old pool and encounter failures, assuming that the neon implementation performs best.
sudo zpool import -d /tmp tank

The definition of zfs_fletcher_aarch64_neon_t needs to be changed from having v[2] to v[4] because this code switches to 4 accumulator streams from 2. The number of additions confusingly remains the same because the 2 accumulator version has been unrolled to do 2 iterations per loop. I mentioned that here:

#14219 (comment)

We avoided writing beyond our heap allocation only because zfs_fletcher_aarch64_neon_t is part of a union type with zfs_fletcher_superscalar_t. Had it not been for that, the benchmark would have been writing beyond the end of the memory allocation.
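For reference, a sketch of the context change being described here, keeping the field layout of the existing header (whether to keep the alignment attribute is discussed separately above):

typedef struct zfs_fletcher_aarch64_neon {
        uint64_t v[4] __attribute__((aligned(16)));
} zfs_fletcher_aarch64_neon_t;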

The 2 stream mixing matrix in fletcher_4_aarch64_neon_fini() needs to be replaced with the 4 stream mixing matrix from module/zcommon/zfs_fletcher_intel.c. When that is done, avx in it should be changed to aarch64_neon (or even better, we can rename aarch64_neon to just neon in fletcher_4_ctx_t to be consistent with everything else in that union). I mentioned the need to adopt the 4 stream mixing matrix here:

#14219 (review)

In hindsight, I should have explained where to find the 4 stream mixing matrix.

Also, after you change the mixing matrix, the indentation will need to be fixed according to what ./scripts/cstyle.pl module/zcommon/zfs_fletcher_aarch64_neon.c says.

@kevinzs2048 force-pushed the fletcher4-neon branch 4 times, most recently from a60a7f3 to 634f7df, on December 19, 2022 08:09
Optimize the neon implementation based on the code generated by GCC 12
or Clang for the approach in openzfs#14234.

Due to compiler limitations, GCC versions before 12 produce poorly
optimized code for aarch64, so the asm function is used here to get
the full performance.

Performance Data for Arm64 N1 Server Ampere Altra, 160C, 512G
The original neon implementation:
0 0 0x01 -1 0 62475043711240 62475677462040
implementation   native         byteswap
scalar           4465160644     4037798914
superscalar      6233546354     5449837582
superscalar4     7310461184     5828498374
aarch64_neon     8042460500     7427753772
fastest          aarch64_neon   aarch64_neon

The Optimized Neon:
0 0 0x01 -1 0 582868704627960 582869571278240
implementation   native         byteswap
scalar           4466111548     4006174065
superscalar      6217652465     5435852773
superscalar4     7305877024     5884760221
aarch64_neon     11783959542    10005496183
fastest          aarch64_neon   aarch64_neon

Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
Co-authored-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
@ryao left a comment

This looks good to me.

@ryao added the "Status: Code Review Needed (Ready for review and testing)" label and removed the "Status: Work in Progress (Not yet ready for general review)" label on Jan 8, 2023
@kevinzs2048

@behlendorf Could you help review this? Thanks! @ryao has already left a comment here.

@ryao commented Jan 11, 2023

He just got back from a break and has a backlog of PRs to review. Marking this PR with the code review needed flag put it on a list of PRs that should have his attention soon.

The last couple weeks are a slow time of year for getting things merged since many project members are on vacation.

@kevinzs2048

@ryao Thanks for the info, no hurry; I will wait.


void
fletcher_4_aarch64_neon_native(
fletcher_4_ctx_t *ctx, const void *buf, uint64_t size)
Contributor:

nit: extra tab


@@ -0,0 +1,121 @@
/*
* GNU C source file that we manually compile with Clang for aarch64.
* The compiled assembly goes into:
Contributor:

Can you extend this comment to show the exact steps to generate module/zcommon/zfs_fletcher_aarch64_neon_impl.S?

Contributor:

clang -O2 --target=aarch64 -S -o module/zcommon/zfs_fletcher_aarch64_neon_impl.S contrib/zcommon/zfs_fletcher_aarch64_neon_impl.c should do it.

@behlendorf left a comment

Thanks for the contribution! This LGTM based on your test results, though I wasn't able to test it myself. It would have been nice to be able to test this on a wider range of hardware, but I think we can still move forward with it.

@ryao commented Jan 13, 2023

Thanks for the contribution! This LGTM based on your test results, though I wasn't able to test it myself. It would have been nice to be able to test this on a wider range of hardware, but I think we can still move forward with it.

This makes a few fundamental improvements over the existing code.

  • It switches from the 2 stream 2 loop iteration unrolled version to the 4 stream version.
  • The load instruction and loop calculations prior to the cmp instruction have been combined.
  • Widening additions have been used to avoid the need for explicit widening from 32-bit to 64-bit via the zip instructions.

The result should perform better than the existing code on all aarch64 CPUs where we are not already memory bandwidth limited.
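For readers following the C side rather than the assembly, here is an intrinsics sketch of the widening-add pattern in the new inner loop (uaddw/uaddw2 followed by plain 64-bit adds). It assumes <arm_neon.h> and a hypothetical helper name; the PR itself ships Clang-generated assembly rather than intrinsics:

#include <arm_neon.h>
#include <stdint.h>

/*
 * One step over four 32-bit input words.  The 4 accumulator streams are
 * split across eight 64x2 registers: acc[0..1] hold the low/high halves
 * of a, acc[2..3] of b, acc[4..5] of c, acc[6..7] of d.
 */
static inline void
fletcher4_neon_step_sketch(const uint32_t *ip, uint64x2_t acc[8])
{
        uint32x4_t w = vld1q_u32(ip);

        acc[0] = vaddw_u32(acc[0], vget_low_u32(w));  /* a += w (widening, low)  */
        acc[1] = vaddw_high_u32(acc[1], w);           /* a += w (widening, high) */
        acc[2] = vaddq_u64(acc[2], acc[0]);           /* b += a */
        acc[3] = vaddq_u64(acc[3], acc[1]);
        acc[4] = vaddq_u64(acc[4], acc[2]);           /* c += b */
        acc[5] = vaddq_u64(acc[5], acc[3]);
        acc[6] = vaddq_u64(acc[6], acc[4]);           /* d += c */
        acc[7] = vaddq_u64(acc[7], acc[5]);
}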

@behlendorf

I've gone ahead and pulled in #14228 to master. If you could rebase this PR, make any needed updates, and address those two minor outstanding comments, this should be good to go. Thanks!

@ryao commented Jan 19, 2023

@behlendorf It occurs to me that we might be able to treat this in a special way considering that this file is machine generated by LLVM/Clang and Windows support is being developed exclusively around LLVM/Clang. What I am thinking is:

  • On UNIX systems, we use the cached assembly.

  • On NT systems, we compile the GNU C code directly, with appropriate compiler flags so that neon instructions are emitted.

In theory, this should avoid the need for us to manually fix the output of the compiler for the same assembly to be usable by both.

@lundman What do you think?

If people are not happy with that idea, another possibility would be to tell LLVM/Clang to output a Windows version for Windows systems.

@lundman commented Feb 27, 2023

I've not read the PR fully, initial thoughts are generally:

At the moment, NT produced assembly files are useless to me, since I do everything with clang. The only thing we needed to change was the calling convention (registers used).

If I had a vote, I'd prefer a solution like aarch64/asm_linkage.h, with the assembly files using ENTRY(), SETSIZE, etc.
Similar to M1/ARM64 blake3 I did over here:

openzfsonosx@51a4ee6

If I have no vote, just do what's good for you guys, and I'll port around it later.

....

But I have no Windows/arm port at all yet, and M1/ARM64 has no neon.h, so I've been unable to do anything NEON there.

@ryao commented Feb 28, 2023


In an ideal world, we would not be using assembly at all here and would just compile the .c file directly. However, GCC generates really bad code, so we need to cache the .s file from Clang to avoid bad code generation when using GCC. Unfortunately, implementing that workaround now means that we have a .s file that is not compliant with the coding standard used for the existing .s files in the tree to make them portable. However, since this .s file is actually compiled from C code, rather than editing it by hand to make it compliant with that coding standard, we could just compile the .c file with Clang directly on the platforms that cannot use the cached .s file.

My question was if you are okay with making that exception to the rule. I feel like this should not cause any work for you. The only thing you would need to do differently is have cmake grab the .c file instead of the .s file.

@lundman commented Feb 28, 2023

Ah, is that what you are doing? Yes, if the .S is just the clang-produced output, then using the C file on clang platforms makes perfect sense.

@ryao commented Feb 28, 2023

Ah, is that what you are doing? Yes, if the .S is just the clang-produced output, then using the C file on clang platforms makes perfect sense.

The initial plan is to just reuse the cached .s file even with clang to keep things simple and let cmake use the .c file on Windows. I suspect that MacOS will need the autotools based build system changed to use the .c file directly when using any compiler, but I would prefer to leave that for a follow-up patch (that I am willing to volunteer to write). If that is alright with you, then we can merge this.
