
Optimize the fletcher4 neon implementation #14219

Open
wants to merge 1 commit into base: master

Conversation

@kevinzs2048 commented Nov 23, 2022

Optimize the NEON implementation with a stride of 4 and use more NEON registers. The optimization follows the superscalar4 algorithm, so that in each cycle two pipelines are concurrently executing add operations.

Motivation and Context

The original NEON fletcher4 implementation executes its add operations one after another in each cycle, which cannot reach the best performance. Using more NEON registers and restructuring the algorithm increases fletcher4 throughput, as sketched below.
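For illustration, here is a minimal plain-C sketch of the stride-4 accumulation scheme borrowed from superscalar4 (the function name and signature are hypothetical, not taken from the patch). Each of the four (a, b, c, d) streams handles every fourth 32-bit input word, so the additions of different streams have no dependency on each other and can issue on separate pipelines:

#include <stddef.h>
#include <stdint.h>

static void
fletcher4_4stream_sketch(const uint32_t *ip, size_t nwords,
    uint64_t a[4], uint64_t b[4], uint64_t c[4], uint64_t d[4])
{
        size_t i;
        int s;

        /* Consume four 32-bit words per iteration, one per stream. */
        for (i = 0; i + 4 <= nwords; i += 4) {
                for (s = 0; s < 4; s++) {
                        a[s] += ip[i + s];
                        b[s] += a[s];
                        c[s] += b[s];
                        d[s] += c[s];
                }
        }
        /*
         * The four partial (a, b, c, d) streams are later recombined into
         * the final fletcher4 sums with a fixed mixing matrix in the
         * _fini step, which is omitted here.
         */
}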

Description

Optimize the fletcher4 neon implementation

How Has This Been Tested?

The new NEON optimization achieves an 18% improvement for native and 12% for byteswap.

    The original neon:

    0 0 0x01 -1 0 62475043711240 62475677462040
    implementation   native         byteswap
    scalar           4465160644     4037798914
    superscalar      6233546354     5449837582
    superscalar4     7310461184     5828498374
    aarch64_neon     8042460500     7427753772
    fastest          aarch64_neon   aarch64_neon

    The optimized neon:
    0 0 0x01 -1 0 69056393017400 69057006899240
    implementation   native         byteswap
    scalar           4464210145     4048399675
    superscalar      6263333631     5455224618
    superscalar4     7309441985     5799164892
    aarch64_neon     9550742326     8365852880
    fastest          aarch64_neon   aarch64_neon

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@kevinzs2048 force-pushed the fletcher4-neon branch 4 times, most recently from 2507c0e to 10c0591, on November 23, 2022 08:48
@rincebrain

On which platform are those results?

I worry that this might be better on some platforms and worse on others.

@kevinzs2048 commented Nov 25, 2022

Hi @rincebrain, thanks for the comment.

The implementation just uses two add pipelines concurrently, so that in each cycle the Arm64 CPU can execute the add operations out of order and accelerate the loop.

I got the result above on an Arm Neoverse N1 machine (Ampere Altra, 160 cores, 512 GB RAM, NVMe).

Besides, I have tested on two other platforms and seen the same kind of improvement:

Armv8.2 8C/16G VM (Huawei Cloud, should be Kunpeng 920):

The Original Neon:
0 0 0x01 -1 0 2111075643267260 2111076173754880
implementation   native         byteswap
scalar           3512582071     3112101740
superscalar      4908144542     4314861222
superscalar4     5508926730     4566048868
aarch64_neon     4642335831     4078792593
fastest          superscalar4   superscalar4

The Optimized Neon:
0 0 0x01 -1 0 2111006840726770 2111007375065550
implementation   native         byteswap
scalar           3478960203     3146632656
superscalar      4967788700     4179133744
superscalar4     5419802683     4595137877
aarch64_neon     5204141670     4372505460
fastest          superscalar4   superscalar4

On Alibaba Cloud, Yitian 710 (Armv9), 2C/8G, NVMe:

The original Neon:
0 0 0x01 -1 0 1090109187580 1090527040620
implementation   native         byteswap
scalar           4750437747     4257269008
superscalar      6172451141     5765366323
superscalar4     8332116251     6786899676
aarch64_neon     7559075098     7612305123
fastest          superscalar4   aarch64_neon

The optimized Neon:
0 0 0x01 -1 0 954157419780 954575079600
implementation   native         byteswap
scalar           4725923088     4239282790
superscalar      6125396500     5788200874
superscalar4     8318235725     6720994776
aarch64_neon     8767235685     7813240937
fastest          aarch64_neon   aarch64_neon

@KungFuJesus

It's a little bit surprising that the superscalar implementations are winning on that many armv8 implementations. Does the NEON pipeline on these things only have 1 execution port or something?

On x86 I think it's universally faster to use SIMD for the checksums.

@@ -88,7 +88,7 @@ typedef struct zfs_fletcher_avx512 {
} zfs_fletcher_avx512_t;

typedef struct zfs_fletcher_aarch64_neon {
uint64_t v[2] __attribute__((aligned(16)));
Contributor:

Why do we drop the aligned attribute?

Contributor:

I have a patch doing that pending anyway, because we're lying when we claim that.

Contributor:

Good point.

Author:

Not quite sure here; since I rewrote the NEON fletcher4 following superscalar4, I needed to align with the superscalar4 data structure.

@ryao commented Nov 25, 2022

It's a little bit surprising that the superscalar implementations are winning on that many armv8 implementations. Does the NEON pipeline on these things only have 1 execution port or something?

One possibility is that the SIMD hardware in these ARM processors is emulating SIMD via scalar operations rather than actually doing SIMD in hardware. Is there any public documentation on how the NEON instructions in those processor cores work?

@ryao commented Nov 25, 2022

One possibility is that the SIMD hardware in these ARM processors is emulating SIMD via scalar operations rather than actually doing SIMD in hardware. Is there any public documentation on how the NEON instructions in those processor cores work?

Thinking about this some more, it would make sense that custom processors used by hyperscalers would emulate NEON instructions in scalar operations. Emulation would allow the die area to be reused for other things like more cores or more cache, or simply reduced for better yields. If I were designing a processor for a hyperscaler, emulating the NEON instructions seems like it would be a fairly straightforward decision, especially since most of their customers are moving SIMD workloads to GPGPU code these days.

@kevinzs2048

It's a little bit surprising that the superscalar implementations are winning on that many armv8 implementations. Does the NEON pipeline on these things only have 1 execution port or something?

One possibility is that the SIMD hardware in these ARM processors is emulating SIMD via scalar operations rather than actually doing SIMD in hardware. Is there any public documentation on how the NEON instructions in those processor cores work?

Usually the NEON implementation is better than the superscalar ones; according to the tests, on most CPUs the NEON version wins. But on some particular CPUs the NEON instruction latency is much higher than that of ordinary CPU instructions; that is a hardware problem and should be fixed in future silicon revisions.

So generally the NEON one is better.

@kevinzs2048

Yes, another possible reason is the one @ryao referred to: the tests on the hyperscalers were run in VMs, while the better NEON results I got are from a bare-metal machine.
I will also do some tests on new bare-metal machines and post the results here.

@ryao commented Nov 26, 2022

Yes, another possible reason is the one @ryao referred to: the tests on the hyperscalers were run in VMs, while the better NEON results I got are from a bare-metal machine. I will also do some tests on new bare-metal machines and post the results here.

If it is not too much trouble, it would be interesting to see benchmark numbers from an Apple Silicon ARM machine. Their hardware definitely has a performant NEON implementation, since Rosetta 2 translates SSE2 instructions into NEON instructions:

https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/

You would want to use Asahi Linux for that:

https://asahilinux.org/

@KungFuJesus commented Nov 26, 2022

At least from my experience with it on zlib-ng with the adler checksum, native NEON is for sure faster on the M1 and many SIPs based on the Cortex family.

A possible suggestion based on work @dougallj has done on the zlib side of things: has anyone implemented Fletcher4 with umlal2? It'd require 2 instructions for both halves, but it would both widen and accumulate back to a, b, c, & d.

You'd have to do runtime detection to see if it's supported but I suspect it might be faster.

Edit: err, perhaps this was a poor suggestion, given that the vectorized Fletcher implementations don't actually apply the mixing matrix until the finalization of the checksum, and can instead handle things with straight additions, without needing to handle potential overflow at each sum with a rebase the way adler does.

asm("ld1 { %[AA2].4s, %[A3A4].4s }, %[CTX0]\n" \
"ld1 { %[BB2].4s, %[B3B4].4s }, %[CTX1]\n" \
"ld1 { %[CC2].4s, %[C3C4].4s }, %[CTX2]\n" \
"ld1 { %[DD2].2d, %[D3D4].2d }, %[CTX3]\n" \
Contributor:

It is a pre-existing issue, but this likely is broken on big endian:

https://gist.github.com/lirenlin/8647388

ldr might be a better choice here. The two probably perform the same.

Author:

@ryao Sorry for the late response here. The ld1-ld4 instructions are dedicated to NEON registers, and it is easy to use ld1-ld4 to load up to four registers at once. ldr, by contrast, is a simpler instruction that operates on just one register [1]. So ld1-ld4 can actually save some CPU cycles when loading the data from memory.

In my opinion, we don't need to worry about the big-endian issue. The link you referred to shows the problem with ld1 on big endian. Arm64 supports both little endian and big endian, but the endianness choice is IMPLEMENTATION DEFINED [2], and in practice there is no real big-endian Arm64 hardware. The lack of use cases and the gaps in upper-layer software support hold back Arm64BE; for example, there is no Go support for Arm64BE [3].

  1. https://developer.arm.com/documentation/102159/0400/Load-and-store---data-structures
  2. https://developer.arm.com/documentation/102376/0100/Alignment-and-endianness
  3. https://stackoverflow.com/questions/69174919/can-go-build-support-armbe-arm64be

for (; ip < ipend; ip += 2) {
NEON_MAIN_LOOP(NEON_DONT_REVERSE);
for (; ip < ipend; ip += 4) {
NEON_MAIN_LOOP_LOAD(NEON_DO_NOT_REVERSE);
@ryao commented Nov 27, 2022

It might be helpful if we had the assembly/disassembly from this loop to study. For example, Clang generates the following for a version of the fletcher4 code that I will probably put into a PR soon:

.LBB0_2:                                // =>This Inner Loop Header: Depth=1
        ldr     q16, [x1], #16
        uaddw   v4.2d, v4.2d, v16.2s
        uaddw2  v0.2d, v0.2d, v16.4s
        add     v1.2d, v0.2d, v1.2d
        add     v5.2d, v4.2d, v5.2d
        add     v6.2d, v5.2d, v6.2d
        add     v2.2d, v1.2d, v2.2d
        add     v3.2d, v2.2d, v3.2d
        add     v7.2d, v6.2d, v7.2d
        cmp     x1, x8
        b.lo    .LBB0_2

I am not an expert on ARM assembly, but this looks as tight as possible. It is unfortunate that GCC generates much worse assembly from the same code that Clang compiled to generate this. The NEON_MAIN_LOOP_LOAD code looks similar to this, although I wonder what GCC generated around it.

Author:

Nice point, I will rewrite the asm here to follow this pattern.

@ryao commented Nov 29, 2022

Here is a sandbox you can use to see the assembly generated by clang:

https://gcc.godbolt.org/z/aP9ff3jPh

I realize that the x86-64 “version” of clang was used, but the beauty of LLVM is that one binary can support multiple architectures by setting the target. Since that site does not have a clang explicitly for aarch64/arm, I set a target as a compiler flag instead.

You can grab the byteswap version from #14234 and put it there to see what is generated for the byteswap case. Make sure to drop the static keyword when you do or else you will not get any assembly since the compiler will think it is dead code.

If you are compiling the code for arm/aarch64 locally, you can use objdump -d /path/to/zfs_fletcher_aarch64_neon.o to see the assembly that gcc generated. I suggest building with --enable-debuginfo if you do, so that the static symbols will be marked properly in the disassembly. That makes it fairly simple to identify the main loop. You just need to look for the basic block that can jump to its own start inside the function’s disassembly.

NEON_MAIN_LOOP(NEON_DONT_REVERSE);
for (; ip < ipend; ip += 4) {
NEON_MAIN_LOOP_LOAD(NEON_DO_NOT_REVERSE);
PREF1KL1(ip, 1024)
Contributor:

Have we confirmed that prefetch helps here? It is possible for prefetch instructions to be harmful. In this case, on the last loop iteration, this instruction will have prefetched 1008 bytes into L1 cache that are unlikely to ever be used, which means that we are filling the L1 cache with garbage that we are unlikely to use. This might not show a performance issue in the benchmark, but it could slow down the system elsewhere.

In addition, Linux mainline has had various less than flattering things to say about explicit prefetch instructions over the years:

https://lwn.net/Articles/404103/
https://lwn.net/Articles/444344/
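If the prefetch is kept, a bounds check would at least avoid the stale tail prefetches described above. A minimal sketch using the GCC/Clang __builtin_prefetch builtin (the helper name and the 1 KiB distance are illustrative, not the PR's PREF1KL1 macro):

#include <stdint.h>

/*
 * Prefetch 1 KiB ahead of the current position, but skip the prefetch
 * once the target would fall past the end of the buffer, so the tail of
 * the loop does not pull soon-to-be-unused lines into L1.
 */
static inline void
prefetch_ahead_sketch(const uint32_t *ip, const uint32_t *ipend)
{
        if ((const char *)ip + 1024 < (const char *)ipend)
                __builtin_prefetch((const char *)ip + 1024, 0, 3);
}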

Author:

@ryao Thanks for pointing this out. I ran some tests on the following platforms and only get a performance increase from the prefetch on the Kunpeng 920 VM. For the Ampere Altra and AWS Graviton 3, the L1 prefetch does not give any improvement. So I agree with you that we can remove the prefetch code here.

Ampere Altra:
With L1 Prefetch
0 0 0x01 -1 0 20253112240840 20293116392520
implementation   native         byteswap
scalar           4465636046     3986526251
superscalar      6288311844     5471735330
superscalar4     7267898111     5856494177
aarch64_neon     9549582587     7824610103
fastest          aarch64_neon   aarch64_neon

Without
0 0 0x01 -1 0 20361322942120 20362454024320
implementation   native         byteswap
scalar           4441989324     4047774560
superscalar      6273451195     5428114404
superscalar4     7273948180     5848654377
aarch64_neon     9448333032     7895309088
fastest          aarch64_neon   aarch64_neon


Kunpeng 920 VM
With Pre-L1
0 0 0x01 -1 0 28699859899380 28700387328790
implementation   native         byteswap
scalar           3505946470     3087293256
superscalar      4915305602     4253383496
superscalar4     5473091929     4556896233
aarch64_neon     5199690074     4318504180
fastest          superscalar4   superscalar4

without Prefetch L1
0 0 0x01 -1 0 28592205443970 28592735589200
implementation   native         byteswap
scalar           3485493950     3145971812
superscalar      4925580275     4170573437
superscalar4     5476808168     4574764270
aarch64_neon     4804224295     4147109889
fastest          superscalar4   superscalar4

@rdolbeau

If it is not too much trouble, it would be interesting to see benchmark numbers from an Apple Silicon ARM machine.

It would also be interesting to check on AWS Graviton 3, with both the original and updated NEON versions, as the added parallelism should shine on the Neoverse V1 core: it has 4 full (128-bit) NEON pipelines. At some point, someone will have to tackle the SVE version of the Fletcher & RAID-Z code, if only to see whether that's better than NEON on V1 and A64FX. It's on my TODO list but I haven't found the time yet.

@kevinzs2048

Hi @rincebrain,

I'm also interested in SVE support for RAID-Z and Fletcher4. We have A64FX hardware in the lab, and I plan to work on SVE support after finishing this NEON one.

And below are the test results for AWS Graviton 3, 4C/8G:

Original Neon:
0 0 0x01 -1 0 2606383072134 2606800227344
implementation   native         byteswap
scalar           5391857009     4261642208
superscalar      6389451010     5835082525
superscalar4     7949629461     6393385748
aarch64_neon     8476800633     8288433986
fastest          aarch64_neon   aarch64_neon

The Optimized Neon:
0 0 0x01 -1 0 2665497513653 2665914434828
implementation   native         byteswap
scalar           5429952028     4254408151
superscalar      6367383868     5858813389
superscalar4     7946715017     6400527387
aarch64_neon     11161079730    10121251559
fastest          aarch64_neon   aarch64_neon

@rdolbeau

I'm also interested in SVE support for RAID-Z and Fletcher4. We have A64FX hardware in the lab, and I plan to work on SVE support after finishing this NEON one.

Beware that A64FX has 512-bit SVE while Graviton 3 (Neoverse V1) has 256-bit. The code needs to be scalable, or to adapt to the machine. I haven't looked at Fletcher recently, but IIRC the algorithm is width-dependent (well, dependent on the number of parallel computations). Also, I don't know about integer, but for FP the SVE latencies are quite high on A64FX; it's a throughput machine, not a latency one.
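A width-agnostic SVE version would have to discover the vector length at run time. A minimal sketch assuming the ACLE intrinsics from <arm_sve.h> (the helper name is hypothetical):

#include <arm_sve.h>
#include <stdint.h>

/*
 * Number of 64-bit accumulator lanes one SVE vector provides on the
 * running machine: 2 on a 128-bit implementation, 4 on the 256-bit
 * Neoverse V1, 8 on the 512-bit A64FX.  A scalable fletcher4 would have
 * to derive its stream count (and its recombination step) from this
 * value instead of hard-coding it.
 */
static inline uint64_t
fletcher4_sve_lanes(void)
{
        return (svcntd());
}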

And below are the test results for AWS Graviton 3, 4C/8G:

Thanks! As expected/hoped, NEON is quite good (vs. superscalar4) and the unrolled version really shines, with a >30% improvement on native. The V1 does benefit from those four 128-bit pipelines on this code.

@ryao commented Nov 29, 2022

I haven't looked a Fletcher recently, but IIRC the algorithm is width-dependent (well, number of parallel computations-dependent).

It certainly is. I also just discovered that Zen 3 is capable of doing multiple vector additions in parallel, which made widening the number of parallel additions a performance win:

#14234 (comment)

This effect likely also exists on ARM processors, if any of them are capable of doing multiple vector additions in parallel, i.e., superscalar SIMD.

@behlendorf added the "Status: Code Review Needed (Ready for review and testing)" label on Nov 29, 2022
@ryao added the "Status: Work in Progress (Not yet ready for general review)" label and removed the "Status: Code Review Needed (Ready for review and testing)" label on Nov 29, 2022
@dougallj

This effect likely also exists on ARM processors, if any of them are capable of doing multiple vector additions in parallel, i.e., superscalar SIMD.

Yeah, generally this is possible to varying extents on all but the smallest ARM cores. For example, the widest cores like the Cortex-X2 and Apple M1 can run four SIMD operations per cycle. Your code from above is pretty much perfect for them:

.LBB0_2:
        ldr     q16, [x1], #16
        uaddw   v4.2d, v4.2d, v16.2s
        uaddw2  v0.2d, v0.2d, v16.4s
        add     v1.2d, v0.2d, v1.2d
        add     v5.2d, v4.2d, v5.2d
        add     v6.2d, v5.2d, v6.2d
        add     v2.2d, v1.2d, v2.2d
        add     v3.2d, v2.2d, v3.2d
        add     v7.2d, v6.2d, v7.2d
        cmp     x1, x8
        b.lo    .LBB0_2

The add/uaddw[2] operations give a critical-path latency of two cycles, so you'd want at least eight SIMD operations to take full advantage of the core, and that's exactly how many you have :)

@ryao commented Dec 1, 2022

In theory, #14234 should now support aarch64. It does 8 parallel accumulator streams. GCC 12.2 generates the following for the native case:

        ldp     q21, q20, [x1]
        add     x1, x1, 32
        uaddw   v19.2d, v19.2d, v21.2s
        uaddw2  v18.2d, v18.2d, v21.4s
        uaddw   v17.2d, v17.2d, v20.2s
        uaddw2  v16.2d, v16.2d, v20.4s
        add     v7.2d, v7.2d, v19.2d
        add     v6.2d, v6.2d, v18.2d
        add     v5.2d, v5.2d, v17.2d
        add     v4.2d, v4.2d, v16.2d
        add     v3.2d, v3.2d, v7.2d
        add     v2.2d, v2.2d, v6.2d
        add     v1.2d, v1.2d, v5.2d
        add     v0.2d, v0.2d, v4.2d
        add     v25.2d, v25.2d, v3.2d
        add     v24.2d, v24.2d, v2.2d
        add     v23.2d, v23.2d, v1.2d
        add     v22.2d, v22.2d, v0.2d
        cmp     x2, x1
        bhi     .L3

For the byteswap case, GCC 12.2 generates:

.L8:
        ldp     q1, q0, [x1]
        add     x1, x1, 32
        rev32   v1.16b, v1.16b
        rev32   v0.16b, v0.16b
        uaddw   v21.2d, v21.2d, v1.2s
        uaddw2  v20.2d, v20.2d, v1.4s
        uaddw   v19.2d, v19.2d, v0.2s
        uaddw2  v18.2d, v18.2d, v0.4s
        add     v17.2d, v17.2d, v21.2d
        add     v16.2d, v16.2d, v20.2d
        add     v7.2d, v7.2d, v19.2d
        add     v6.2d, v6.2d, v18.2d
        add     v5.2d, v5.2d, v17.2d
        add     v4.2d, v4.2d, v16.2d
        add     v3.2d, v3.2d, v7.2d
        add     v2.2d, v2.2d, v6.2d
        add     v25.2d, v25.2d, v5.2d
        add     v24.2d, v24.2d, v4.2d
        add     v23.2d, v23.2d, v3.2d
        add     v22.2d, v22.2d, v2.2d
        cmp     x2, x1
        bhi     .L8

It has 1 more instruction than Clang's output, but it is close to optimal for 8 parallel accumulator streams. That might be overkill for the hardware, but I guess it would be good for future hardware. You can see the full disassembly for the important things from various compilers at godbolt:

https://gcc.godbolt.org/z/Kfja8Tzxq

In particular, GCC versions older than GCC 12 generate very suboptimal assembly.

@kevinzs2048

@ryao, thanks a lot! The GCC 12-generated assembly looks quite good for boosting NEON performance. I will follow this approach to rewrite the code.

@ryao commented Dec 1, 2022

@ryao, thanks a lot! The GCC 12-generated assembly looks quite good for boosting NEON performance. I will follow this approach to rewrite the code.

I would suggest staying with the 4 accumulator version here, but optimizing it based on the clang output that I provided earlier. If both PRs are merged and you switch to the 8 accumulator version, we will end up with two implementations that are basically the same (when GCC 12 or later, or Clang is used), which would mean one should be dropped in favor of another. Using different accumulator counts means that they can co-exist.

Also, more accumulator streams is not necessarily better. When the key loop is already optimal, adding more accumulator streams will not speed it up, but it will slow down the recombination needed to be done at the end, which becomes increasingly complex as the accumulator stream count increases.

Having implementations for both the 4 stream and 8 stream levels is likely ideal. I am hopeful that we could use generic GNU C to move away from hard coding assembly. #14234 serves as a proof of concept for that. The current results are good enough that it should be able to be merged based on them. At the moment, that PR needs people to test it to verify it has good performance on both aarch64 and ppc64, plus that it works correctly. I expect it does based on theory, but theory and practice are not always in agreement.

That being said, the review process of this PR was what inspired me to do #14234. I have been very surprised by what I have found as a result of that. I remember thinking about doing vector operations in generic C years ago to support multiple architectures, but compiler technology was not good enough for it at the time. I had forgotten about the idea until @rincebrain suggested it to me independently when I mentioned that I was looking into this. Things are not only different now, but on modern hardware, the result beats the work by Intel in 2015 that I had considered to be the best possible from AVX2, which has amazed me. :)

@ryao commented Dec 1, 2022

@kevinzs2048 The optimal assembly for the 4 accumulator version for aarch64 is likely this:

https://gcc.godbolt.org/z/Gh71o5vPb

Note that I changed the for loop to a do-while loop. This eliminates an unnecessary check to see if the loop will iterate, but it always will unless someone gave us a 0 byte buffer, so we can just assume that.

In theory, you could just save that output as module/zcommon/zfs_fletcher_aarch64_neon_impl.s, conditionally add it to the make file for aarch64 (so that it is assembled and linked) and then call into it from C in module/zcommon/zfs_fletcher_aarch64_neon.c like this:

extern void
fletcher_4_aarch64_neon_native_impl(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size);

extern void
fletcher_4_aarch64_neon_byteswap_impl(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size);

ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_aarch64_neon_native(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size)
{
        kfpu_begin();
        fletcher_4_aarch64_neon_native_impl(ctx, buf, size);
        kfpu_end();
}

ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_aarch64_neon_byteswap(fletcher_4_ctx_t *ctx, const void *buf,
    uint64_t size)
{
        kfpu_begin();
        fletcher_4_aarch64_neon_byteswap_impl(ctx, buf, size);
        kfpu_end();
}

That should give us maximum performance on typical aarch64 processors.

We would want to make sure that this is disabled on big endian builds.
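A sketch of the kind of guard that implies (the macro name is assumed, not taken from the PR): only advertise the implementation on little-endian aarch64, since the cached assembly is generated for little endian.

#if defined(__aarch64__) && defined(__BYTE_ORDER__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define HAVE_FLETCHER_4_AARCH64_NEON_ASM 1
#endif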

@kevinzs2048 commented Dec 2, 2022


Hi @ryao:
Thanks for the detailed advice, really appreciated! I'm working on it now and will update the patch.

@ryao commented Dec 11, 2022

There was a small mistake in the byteswap case for the prefetch code I provided where tend was not initialized to ip:

https://gcc.godbolt.org/z/5YM7K3n6d

That said, you seem to have already realized it:

kevinzs2048@2986024

I thought that I had made the same changes to both, but missed that. Sorry for the trouble that caused.

@kevinzs2048 commented Dec 12, 2022

I think we could split this into two implementations. The current no prefetch version will be renamed to neon from aarch64_neon and given the upgraded no prefetch assembly. This is more consistent with how things are on other platforms. The prefetch version would then be named neon_prefetch and be a second patch. The difference on the Kunpeng 920 should meet the threshold for another implementation being worthwhile.

That said, would you check if this version improves the byteswap case on the altra?

https://gcc.godbolt.org/z/Pefexjj48

If the cores work the way I think they work, it should make the byteswap case perform identically to the native case.

@ryao I agree with you, will submit another patch for aarch64_neon_prefetch to address this.

For the performance, here are results for the three variants:

The above link's code
0 0 0x01 -1 0 1444940378440 1445635521280
implementation   native         byteswap
scalar           4464590296     4042936458
superscalar      6250453028     5416338231
superscalar4     7284307051     5885751171
aarch64_neon     11673759602    8098362681
fastest          aarch64_neon   aarch64_neon

Prefetch L1:
0 0 0x01 -1 0 1531826195440 1532457802000
implementation   native         byteswap
scalar           4466492023     4002504007
superscalar      6208816650     5434444156
superscalar4     7304350249     5882614305
aarch64_neon     11162584720    9337275155
fastest          aarch64_neon   aarch64_neon

The original commit:

0 0 0x01 -1 0 1567011165720 1567651130720
implementation   native         byteswap
scalar           4464115117     4065035859
superscalar      6230953442     5433036269
superscalar4     7304859103     5830929210
aarch64_neon     11757093735    9937539093
fastest          aarch64_neon   aarch64_neon

@kevinzs2048

There was a small mistake in the byteswap case for the prefetch code I provided where tend was not initialized to ip:

https://gcc.godbolt.org/z/5YM7K3n6d

That said, you seem to have already realized it:

kevinzs2048@2986024

I thought that I had made the same changes to both, but missed that. Sorry for the trouble that caused.

No worries, after the core dump I found that the register is not right. :-)

@ryao commented Dec 12, 2022

For the performance, here are results for the three variants:

The above link's code
0 0 0x01 -1 0 1444940378440 1445635521280
implementation   native         byteswap
scalar           4464590296     4042936458
superscalar      6250453028     5416338231
superscalar4     7284307051     5885751171
aarch64_neon     11673759602    8098362681
fastest          aarch64_neon   aarch64_neon

The original commit:

0 0 0x01 -1 0 1567011165720 1567651130720
implementation   native         byteswap
scalar           4464115117     4065035859
superscalar      6230953442     5433036269
superscalar4     7304859103     5830929210
aarch64_neon     11757093735    9937539093
fastest          aarch64_neon   aarch64_neon

That means that the core design does not work the way I thought it did. I guess the previous version is the best we can do there for the byteswap case.

@kevinzs2048

@ryao Please help review again. There are some jobs failing; it looks like the CI jobs for the master commit also failed: #14287

@ryao commented Dec 14, 2022

@kevinzs2048 The buildbot is sometimes flaky. That is not the result of anything you did here.

That being said, there are a few problems with the current version that will cause it to generate incorrect checksums. This is easy to verify by creating a pool with an unpatched driver and then trying to use it with the patched driver:

# Make test pool
truncate -s 1G /tmp/zfs
sudo zpool create tank /tmp/zfs

# This loads the new modules
sudo ./scripts/zfs.sh -v -r

# Import old pool and encounter failures, assuming that the neon implementation performs best.
sudo zpool import -d /tmp tank

The definition of zfs_fletcher_aarch64_neon_t needs to be changed from having v[2] to v[4] because this code switches to 4 accumulator streams from 2. The number of additions confusingly remains the same because the 2 accumulator version has been unrolled to do 2 iterations per loop. I mentioned that here:

#14219 (comment)

We avoided writing beyond our heap allocation only because zfs_fletcher_aarch64_neon_t is part of a union type with zfs_fletcher_superscalar_t. Had it not been for that, the benchmark would have been writing beyond the end of the memory allocation.
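For reference, a sketch of the context change being described here, keeping the field layout of the existing header (whether to keep the alignment attribute is discussed separately above):

typedef struct zfs_fletcher_aarch64_neon {
        uint64_t v[4] __attribute__((aligned(16)));
} zfs_fletcher_aarch64_neon_t;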

The 2 stream mixing matrix in fletcher_4_aarch64_neon_fini() needs to be replaced with the 4 stream mixing matrix from module/zcommon/zfs_fletcher_intel.c. When that is done, avx in it should be changed to aarch64_neon (or even better, we can rename aarch64_neon to just neon in fletcher_4_ctx_t to be consistent with everything else in that union). I mentioned the need to adopt the 4 stream mixing matrix here:

#14219 (review)

In hindsight, I should have explained where to find the 4 stream mixing matrix.

Also, after you change the mixing matrix, the indentation will need to be fixed according to what ./scripts/cstyle.pl module/zcommon/zfs_fletcher_aarch64_neon.c says.

@kevinzs2048 force-pushed the fletcher4-neon branch 4 times, most recently from a60a7f3 to 634f7df, on December 19, 2022 08:09
Optimize the neon implementation based on the code generated by GCC 12
or Clang for the approach in openzfs#14234.

Due to compiler limitations, GCC versions before 12 produce poorly
optimized code for aarch64, so the asm function is used here to get
the full performance.

Performance Data for Arm64 N1 Server Ampere Altra, 160C, 512G
The original neon implementation:
0 0 0x01 -1 0 62475043711240 62475677462040
implementation   native         byteswap
scalar           4465160644     4037798914
superscalar      6233546354     5449837582
superscalar4     7310461184     5828498374
aarch64_neon     8042460500     7427753772
fastest          aarch64_neon   aarch64_neon

The Optimized Neon:
0 0 0x01 -1 0 582868704627960 582869571278240
implementation   native         byteswap
scalar           4466111548     4006174065
superscalar      6217652465     5435852773
superscalar4     7305877024     5884760221
aarch64_neon     11783959542    10005496183
fastest          aarch64_neon   aarch64_neon

Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
Co-authored-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
@ryao left a comment

This looks good to me.

@ryao added the "Status: Code Review Needed (Ready for review and testing)" label and removed the "Status: Work in Progress (Not yet ready for general review)" label on Jan 8, 2023
@kevinzs2048

@behlendorf Could you help review this? Thanks! @ryao has already left a comment here.

@ryao commented Jan 11, 2023

He just got back from a break and has a backlog of PRs to review. Marking this PR with the code review needed flag put it on a list of PRs that should have his attention soon.

The last couple weeks are a slow time of year for getting things merged since many project members are on vacation.

@kevinzs2048

@ryao Thanks for the info, no hurry; I will wait.


void
fletcher_4_aarch64_neon_native(
fletcher_4_ctx_t *ctx, const void *buf, uint64_t size)
Contributor:

nit: extra tab


@@ -0,0 +1,121 @@
/*
* GNU C source file that we manually compile with Clang for aarch64.
* The compiled assembly goes into:
Contributor:

Can you extend this comment to show the exact steps to generate module/zcommon/zfs_fletcher_aarch64_neon_impl.S?

Contributor:

clang -O2 --target=aarch64 -S -o module/zcommon/zfs_fletcher_aarch64_neon_impl.S contrib/zcommon/zfs_fletcher_aarch64_neon_impl.c should do it.

@behlendorf left a comment

Thanks for the contribution! This LGTM based on your test results, though I wasn't able to test it myself. It would have been nice to be able to test this on a wider range of hardware, but I think we can still move forward with it.

@ryao commented Jan 13, 2023

Thanks for the contribution! This LGTM based on your test results, though I wasn't able to test it myself. It would have been nice to be able to test this on a wider range of hardware, but I think we can still move forward with it.

This makes a few fundamental improvements over the existing code.

  • It switches from the 2 stream 2 loop iteration unrolled version to the 4 stream version.
  • The load instruction and loop calculations prior to the cmp instruction have been combined.
  • Widening additions have been used to avoid the need for explicit widening from 32-bit to 64-bit via the zip instructions.

The result should perform better than the existing code on all aarch64 CPUs where we are not already memory bandwidth limited.
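For readers following the C side rather than the assembly, here is an intrinsics sketch of the widening-add pattern in the new inner loop (uaddw/uaddw2 followed by plain 64-bit adds). It assumes <arm_neon.h> and a hypothetical helper name; the PR itself ships Clang-generated assembly rather than intrinsics:

#include <arm_neon.h>
#include <stdint.h>

/*
 * One step over four 32-bit input words.  The 4 accumulator streams are
 * split across eight 64x2 registers: acc[0..1] hold the low/high halves
 * of a, acc[2..3] of b, acc[4..5] of c, acc[6..7] of d.
 */
static inline void
fletcher4_neon_step_sketch(const uint32_t *ip, uint64x2_t acc[8])
{
        uint32x4_t w = vld1q_u32(ip);

        acc[0] = vaddw_u32(acc[0], vget_low_u32(w));  /* a += w (widening, low)  */
        acc[1] = vaddw_high_u32(acc[1], w);           /* a += w (widening, high) */
        acc[2] = vaddq_u64(acc[2], acc[0]);           /* b += a */
        acc[3] = vaddq_u64(acc[3], acc[1]);
        acc[4] = vaddq_u64(acc[4], acc[2]);           /* c += b */
        acc[5] = vaddq_u64(acc[5], acc[3]);
        acc[6] = vaddq_u64(acc[6], acc[4]);           /* d += c */
        acc[7] = vaddq_u64(acc[7], acc[5]);
}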

@behlendorf

I've gone ahead and pulled in #14228 to master. If you could rebase this PR, make any needed updates, and address those two minor outstanding comments, this should be good to go. Thanks!

@ryao commented Jan 19, 2023

@behlendorf It occurs to me that we might be able to treat this in a special way considering that this file is machine generated by LLVM/Clang and Windows support is being developed exclusively around LLVM/Clang. What I am thinking is:

  • On UNIX systems, we use the cached assembly.

  • On NT systems, we compile the GNU C code directly, with appropriate compiler flags so that neon instructions are emitted.

In theory, this should avoid the need for us to manually fix the output of the compiler for the same assembly to be usable by both.

@lundman What do you think?

If people are not happy with that idea, another possibility would be to tell LLVM/Clang to output a Windows version for Windows systems.

@lundman commented Feb 27, 2023

I've not read the PR fully, initial thoughts are generally:

At the moment, NT produced assembly files are useless to me, since I do everything with clang. The only thing we needed to change was the calling convention (registers used).

If I had a vote, I'd prefer a solution like aarch64/asm_linkage.h, with the assembly files using ENTRY(), SETSIZE, etc.
Similar to M1/ARM64 blake3 I did over here:

openzfsonosx@51a4ee6

If I have no vote, just do what's good for you guys, and I'll port around it later.

....

But I have no Windows/arm port at all yet, and M1/ARM64 has no neon.h, so I've been unable to do anything NEON there.

@ryao commented Feb 28, 2023


In an ideal world, we would not be using assembly at all here and would just compile the .c file directly. However, GCC generates really bad code, so we need to cache the .s file from Clang to avoid bad code generation when using GCC. Unfortunately, implementing that workaround now means that we have a .s file that is not compliant with the coding standard used for the existing .s files in the tree to make them portable. However, since this .s file is actually compiled from C code, rather than editing it by hand to make it compliant with that coding standard, we could just compile the .c file with Clang directly on the platforms that cannot use the cached .s file.

My question was if you are okay with making that exception to the rule. I feel like this should not cause any work for you. The only thing you would need to do differently is have cmake grab the .c file instead of the .s file.

@lundman commented Feb 28, 2023

Ah, is that what you are doing? Yes, if the .S is just the clang-produced output, then using the C file on clang platforms makes perfect sense.

@ryao commented Feb 28, 2023

Ah, is that what you are doing? Yes, if the .S is just the clang-produced output, then using the C file on clang platforms makes perfect sense.

The initial plan is to just reuse the cached .s file even with clang to keep things simple and let cmake use the .c file on Windows. I suspect that MacOS will need the autotools based build system changed to use the .c file directly when using any compiler, but I would prefer to leave that for a follow-up patch (that I am willing to volunteer to write). If that is alright with you, then we can merge this.
