Micro-optimize fletcher4 calculations #14247
Conversation
The kfpu part should probably have a significant effect on Linux, since it avoids switching the FPU context twice per page. On FreeBSD, though, all ZFS taskqueue threads are globally marked as using the FPU, so this change should be a no-op there, which is probably why I never saw it in my profiles.
Did you leave the kfpu_* calls in fletcher_4_avx512f_byteswap() intentionally, or did you just miss them?
I missed it. I will fix it and repush.
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and slows down checksum performance versus what the benchmark claimed. We correct this by moving those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration. That allows us to micro-optimize the checksum calculations by switching to do-while loops. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` against 0 sized buffers being passed.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
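For illustration, here is a minimal C sketch of the `kfpu_begin()`/`kfpu_end()` relocation that commit describes. The `example_*` function names are hypothetical, and `fletcher_4_ctx_t`, `zio_cksum_t`, and the `kfpu_*()` macros are assumed to come from the OpenZFS kernel headers; this is a sketch of the idea, not the upstream code:

```c
/*
 * Illustrative sketch only -- assumes the OpenZFS kernel headers for
 * fletcher_4_ctx_t, zio_cksum_t, and the kfpu_begin()/kfpu_end() macros.
 */

/* Before: called once per page/chunk of the abd, so the FPU context
 * was saved and restored for every chunk. */
static void
example_fletcher_4_simd_compute_old(fletcher_4_ctx_t *ctx,
    const void *buf, uint64_t size)
{
	kfpu_begin();
	/* ... vectorized accumulation over buf ... */
	kfpu_end();
}

/* After: the FPU state is saved once in init() ... */
static void
example_fletcher_4_simd_init(fletcher_4_ctx_t *ctx)
{
	kfpu_begin();
	/* ... zero the accumulators in ctx ... */
}

/* ... and restored once in fini(), after folding the result. */
static void
example_fletcher_4_simd_fini(fletcher_4_ctx_t *ctx, zio_cksum_t *zcp)
{
	/* ... reduce the accumulators into *zcp ... */
	kfpu_end();
}
```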
I have fixed the oversights.
Good find, I agree this might result in a nice improvement on Linux.
I have never seen it in profiles on Linux either. This code executes with interrupts disabled on Linux, so the profiler should not have been able to sample it.
That is why I am using hardware PMCs for profiling: sampling is driven by NMIs, so it does not care about any locks or disabled interrupts.
How do you do that?
On FreeBSD I am using
I am already using perf record, except I am using the -F flag to sample on an interval. perf can use PMU events, but it is not clear that it does so for -F, especially on my machine, which has AMD hardware. I will need to look into this some more.
It seems that perf record -F uses the PMU too. I just have never noticed fletcher code in my profiles. I will need to do some tests to see what is really happening. I will not be able to do those for a while.
Someone showed me a profile taken on Linux that showed around 5% to 10% of time spent in fletcher4. It just runs so incredibly fast that very little time is spent on it in most workloads. That made me mistakenly think that there was a visibility barrier.
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and slows down checksum performance versus what the benchmark claimed. We correct this by moving those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration. That allows us to micro-optimize the checksum calculations by switching to do-while loops. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` against 0 sized buffers being passed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14247
Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in #14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS.

To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14600
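A rough C sketch of the refactor that commit describes. The struct below is a simplified, hypothetical stand-in for the real `fletcher_4_ops_t` (which has more fields), and `example_fletcher_4_native_impl()` is an illustrative wrapper rather than the actual call site:

```c
/*
 * Sketch only -- assumes the OpenZFS kernel headers for boolean_t,
 * fletcher_4_ctx_t, zio_cksum_t, and the kfpu_begin()/kfpu_end() macros.
 */
typedef struct example_fletcher_4_ops {
	void (*init_native)(fletcher_4_ctx_t *);
	void (*compute_native)(fletcher_4_ctx_t *, const void *, uint64_t);
	void (*fini_native)(fletcher_4_ctx_t *, zio_cksum_t *);
	/* ... byteswap variants, name, valid(), etc. ... */
	boolean_t uses_fpu;	/* new member: implementation needs the FPU */
} example_fletcher_4_ops_t;

/* Generic call site: FPU handling no longer hides inside init()/fini(),
 * so platforms with a local FPU state buffer can manage it here. */
static void
example_fletcher_4_native_impl(const example_fletcher_4_ops_t *ops,
    const void *buf, uint64_t size, zio_cksum_t *zcp)
{
	fletcher_4_ctx_t ctx;

	if (ops->uses_fpu)
		kfpu_begin();
	ops->init_native(&ctx);
	ops->compute_native(&ctx, buf, size);
	ops->fini_native(&ctx, zcp);
	if (ops->uses_fpu)
		kfpu_end();
}
```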
Motivation and Context
When processing abds, we execute one `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and makes checksum performance slower than what the fletcher4 benchmark claims. Also, we always check the buffer length against 0 before calling the non-scalar checksum functions, which means the loop condition does not need to be evaluated on the first loop iteration.
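To put that in perspective: with 4 KiB pages and the default 128 KiB recordsize, checksumming a single block this way pays for 32 `kfpu_begin()`/`kfpu_end()` pairs where one would suffice.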
Description
We move the `kfpu_begin()`/`kfpu_end()` calls to the init and fini functions so that they are executed only once per abd. We also micro-optimize the checksum calculations by switching to do-while loops, which skips the first loop condition check. Note that we do not apply that micro-optimization to the scalar implementation because `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` do not check for 0 sized buffers being passed.
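As a standalone illustration of the loop change (plain C with hypothetical `example_*` helpers, not the actual fletcher4 loops):

```c
#include <stdint.h>

/* Before: the loop condition is also evaluated before the first pass. */
static uint64_t
example_sum_while(const uint64_t *ip, const uint64_t *ipend)
{
	uint64_t a = 0;

	while (ip < ipend)
		a += *ip++;
	return (a);
}

/*
 * After: one compare-and-branch fewer per call.  This is only correct
 * because the callers have already rejected zero-length buffers.
 */
static uint64_t
example_sum_do_while(const uint64_t *ip, const uint64_t *ipend)
{
	uint64_t a = 0;

	do {
		a += *ip++;
	} while (ip < ipend);
	return (a);
}
```

The saving is a single comparison and branch per call, which only matters because these loops sit on the hot checksum path.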
How Has This Been Tested?
The buildbot can test it.
Types of changes
Checklist:
`Signed-off-by`.