Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Micro-optimize fletcher4 calculations #14247

Merged
merged 1 commit into from
Dec 5, 2022
Merged

Conversation

ryao
Copy link
Contributor

@ryao ryao commented Dec 2, 2022

Motivation and Context

When processing abds, we execute 1 kfpu_begin()/kfpu_end() pair on every page in the abd. This is wasteful and slows down checksum performance versus what the fletcher4 benchmark claims it to be.

Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration.

Description

We move the kfpu_begin()/kfpu_end() calls to the init and fini functions so that it is only called once per abd.

We also micro-optimize the checksum calculations by switching to do-while loops to skip the first loop condition check. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in
fletcher_4_incremental_native()/fletcher_4_incremental_byteswap() against 0 sized buffers being passed.

How Has This Been Tested?

The buildbot can test it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ryao ryao changed the title Micro-optimize fletcher4 assembly routines Micro-optimize fletcher4 calculations Dec 2, 2022
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kfpu part should probably make huge effect for Linux to not switch FPU context twice per page. On FreeBSD though all ZFS taskqueue thrteads are globally marked as using FPU, so this change should be NOP, that is why I probably never saw it on my profiles.

Did you leave kfpu_* in fletcher_4_avx512f_byteswap() intentionally or just missed them?

@ryao
Copy link
Contributor Author

ryao commented Dec 2, 2022

Did you leave kfpu_* in fletcher_4_avx512f_byteswap() intentionally or just missed them?

I missed it. I will fix it and repush.

When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance versus what the benchmark claimed. We correct this by moving
those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
non-scalar checksum functions. This means that we do not need to execute
the loop condition for the first loop iteration. That allows us to
micro-optimize the checksum calculations by switching to do-while loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against 0 sized buffers being passed.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
@ryao
Copy link
Contributor Author

ryao commented Dec 2, 2022

I have fixed the oversights.

@ryao ryao requested a review from amotin December 2, 2022 17:52
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find, I agree this might result in a nice improvement on Linux.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Dec 2, 2022
@ryao
Copy link
Contributor Author

ryao commented Dec 2, 2022

kfpu part should probably make huge effect for Linux to not switch FPU context twice per page. On FreeBSD though all ZFS taskqueue thrteads are globally marked as using FPU, so this change should be NOP, that is why I probably never saw it on my profiles.

I have never seen it in profiles on Linux either. This code executes with interrupts disabled on Linux, so the profiler should not have been able to sample it.

@amotin
Copy link
Member

amotin commented Dec 3, 2022

I have never seen it in profiles on Linux either. This code executes with interrupts disabled on Linux, so the profiler should not have been able to sample it.

That is why I am using hardware PMC for profiling, using NMI for sampling and so not caring about any locks or disabled interrupts.

@ryao
Copy link
Contributor Author

ryao commented Dec 3, 2022

I have never seen it in profiles on Linux either. This code executes with interrupts disabled on Linux, so the profiler should not have been able to sample it.

That is why I am using hardware PMC for profiling, using NMI for sampling and so not caring about any locks or disabled interrupts.

How do you do that?

@amotin
Copy link
Member

amotin commented Dec 3, 2022

How do you do that?

On FreeBSD I am using pmcstat, on Linux I think it is perf record. Results then processing with: https://github.com/brendangregg/FlameGraph .

@ryao
Copy link
Contributor Author

ryao commented Dec 3, 2022

How do you do that?

On FreeBSD I am using pmcstat, on Linux I think it is perf record. Results then processing with: https://github.com/brendangregg/FlameGraph .

I am already using perf record, except I am using the -F flag to sample on an interval. perf can use PMU events, but it is not clear that it does for -F, especially on my machine where I have AMD hardware. I will need to look into this some more.

@ryao
Copy link
Contributor Author

ryao commented Dec 3, 2022

It seems that perf record -F uses the PMU too. I just have never noticed fletcher code in my profiles. I will need to do some tests to see what is really happening. I will not be able to do those for a while.

@behlendorf behlendorf merged commit 59493b6 into openzfs:master Dec 5, 2022
@ryao
Copy link
Contributor Author

ryao commented Dec 5, 2022

Someone showed me a profile taken on Linux that showed around 5% to 10% time spent in fletcher4. It just runs so incredibly fast that very little time is spent on it on most workloads. That made me mistakenly think that there was a visibility barrier.

andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Dec 16, 2022
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance versus what the benchmark claimed. We correct this by moving
those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
non-scalar checksum functions. This means that we do not need to execute
the loop condition for the first loop iteration. That allows us to
micro-optimize the checksum calculations by switching to do-while loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against 0 sized buffers being passed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14247
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Dec 17, 2022
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance versus what the benchmark claimed. We correct this by moving
those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
non-scalar checksum functions. This means that we do not need to execute
the loop condition for the first loop iteration. That allows us to
micro-optimize the checksum calculations by switching to do-while loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against 0 sized buffers being passed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14247
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 7, 2023
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on
every page in the abd. This is wasteful and slows down checksum
performance versus what the benchmark claimed. We correct this by moving
those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the
non-scalar checksum functions. This means that we do not need to execute
the loop condition for the first loop iteration. That allows us to
micro-optimize the checksum calculations by switching to do-while loops.

Note that we do not apply that micro-optimization to the scalar
implementation because there is no check in
`fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()`
against 0 sized buffers being passed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14247
AttilaFueloep added a commit to AttilaFueloep/zfs that referenced this pull request Mar 9, 2023
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in openzfs#14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Signed-off-by: Attila Fülöp <attila@fueloep.org>
AttilaFueloep added a commit to AttilaFueloep/zfs that referenced this pull request Mar 11, 2023
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in openzfs#14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Signed-off-by: Attila Fülöp <attila@fueloep.org>
AttilaFueloep added a commit to AttilaFueloep/zfs that referenced this pull request Mar 11, 2023
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in openzfs#14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Signed-off-by: Attila Fülöp <attila@fueloep.org>
AttilaFueloep added a commit to AttilaFueloep/zfs that referenced this pull request Mar 12, 2023
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in openzfs#14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Signed-off-by: Attila Fülöp <attila@fueloep.org>
behlendorf pushed a commit that referenced this pull request Mar 14, 2023
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in #14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14600
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 16, 2023
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in openzfs#14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes openzfs#14600
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants