Micro-optimize fletcher4 calculations #14247
Conversation
The kfpu part should probably have a significant effect on Linux, since it avoids switching the FPU context twice per page. On FreeBSD, though, all ZFS taskqueue threads are globally marked as using the FPU, so this change should be a no-op there, which is probably why I never saw it in my profiles.
Did you leave the kfpu_* calls in fletcher_4_avx512f_byteswap() intentionally, or did you just miss them?
I missed it. I will fix it and repush.
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and slows down checksum performance versus what the benchmark claimed. We correct this by moving those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration. That allows us to micro-optimize the checksum calculations by switching to do-while loops. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` against 0 sized buffers being passed.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
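For illustration, here is a minimal C sketch of the `kfpu_begin()`/`kfpu_end()` relocation that commit describes. The `example_*` function names are hypothetical, and `fletcher_4_ctx_t`, `zio_cksum_t`, and the `kfpu_*()` macros are assumed to come from the OpenZFS kernel headers; this is a sketch of the idea, not the upstream code:

```c
/*
 * Illustrative sketch only -- assumes the OpenZFS kernel headers for
 * fletcher_4_ctx_t, zio_cksum_t, and the kfpu_begin()/kfpu_end() macros.
 */

/* Before: called once per page/chunk of the abd, so the FPU context
 * was saved and restored for every chunk. */
static void
example_fletcher_4_simd_compute_old(fletcher_4_ctx_t *ctx,
    const void *buf, uint64_t size)
{
	kfpu_begin();
	/* ... vectorized accumulation over buf ... */
	kfpu_end();
}

/* After: the FPU state is saved once in init() ... */
static void
example_fletcher_4_simd_init(fletcher_4_ctx_t *ctx)
{
	kfpu_begin();
	/* ... zero the accumulators in ctx ... */
}

/* ... and restored once in fini(), after folding the result. */
static void
example_fletcher_4_simd_fini(fletcher_4_ctx_t *ctx, zio_cksum_t *zcp)
{
	/* ... reduce the accumulators into *zcp ... */
	kfpu_end();
}
```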
I have fixed the oversights.
Good find, I agree this might result in a nice improvement on Linux.
I have never seen it in profiles on Linux either. This code executes with interrupts disabled on Linux, so the profiler should not have been able to sample it.
That is why I am using hardware PMCs for profiling: sampling is driven by NMIs, so it does not care about any locks or disabled interrupts.
How do you do that?
On FreeBSD I am using
I am already using perf record, except I am using the -F flag to sample on an interval. perf can use PMU events, but it is not clear that it does so for -F, especially on my machine, which has AMD hardware. I will need to look into this some more.
It seems that perf record -F uses the PMU too. I just have never noticed fletcher code in my profiles. I will need to do some tests to see what is really happening. I will not be able to do those for a while.
Someone showed me a profile taken on Linux that showed around 5% to 10% of time spent in fletcher4. It just runs so incredibly fast that very little time is spent on it in most workloads. That made me mistakenly think that there was a visibility barrier.
When processing abds, we execute 1 `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and slows down checksum performance versus what the benchmark claimed. We correct this by moving those calls to the init and fini functions.

Also, we always check the buffer length against 0 before calling the non-scalar checksum functions. This means that we do not need to execute the loop condition for the first loop iteration. That allows us to micro-optimize the checksum calculations by switching to do-while loops. Note that we do not apply that micro-optimization to the scalar implementation because there is no check in `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` against 0 sized buffers being passed.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14247
Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in #14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS.

To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14600
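A rough C sketch of the refactor that commit describes. The struct below is a simplified, hypothetical stand-in for the real `fletcher_4_ops_t` (which has more fields), and `example_fletcher_4_native_impl()` is an illustrative wrapper rather than the actual call site:

```c
/*
 * Sketch only -- assumes the OpenZFS kernel headers for boolean_t,
 * fletcher_4_ctx_t, zio_cksum_t, and the kfpu_begin()/kfpu_end() macros.
 */
typedef struct example_fletcher_4_ops {
	void (*init_native)(fletcher_4_ctx_t *);
	void (*compute_native)(fletcher_4_ctx_t *, const void *, uint64_t);
	void (*fini_native)(fletcher_4_ctx_t *, zio_cksum_t *);
	/* ... byteswap variants, name, valid(), etc. ... */
	boolean_t uses_fpu;	/* new member: implementation needs the FPU */
} example_fletcher_4_ops_t;

/* Generic call site: FPU handling no longer hides inside init()/fini(),
 * so platforms with a local FPU state buffer can manage it here. */
static void
example_fletcher_4_native_impl(const example_fletcher_4_ops_t *ops,
    const void *buf, uint64_t size, zio_cksum_t *zcp)
{
	fletcher_4_ctx_t ctx;

	if (ops->uses_fpu)
		kfpu_begin();
	ops->init_native(&ctx);
	ops->compute_native(&ctx, buf, size);
	ops->fini_native(&ctx, zcp);
	if (ops->uses_fpu)
		kfpu_end();
}
```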
Motivation and Context
When processing abds, we execute one `kfpu_begin()`/`kfpu_end()` pair on every page in the abd. This is wasteful and makes checksum performance slower than what the fletcher4 benchmark claims. Also, we always check the buffer length against 0 before calling the non-scalar checksum functions, which means the loop condition does not need to be evaluated on the first loop iteration.
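To put that in perspective: with 4 KiB pages and the default 128 KiB recordsize, checksumming a single block this way pays for 32 `kfpu_begin()`/`kfpu_end()` pairs where one would suffice.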
Description
We move the `kfpu_begin()`/`kfpu_end()` calls to the init and fini functions so that they are executed only once per abd. We also micro-optimize the checksum calculations by switching to do-while loops, which skips the first loop condition check. Note that we do not apply that micro-optimization to the scalar implementation because `fletcher_4_incremental_native()`/`fletcher_4_incremental_byteswap()` do not check for 0 sized buffers being passed.
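As a standalone illustration of the loop change (plain C with hypothetical `example_*` helpers, not the actual fletcher4 loops):

```c
#include <stdint.h>

/* Before: the loop condition is also evaluated before the first pass. */
static uint64_t
example_sum_while(const uint64_t *ip, const uint64_t *ipend)
{
	uint64_t a = 0;

	while (ip < ipend)
		a += *ip++;
	return (a);
}

/*
 * After: one compare-and-branch fewer per call.  This is only correct
 * because the callers have already rejected zero-length buffers.
 */
static uint64_t
example_sum_do_while(const uint64_t *ip, const uint64_t *ipend)
{
	uint64_t a = 0;

	do {
		a += *ip++;
	} while (ip < ipend);
	return (a);
}
```

The saving is a single comparison and branch per call, which only matters because these loops sit on the hot checksum path.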
How Has This Been Tested?
The buildbot can test it.
Types of changes
Checklist:
`Signed-off-by`.