zcommon: Refactor FPU state handling in fletcher4 #14600
```diff
@@ -160,6 +160,7 @@ static const fletcher_4_ops_t fletcher_4_scalar_ops = {
 	.fini_byteswap = fletcher_4_scalar_fini,
 	.compute_byteswap = fletcher_4_scalar_byteswap,
 	.valid = fletcher_4_scalar_valid,
+	.uses_fpu = B_FALSE,
 	.name = "scalar"
 };
```
```diff
@@ -458,9 +459,15 @@ fletcher_4_native_impl(const void *buf, uint64_t size, zio_cksum_t *zcp)
 	fletcher_4_ctx_t ctx;
 	const fletcher_4_ops_t *ops = fletcher_4_impl_get();

+	if (ops->uses_fpu == B_TRUE) {
+		kfpu_begin();
+	}
 	ops->init_native(&ctx);
 	ops->compute_native(&ctx, buf, size);
 	ops->fini_native(&ctx, zcp);
+	if (ops->uses_fpu == B_TRUE) {
+		kfpu_end();
+	}
 }

 void
```
```diff
@@ -500,9 +507,15 @@ fletcher_4_byteswap_impl(const void *buf, uint64_t size, zio_cksum_t *zcp)
 	fletcher_4_ctx_t ctx;
 	const fletcher_4_ops_t *ops = fletcher_4_impl_get();

+	if (ops->uses_fpu == B_TRUE) {
+		kfpu_begin();
+	}
 	ops->init_byteswap(&ctx);
 	ops->compute_byteswap(&ctx, buf, size);
 	ops->fini_byteswap(&ctx, zcp);
+	if (ops->uses_fpu == B_TRUE) {
+		kfpu_end();
+	}
 }

 void
```

**Reviewer:** The same applies to the native versions.

**Author:** The intention was to minimize the time we're running with interrupts and preemption disabled, so this is a trade-off between ease of future changes and an actual advantage. I'm fine either way; let's see what other reviewers think.

**Reviewer:** 59493b6 already made a decent reduction in the total time we spend with interrupts disabled, not that long ago, by not doing context save/restore multiple times per ABD.

An extra microsecond or two with interrupts disabled will not matter in the slightest, since the systems we support do not give real-time guarantees. If we were running on a system that tried to give real-time guarantees, and it were so pedantic as to give guarantees accurate to the level where a microsecond or two would matter, we probably would not be allowed to run with interrupts disabled at all, since disabling interrupts to use SIMD could cause the real-time guarantee to be violated. I am not sure even real-time Linux is that strict.

Anyway, what is important here is not minimizing the time with interrupts disabled, but minimizing the total time spent doing processing. If interrupts are disabled slightly longer than absolutely necessary in order to reduce the total CPU time spent, that is preferable. In this case, doing it this way mostly prepares for a future small reduction in CPU time; since we know that possibility exists here, I would prefer to do it that way so that we do not need another patch.

We really are not gaining anything from an extra microsecond or two with interrupts enabled per checksum operation, so there is no need to be frugal on that level.

**Author:** That's sound reasoning, will change.

**Author:** Done.
```diff
@@ -661,6 +674,7 @@ fletcher_4_kstat_addr(kstat_t *ksp, loff_t n)
 	fletcher_4_fastest_impl.init_ ## type = src->init_ ## type;	\
 	fletcher_4_fastest_impl.fini_ ## type = src->fini_ ## type;	\
 	fletcher_4_fastest_impl.compute_ ## type = src->compute_ ## type; \
+	fletcher_4_fastest_impl.uses_fpu = src->uses_fpu;		\
 }

 #define	FLETCHER_4_BENCH_NS	(MSEC2NSEC(1))	/* 1ms */
```
```diff
@@ -816,10 +830,14 @@ abd_fletcher_4_init(zio_abd_checksum_data_t *cdp)
 	const fletcher_4_ops_t *ops = fletcher_4_impl_get();
 	cdp->acd_private = (void *) ops;

+	if (ops->uses_fpu == B_TRUE) {
+		kfpu_begin();
+	}
 	if (cdp->acd_byteorder == ZIO_CHECKSUM_NATIVE)
 		ops->init_native(cdp->acd_ctx);
 	else
 		ops->init_byteswap(cdp->acd_ctx);
 }

 static void
```
```diff
@@ -833,8 +851,13 @@ abd_fletcher_4_fini(zio_abd_checksum_data_t *cdp)
 		ops->fini_native(cdp->acd_ctx, cdp->acd_zcp);
 	else
 		ops->fini_byteswap(cdp->acd_ctx, cdp->acd_zcp);

+	if (ops->uses_fpu == B_TRUE) {
+		kfpu_end();
+	}
 }

 static void
 abd_fletcher_4_simd2scalar(boolean_t native, void *data, size_t size,
     zio_abd_checksum_data_t *cdp)
```
**Reviewer:** It occurs to me that we could mark this structure as requiring 64-byte alignment so that subsequent accesses are L1 cache accesses.

As a more general change, we could also rename `->init_native` and `->fini_native` to `->init` and `->fini` respectively and drop the separate `init_byteswap()`/`fini_byteswap()` functions, since we only need different compute functions to support different endianness. That should be a separate commit, but it is entirely optional.

**Author:** Yeah, I thought about that; it would squeeze the size of the struct below 64 bytes again, currently it is 72 bytes. Let me have a more thorough look tomorrow.

**Reviewer:** Fitting the entire structure in a cacheline does not matter that much here, because the name field is not accessed during normal operation, so fitting everything but that is fine. I just thought it would be nice to simplify this code, because the extra init and fini functions make it slightly harder to understand.

Aligning the structure to a cacheline, on the other hand, should avoid another memory access whenever the fields needed during normal operation would otherwise span two cachelines.

**Author:** I can remove `init_byteswap()`/`fini_byteswap()` in a follow-up PR. Just wondering why they were implemented in the first place.