Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zcommon: Refactor FPU state handling in fletcher4 #14600

Merged
merged 1 commit into from
Mar 14, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion include/zfs_fletcher.h
Original file line number Diff line number Diff line change
Expand Up @@ -126,8 +126,9 @@ typedef struct fletcher_4_func {
fletcher_4_fini_f fini_byteswap;
fletcher_4_compute_f compute_byteswap;
boolean_t (*valid)(void);
boolean_t uses_fpu;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that we could mark this structure as requiring 64-byte alignment so that subsequent accesses are L1 cache accesses.

As a more general change, we could also rename ->init_native and ->fini_native to ->init and ->fini respectively and drop the separate init_byteswap()/fini_byteswap() functions, since we only need to have different compute functions to support different endianness. That should be a separate commit, but it is entirely optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about that, it would squeeze the size of the struct below 64 bytes again, currently it is 72 bytes. Let me have a more thorough look tomorrow.

Copy link
Contributor

@ryao ryao Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fitting the entire structure in a cacheline does not matter that much here because the name field is not accessed during normal operation, so just fitting everything but that is fine. I just thought it would be nice to simplify this code because the extra init and fini functions make it slightly harder to understand.

Aligning the structure to a cacheline on the other hand should avoid another memory access whenever the fields needed during normal operation would have spanned two cachelines.

Copy link
Contributor Author

@AttilaFueloep AttilaFueloep Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove init_byteswap()/fini_byteswap() in a follow up PR. Just wondering why they were implemented in the first place.

const char *name;
} fletcher_4_ops_t;
} __attribute__((aligned(64))) fletcher_4_ops_t;

_ZFS_FLETCHER_H const fletcher_4_ops_t fletcher_4_superscalar_ops;
_ZFS_FLETCHER_H const fletcher_4_ops_t fletcher_4_superscalar4_ops;
Expand Down
19 changes: 11 additions & 8 deletions lib/libzfs/libzfs.abi
Original file line number Diff line number Diff line change
Expand Up @@ -578,13 +578,13 @@
<elf-variable-symbols>
<elf-symbol name='efi_debug' size='4' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_abd_ops' size='24' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_avx2_ops' size='64' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_avx512bw_ops' size='64' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_avx512f_ops' size='64' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_sse2_ops' size='64' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_ssse3_ops' size='64' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_superscalar4_ops' size='64' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_superscalar_ops' size='64' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_avx2_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_avx512bw_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_avx512f_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_sse2_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_ssse3_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_superscalar4_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='fletcher_4_superscalar_ops' size='128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='libzfs_config_ops' size='16' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='sa_protocol_names' size='16' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
<elf-symbol name='spa_feature_table' size='2128' type='object-type' binding='global-binding' visibility='default-visibility' is-defined='yes'/>
Expand Down Expand Up @@ -9053,7 +9053,7 @@
<typedef-decl name='fletcher_4_init_f' type-id='173aa527' id='b9ae1656'/>
<typedef-decl name='fletcher_4_fini_f' type-id='0ad5b8a8' id='c4c1f4fc'/>
<typedef-decl name='fletcher_4_compute_f' type-id='38147eff' id='ad1dc4cb'/>
<class-decl name='fletcher_4_func' size-in-bits='512' is-struct='yes' visibility='default' id='57f479a0'>
<class-decl name='fletcher_4_func' size-in-bits='1024' is-struct='yes' visibility='default' id='57f479a0'>
<data-member access='public' layout-offset-in-bits='0'>
<var-decl name='init_native' type-id='b9ae1656' visibility='default'/>
</data-member>
Expand All @@ -9076,6 +9076,9 @@
<var-decl name='valid' type-id='297d38bc' visibility='default'/>
</data-member>
<data-member access='public' layout-offset-in-bits='448'>
<var-decl name='uses_fpu' type-id='c19b74c3' visibility='default'/>
</data-member>
<data-member access='public' layout-offset-in-bits='512'>
<var-decl name='name' type-id='80f4b756' visibility='default'/>
</data-member>
</class-decl>
Expand Down
23 changes: 23 additions & 0 deletions module/zcommon/zfs_fletcher.c
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,7 @@ static const fletcher_4_ops_t fletcher_4_scalar_ops = {
.fini_byteswap = fletcher_4_scalar_fini,
.compute_byteswap = fletcher_4_scalar_byteswap,
.valid = fletcher_4_scalar_valid,
.uses_fpu = B_FALSE,
.name = "scalar"
};

Expand Down Expand Up @@ -458,9 +459,15 @@ fletcher_4_native_impl(const void *buf, uint64_t size, zio_cksum_t *zcp)
fletcher_4_ctx_t ctx;
const fletcher_4_ops_t *ops = fletcher_4_impl_get();

if (ops->uses_fpu == B_TRUE) {
kfpu_begin();
}
ops->init_native(&ctx);
ops->compute_native(&ctx, buf, size);
ops->fini_native(&ctx, zcp);
if (ops->uses_fpu == B_TRUE) {
kfpu_end();
}
}

void
Expand Down Expand Up @@ -500,9 +507,15 @@ fletcher_4_byteswap_impl(const void *buf, uint64_t size, zio_cksum_t *zcp)
fletcher_4_ctx_t ctx;
const fletcher_4_ops_t *ops = fletcher_4_impl_get();

if (ops->uses_fpu == B_TRUE) {
kfpu_begin();
}
ops->init_byteswap(&ctx);
ops->compute_byteswap(&ctx, buf, size);
ops->fini_byteswap(&ctx, zcp);
if (ops->uses_fpu == B_TRUE) {
kfpu_end();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kfpu_begin() should be called before init_byteswap() and kfpu_end() should be called after fini_byteswap(). That way the init and fini functions are allowed to use vector instructions. A future patch to compile GNU C Vector code likely will make use of this.

The same applies to the native versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to minimize the time we're running with interrupts and preemption disabled. So this is a trade-off between ease of future changes and an actual advantage. I'm fine either way, let's see what other reviewers think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

59493b6 already made a decent reduction in the total time that we spend with interrupts disabled not that long ago by not doing context save/restore multiple times per ABD. At the same time, it did kfpu_begin() before the rest of the init function and kfpu_end() after the rest of the fini function, which passed review.

An extra microsecond or two with interrupts disabled will not matter in the slightest since the systems we support do not give real time guarantees. If we were running on a system that tried to give real time guarantees and it were so pedantic to give guarantees that were accurate to a level where a microsecond or two difference would matter, we probably would not be allowed to run with interrupts disabled at all, since disabling interrupts to use SIMD could cause the real time guarantee to be violated. I am not sure if even real-time Linux is that strict.

Anyway, what is important here is not minimizing time with interrupts disabled, but minimizing the total time spent doing processing. If the interrupts are disabled slightly longer than absolutely necessary to reduce the total CPU time spent, then that is preferable. In this case, doing it this way is mostly to prepare for a future small reduction in CPU time. Since we know that there is a possibility of that here, I would prefer us to do it that way so that we do not need another patch. We really are not gaining anything from an extra microsecond or two with interrupts enabled per checksum operation, so there is no need to be frugal on that level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sound reasoning, will change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

void
Expand Down Expand Up @@ -661,6 +674,7 @@ fletcher_4_kstat_addr(kstat_t *ksp, loff_t n)
fletcher_4_fastest_impl.init_ ## type = src->init_ ## type; \
fletcher_4_fastest_impl.fini_ ## type = src->fini_ ## type; \
fletcher_4_fastest_impl.compute_ ## type = src->compute_ ## type; \
fletcher_4_fastest_impl.uses_fpu = src->uses_fpu; \
}

#define FLETCHER_4_BENCH_NS (MSEC2NSEC(1)) /* 1ms */
Expand Down Expand Up @@ -816,10 +830,14 @@ abd_fletcher_4_init(zio_abd_checksum_data_t *cdp)
const fletcher_4_ops_t *ops = fletcher_4_impl_get();
cdp->acd_private = (void *) ops;

if (ops->uses_fpu == B_TRUE) {
kfpu_begin();
}
if (cdp->acd_byteorder == ZIO_CHECKSUM_NATIVE)
ops->init_native(cdp->acd_ctx);
else
ops->init_byteswap(cdp->acd_ctx);

}

static void
Expand All @@ -833,8 +851,13 @@ abd_fletcher_4_fini(zio_abd_checksum_data_t *cdp)
ops->fini_native(cdp->acd_ctx, cdp->acd_zcp);
else
ops->fini_byteswap(cdp->acd_ctx, cdp->acd_zcp);

if (ops->uses_fpu == B_TRUE) {
kfpu_end();
}
}


static void
abd_fletcher_4_simd2scalar(boolean_t native, void *data, size_t size,
zio_abd_checksum_data_t *cdp)
Expand Down
3 changes: 1 addition & 2 deletions module/zcommon/zfs_fletcher_aarch64_neon.c
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,6 @@ ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_aarch64_neon_init(fletcher_4_ctx_t *ctx)
{
kfpu_begin();
memset(ctx->aarch64_neon, 0, 4 * sizeof (zfs_fletcher_aarch64_neon_t));
}

Expand All @@ -70,7 +69,6 @@ fletcher_4_aarch64_neon_fini(fletcher_4_ctx_t *ctx, zio_cksum_t *zcp)
8 * ctx->aarch64_neon[3].v[1] - 8 * ctx->aarch64_neon[2].v[1] +
ctx->aarch64_neon[1].v[1];
ZIO_SET_CHECKSUM(zcp, A, B, C, D);
kfpu_end();
}

#define NEON_INIT_LOOP() \
Expand Down Expand Up @@ -205,6 +203,7 @@ const fletcher_4_ops_t fletcher_4_aarch64_neon_ops = {
.compute_byteswap = fletcher_4_aarch64_neon_byteswap,
.fini_byteswap = fletcher_4_aarch64_neon_fini,
.valid = fletcher_4_aarch64_neon_valid,
.uses_fpu = B_TRUE,
.name = "aarch64_neon"
};

Expand Down
4 changes: 2 additions & 2 deletions module/zcommon/zfs_fletcher_avx512.c
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,6 @@ ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_avx512f_init(fletcher_4_ctx_t *ctx)
{
kfpu_begin();
memset(ctx->avx512, 0, 4 * sizeof (zfs_fletcher_avx512_t));
}

Expand Down Expand Up @@ -73,7 +72,6 @@ fletcher_4_avx512f_fini(fletcher_4_ctx_t *ctx, zio_cksum_t *zcp)
}

ZIO_SET_CHECKSUM(zcp, A, B, C, D);
kfpu_end();
}

#define FLETCHER_4_AVX512_RESTORE_CTX(ctx) \
Expand Down Expand Up @@ -166,6 +164,7 @@ const fletcher_4_ops_t fletcher_4_avx512f_ops = {
.fini_byteswap = fletcher_4_avx512f_fini,
.compute_byteswap = fletcher_4_avx512f_byteswap,
.valid = fletcher_4_avx512f_valid,
.uses_fpu = B_TRUE,
.name = "avx512f"
};

Expand Down Expand Up @@ -216,6 +215,7 @@ const fletcher_4_ops_t fletcher_4_avx512bw_ops = {
.fini_byteswap = fletcher_4_avx512f_fini,
.compute_byteswap = fletcher_4_avx512bw_byteswap,
.valid = fletcher_4_avx512bw_valid,
.uses_fpu = B_TRUE,
.name = "avx512bw"
};
#endif
Expand Down
3 changes: 1 addition & 2 deletions module/zcommon/zfs_fletcher_intel.c
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,6 @@ ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_avx2_init(fletcher_4_ctx_t *ctx)
{
kfpu_begin();
memset(ctx->avx, 0, 4 * sizeof (zfs_fletcher_avx_t));
}

Expand Down Expand Up @@ -82,7 +81,6 @@ fletcher_4_avx2_fini(fletcher_4_ctx_t *ctx, zio_cksum_t *zcp)
64 * ctx->avx[3].v[3];

ZIO_SET_CHECKSUM(zcp, A, B, C, D);
kfpu_end();
}

#define FLETCHER_4_AVX2_RESTORE_CTX(ctx) \
Expand Down Expand Up @@ -163,6 +161,7 @@ const fletcher_4_ops_t fletcher_4_avx2_ops = {
.fini_byteswap = fletcher_4_avx2_fini,
.compute_byteswap = fletcher_4_avx2_byteswap,
.valid = fletcher_4_avx2_valid,
.uses_fpu = B_TRUE,
.name = "avx2"
};

Expand Down
4 changes: 2 additions & 2 deletions module/zcommon/zfs_fletcher_sse.c
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,6 @@ ZFS_NO_SANITIZE_UNDEFINED
static void
fletcher_4_sse2_init(fletcher_4_ctx_t *ctx)
{
kfpu_begin();
memset(ctx->sse, 0, 4 * sizeof (zfs_fletcher_sse_t));
}

Expand Down Expand Up @@ -81,7 +80,6 @@ fletcher_4_sse2_fini(fletcher_4_ctx_t *ctx, zio_cksum_t *zcp)
8 * ctx->sse[2].v[1] + ctx->sse[1].v[1];

ZIO_SET_CHECKSUM(zcp, A, B, C, D);
kfpu_end();
}

#define FLETCHER_4_SSE_RESTORE_CTX(ctx) \
Expand Down Expand Up @@ -164,6 +162,7 @@ const fletcher_4_ops_t fletcher_4_sse2_ops = {
.fini_byteswap = fletcher_4_sse2_fini,
.compute_byteswap = fletcher_4_sse2_byteswap,
.valid = fletcher_4_sse2_valid,
.uses_fpu = B_TRUE,
.name = "sse2"
};

Expand Down Expand Up @@ -218,6 +217,7 @@ const fletcher_4_ops_t fletcher_4_ssse3_ops = {
.fini_byteswap = fletcher_4_sse2_fini,
.compute_byteswap = fletcher_4_ssse3_byteswap,
.valid = fletcher_4_ssse3_valid,
.uses_fpu = B_TRUE,
.name = "ssse3"
};

Expand Down
1 change: 1 addition & 0 deletions module/zcommon/zfs_fletcher_superscalar.c
Original file line number Diff line number Diff line change
Expand Up @@ -163,5 +163,6 @@ const fletcher_4_ops_t fletcher_4_superscalar_ops = {
.compute_byteswap = fletcher_4_superscalar_byteswap,
.fini_byteswap = fletcher_4_superscalar_fini,
.valid = fletcher_4_superscalar_valid,
.uses_fpu = B_FALSE,
.name = "superscalar"
};
1 change: 1 addition & 0 deletions module/zcommon/zfs_fletcher_superscalar4.c
Original file line number Diff line number Diff line change
Expand Up @@ -229,5 +229,6 @@ const fletcher_4_ops_t fletcher_4_superscalar4_ops = {
.compute_byteswap = fletcher_4_superscalar4_byteswap,
.fini_byteswap = fletcher_4_superscalar4_fini,
.valid = fletcher_4_superscalar4_valid,
.uses_fpu = B_FALSE,
.name = "superscalar4"
};