Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize memcpy, memmove and memset #405

Merged
merged 5 commits into from
Aug 31, 2021
Merged

Conversation

nbdd0121
Copy link
Contributor

@nbdd0121 nbdd0121 commented Feb 9, 2021

Partially addresses #339. (memcmp is not implemented in this PR)

This PR modifies the current simple implementation of memcpy, memmove and memset with a more sophisticated one.
It will first align dest to machine-word boundary using bytewise copy.
If dest and src are co-aligned (i.e. dest and src address are congruent modular machine-word size), it will proceed to use machine-word-wide copy to copy as many bytes as possible. Remaining bytes are copied using bytewise copy.
If dest and src are not co-aligned, misaligned copying is performed by reading two adjacent machine words and assemble them together with logical shifts before writing to dest.

The existing implementation can be pretty fast on platforms with SIMD and vectorization, but falls short for those without. The code in this PR is carefully written so it is fast on those platforms while is still vectorizable.

Here are some performance numbers:

Rust Builtin
test memcpy_builtin_1048576        ... bench:      33,388 ns/iter (+/- 2,619) = 31405 MB/s
test memcpy_builtin_1048576_offset ... bench:      33,988 ns/iter (+/- 2,371) = 30851 MB/s
test memcpy_builtin_4096           ... bench:          37 ns/iter (+/- 2) = 110702 MB/s
test memcpy_builtin_4096_offset    ... bench:          37 ns/iter (+/- 1) = 110702 MB/s
test memmove_builtin_1048576       ... bench:      28,862 ns/iter (+/- 1,013) = 36330 MB/s
test memmove_builtin_4096          ... bench:          38 ns/iter (+/- 1) = 107789 MB/s
test memset_builtin_1048576        ... bench:      21,171 ns/iter (+/- 1,369) = 49528 MB/s
test memset_builtin_1048576_offset ... bench:      21,516 ns/iter (+/- 906) = 48734 MB/s
test memset_builtin_4096           ... bench:          62 ns/iter (+/- 0) = 66064 MB/s
test memset_builtin_4096_offset    ... bench:          68 ns/iter (+/- 2) = 60235 MB/s
New Implementation
test memcpy_rust_1048576           ... bench:      40,900 ns/iter (+/- 2,602) = 25637 MB/s
test memcpy_rust_1048576_offset    ... bench:      46,575 ns/iter (+/- 2,994) = 22513 MB/s
test memcpy_rust_4096              ... bench:          76 ns/iter (+/- 4) = 53894 MB/s
test memcpy_rust_4096_offset       ... bench:          95 ns/iter (+/- 3) = 43115 MB/s
test memmove_rust_1048576          ... bench:      50,363 ns/iter (+/- 2,829) = 20820 MB/s
test memmove_rust_4096             ... bench:         176 ns/iter (+/- 6) = 23272 MB/s
test memset_rust_1048576           ... bench:      27,171 ns/iter (+/- 1,168) = 38591 MB/s
test memset_rust_1048576_offset    ... bench:      36,623 ns/iter (+/- 1,418) = 28631 MB/s
test memset_rust_4096              ... bench:          74 ns/iter (+/- 5) = 55351 MB/s
test memset_rust_4096_offset       ... bench:          95 ns/iter (+/- 2) = 43115 MB/s
Old Implementation
test memcpy_rust_1048576           ... bench:      43,626 ns/iter (+/- 5,963) = 24035 MB/s
test memcpy_rust_1048576_offset    ... bench:      47,138 ns/iter (+/- 4,029) = 22244 MB/s
test memcpy_rust_4096              ... bench:          73 ns/iter (+/- 3) = 56109 MB/s
test memcpy_rust_4096_offset       ... bench:          90 ns/iter (+/- 3) = 45511 MB/s
test memmove_rust_1048576          ... bench:     305,743 ns/iter (+/- 12,242) = 3429 MB/s
test memmove_rust_4096             ... bench:       1,071 ns/iter (+/- 60) = 3824 MB/s
test memset_rust_1048576           ... bench:      28,485 ns/iter (+/- 2,085) = 36811 MB/s
test memset_rust_1048576_offset    ... bench:      36,999 ns/iter (+/- 316) = 28340 MB/s
test memset_rust_4096              ... bench:          69 ns/iter (+/- 0) = 59362 MB/s
test memset_rust_4096_offset       ... bench:          90 ns/iter (+/- 9) = 45511 MB/s
New, with no-vectorize-loops
test memcpy_rust_1048576           ... bench:      70,713 ns/iter (+/- 24,368) = 14828 MB/s
test memcpy_rust_1048576_offset    ... bench:      68,340 ns/iter (+/- 5,622) = 15343 MB/s
test memcpy_rust_4096              ... bench:         272 ns/iter (+/- 15) = 15058 MB/s
test memcpy_rust_4096_offset       ... bench:         279 ns/iter (+/- 7) = 14681 MB/s
test memmove_rust_1048576          ... bench:      68,726 ns/iter (+/- 3,034) = 15257 MB/s
test memmove_rust_4096             ... bench:         275 ns/iter (+/- 10) = 14894 MB/s
test memset_rust_1048576           ... bench:      42,382 ns/iter (+/- 1,542) = 24741 MB/s
test memset_rust_1048576_offset    ... bench:      42,585 ns/iter (+/- 1,722) = 24623 MB/s
test memset_rust_4096              ... bench:         145 ns/iter (+/- 5) = 28248 MB/s
test memset_rust_4096_offset       ... bench:         146 ns/iter (+/- 7) = 28054 MB/s
Old, with no-vectorize-loops
test memcpy_rust_1048576           ... bench:     306,552 ns/iter (+/- 19,395) = 3420 MB/s
test memcpy_rust_1048576_offset    ... bench:     311,564 ns/iter (+/- 14,247) = 3365 MB/s
test memcpy_rust_4096              ... bench:       1,076 ns/iter (+/- 63) = 3806 MB/s
test memcpy_rust_4096_offset       ... bench:       1,074 ns/iter (+/- 30) = 3813 MB/s
test memmove_rust_1048576          ... bench:     315,208 ns/iter (+/- 158,448) = 3326 MB/s
test memmove_rust_4096             ... bench:       1,077 ns/iter (+/- 33) = 3803 MB/s
test memset_rust_1048576           ... bench:     303,456 ns/iter (+/- 7,937) = 3455 MB/s
test memset_rust_1048576_offset    ... bench:     304,886 ns/iter (+/- 21,495) = 3439 MB/s
test memset_rust_4096              ... bench:       1,070 ns/iter (+/- 30) = 3828 MB/s
test memset_rust_4096_offset       ... bench:       1,064 ns/iter (+/- 99) = 3849 MB/s

All above numbers are tested on my Coffee Lake laptop, with no-asm feature on.

The number shows that if vectorization is enabled or supported on the platform, then memcpy and memset performance of the old and the new implementation are similar (while much faster on memmove, it seems the compiler had a hard time vectorizing backward copy). If there is no vectorization or the platform does not supported misaligned access, the new implementation will have very significant performance gain (>4x with the number above, and on in-order scaler cores you'd expect 3~8x improvement depending on whether access is aligned and the width of usize).

I inspected the code generated assembly for RISC-V and it seems pretty close in quality to my hand optimsied assembly.

@nbdd0121 nbdd0121 changed the title Add test cases for memcpy, memmove and memset for different alignment Optimize memcpy, memmove and memset Feb 9, 2021
@nbdd0121
Copy link
Contributor Author

nbdd0121 commented Feb 9, 2021

Oh, one thing to note is that the misaligned code currently makes some accesses that might be UB. For example if we are trying to copy from 0x10007-0x100017, then the code will access the entire machine-word containing the byte 0x10007 (thus also 0x10000-0x10006) and the machine-word containing 0x10010-0x10016 (thus 0x100017) which are out of the bound supplied to memcpy/memmove.

I don't think that is a UB ever exploitable by compiler though, and I am not aware of any architecture where such accesses could cause trouble.

@nbdd0121
Copy link
Contributor Author

nbdd0121 commented Feb 9, 2021

Oh big-endian ISAs 🤦, will fix sooon

@bjorn3
Copy link
Member

bjorn3 commented Feb 9, 2021

I don't think that is a UB ever exploitable by compiler though, and I am not aware of any architecture where such accesses could cause trouble.

On the M1 I believe one of the exploit protection mechanisms may cause such out-of-bounds accesses to crash the program under certain circumstances. I forgot the name of the mechanism though, so I can't check if I remember it correctly.

@nbdd0121
Copy link
Contributor Author

nbdd0121 commented Feb 9, 2021

On the M1 I believe one of the exploit protection mechanisms may cause such out-of-bounds accesses to crash the program under certain circumstances. I forgot the name of the mechanism though, so I can't check if I remember it correctly.

Hmm, interesting. Platforms with these kind of protection features I know of usually track at allocation and deallocation, and these memory regions will be machine-word aligned. The oob access here is strictly just accessing the machine-word that contains an accessible byte.

@nbdd0121
Copy link
Contributor Author

nbdd0121 commented Feb 9, 2021

Anyway I could force at least 7 bytes to be copied bytewise before and after the unaligned copy, and doing so would be UB-free. But that would hurt other platforms where this specific kind of OOB access is permitted.

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been tested on architectures that we primarily would like to optimize for? I think it would be good to have a candidate set of target architectures in mind to compare with LLVM's auto-vectorization and alternative implementations such as unaligned loads/stores.

src/mem/impls.rs Outdated Show resolved Hide resolved
src/mem/impls.rs Show resolved Hide resolved
src/mem/impls.rs Outdated
while i < n {
*dest.add(i) = *src.add(i);
i += 1;
unsafe fn copy_forward_aligned_words(dest: *mut u8, src: *const u8, n: usize) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function is taking aligned words, could this perhaps take usize pointers? Additionally should this take u64 pointers perhaps? Or maybe even u128?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, but I uses u8 pointer to keep the signature identical to other functions.

I thought about using u64 or u128 here but it turns out to be bad for a few reason:

  • Many platforms, especially those which don't have vectorization, can do at most usize load per instruction.
  • It requires more alignment than usize, so more bytes have to be copied with bytewise copy.
  • It makes the misalignment scenario bad. Shifting and reassemble 2usize actually generates worse code than 1usize.

However, one optimisation that I haven't include in this PR is to unroll the loop for the aligned scenario. If n is a multiple of 8*usize for example, then we can have a 8-unrolled loop followed by a loop for the remainder. On clang this could be done easily with #pragma unroll but I don't think it is currently possible in Rust so we might have to be it manually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like taking usize here would make the intent clearer where it's copying based on each word. Perhaps there could be a typedef in the module for the copy size? Some platforms may have better performance on some types than others, so I could see it being conditional as well.

For unrolling I would expect LLVM to do this automatically if supported.

src/mem/impls.rs Outdated Show resolved Hide resolved
if likely(n >= WORD_COPY_THRESHOLD) {
// Align dest
// Because of n >= 2 * WORD_SIZE, dst_misalignment < n
let dest_misalignment = (dest as usize).wrapping_neg() & WORD_MASK;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to use the align_offset method here to avoid bit-tricks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From align_offset's doc:

If it is not possible to align the pointer, the implementation returns usize::MAX. It is permissible for the implementation to always return usize::MAX. Only your algorithm's performance can depend on getting a usable offset here, not its correctness.

So that rules out the use of align_offset. Also note that this is the very hot path and I would prefer simple bit trcisk rather to rely on LLVM to optimise out a very complex function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try benchmarking to see if there is overhead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not related to performance, it could simply not be used for correctness.

It is permissible for the implementation to always return usize::MAX.

if likely(src_misalignment == 0) {
copy_forward_aligned_words(dest, src, n_words);
} else {
copy_forward_misaligned_words(dest, src, n_words);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, have you tested simply copying using ptr::read_unaligned and ptr::write_unaligned? That way alignment wouldn't be an issue, but I don't know the performance impact that would have on various platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, first of all ptr::read_unaligned and ptr::write_aligned calls ptr::copy_nonoverlapping which translates to memcpy, so I would like to avoid the possibility of an recursion if LLVM doesn't optimise them away.

Secondly, ptr::read_unaligned has really poor performance. On platforms without misaligned memory access support it will translate to size_of<usize>() number of byte loads.

This branch is necessary because we don't want to bear the burden of all the shifts and checks necessary for misaligned loads if dest and src are perfectly co-aligned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried this out and seen infinite recursion? Have you tried it out and seen if it's slower?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There'll be an infinite recursion if compiled in debug mode.

On architectures that does not support misaligned loads (so most ISAs other than armv8 and x86/64) the performance is much slower because it generates 8 byte load and 16 bit ops rather than 1 word load and 3 bit ops.

src/mem/impls.rs Outdated Show resolved Hide resolved
src/mem/impls.rs Outdated
while dest < dest_end {
*dest = *src;
dest = dest.add(1);
src = src.add(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general LLVM has a bunch of heuristics for auto-vectorizing loops like these, and the auto-vectorized code is probably faster as well.

LLVM doesn't auto-vectorize on all platforms, though. What I'm worried about though is that this is hurting performance on platforms that auto-vectorize. Would it be possible to survey some platforms to see if the simple function implementations are enough with LLVM auto-vectorizing them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check the benchmarks on x86-64. The code submitted in this PR is actually my second attempt, my first attempt ended up in poor performance because it does not vectorize very well.

The current design that computes dest_end inside these helper functions are a result of learning from the failed attempt. On platforms that vectorizes, the code is simple enough to be detected by LLVM so it effectively treat it as the while i < n type of loop so there won't be a regression, while on those that does not vectorize LLVM wouldn't bookkeepthe i variable (yes, LLVM couldn't optimise away i) so that's like 20% performance improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the purpose of benchmarking on x86_64? This file isn't even used on that platform and it seems like we wouldn't want to use it anyway because LLVM otherwise auto-vectorizes very well and probably outperforms what's in this file.

I'm interested in getting data on this for platforms other than x86_64 for those reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I turn off auto vectorization to mimic platforms without auto-vectorization. I did observe very significant improvement on RISC-V but I couldn't just do cargo bench on a no_std platform. Even on x86-64 the code still provides significant performance improvement for backward copy though.

@nbdd0121
Copy link
Contributor Author

Has this been tested on architectures that we primarily would like to optimize for? I think it would be good to have a candidate set of target architectures in mind to compare with LLVM's auto-vectorization and alternative implementations such as unaligned loads/stores.

I have tested it on RISC-V. The RISC-V platform that I tested it on is no_std so there isn't a bencher but it speeds up my entire program by 25%.

It could hurt platforms that have both vectorization and hardware misaligned load support, like x86-64 and ARM, but I am not sure what is the best to do there, e.g. how to detect if vectorization is on and how to make use of unaligned load support.

BTW I just realized the benches do not test misaligned scenario. I'll add a benchmark soon.

@nbdd0121
Copy link
Contributor Author

Misaligned benchmarks:

Rust Builtin
test memcpy_builtin_1048576_misalign  ... bench:      33,722 ns/iter (+/- 1,310) = 31094 MB/s
test memcpy_builtin_4096_misalign     ... bench:          38 ns/iter (+/- 1) = 107789 MB/s
test memmove_builtin_1048576_misalign ... bench:      34,791 ns/iter (+/- 1,177) = 30139 MB/s
test memmove_builtin_4096_misalign    ... bench:          38 ns/iter (+/- 1) = 107789 MB/s
New Implementation
test memcpy_rust_1048576_misalign     ... bench:      56,567 ns/iter (+/- 1,812) = 18536 MB/s
test memcpy_rust_4096_misalign        ... bench:         215 ns/iter (+/- 5) = 19051 MB/s
test memmove_rust_1048576_misalign    ... bench:     136,350 ns/iter (+/- 5,338) = 7690 MB/s
test memmove_rust_4096_misalign       ... bench:         550 ns/iter (+/- 27) = 7447 MB/s
Old Implementation
test memcpy_rust_1048576_misalign     ... bench:      49,171 ns/iter (+/- 3,514) = 21325 MB/s
test memcpy_rust_4096_misalign        ... bench:         108 ns/iter (+/- 5) = 37925 MB/s
test memmove_rust_1048576_misalign    ... bench:     306,153 ns/iter (+/- 10,100) = 3425 MB/s
test memmove_rust_4096_misalign       ... bench:       1,070 ns/iter (+/- 31) = 3828 MB/s
New, with no-vectorize-loops
test memcpy_rust_1048576_misalign     ... bench:     138,407 ns/iter (+/- 5,799) = 7576 MB/s
test memcpy_rust_4096_misalign        ... bench:         544 ns/iter (+/- 41) = 7529 MB/s
test memmove_rust_1048576_misalign    ... bench:     139,043 ns/iter (+/- 5,207) = 7541 MB/s
test memmove_rust_4096_misalign       ... bench:         551 ns/iter (+/- 8) = 7433 MB/s
Old, with no-vectorize-loops
test memcpy_rust_1048576_misalign     ... bench:     308,055 ns/iter (+/- 5,330) = 3403 MB/s
test memcpy_rust_4096_misalign        ... bench:       1,073 ns/iter (+/- 13) = 3817 MB/s
test memmove_rust_1048576_misalign    ... bench:     308,875 ns/iter (+/- 27,410) = 3394 MB/s
test memmove_rust_4096_misalign       ... bench:       1,074 ns/iter (+/- 36) = 3813 MB/s

I expected the misaligned code path hurts x86-64. It is slower with vectorization on but faster with vectorization off. It is always faster in memmove, presumably because LLVM couldn't vectorize it.

@alexcrichton
Copy link
Member

Indeed yeah I'm specifically worried where this makes existing platforms worse because LLVM otherwise does something well today. This whole PR should arguably be replaced with an improved LLVM backend for relevant targets, but that's naturally a very large undertaking so it seems fine to have this in the meantime while we wait for LLVM to improve.

I would prefer to not have a too-overly-complicated implementation since it seems likely to only get used on some platforms. Having a complicated implementation is also a risk because I don't think anything here is tested on CI since CI mostly does x86_64.

Basically I think that for this there should at least be peformance numbers for ARM/AArch64 and investigation into what LLVM does today in terms of vectorization and performance of unaligned loads/stores. Additionally it would be great to ensure that everything here is tested on CI regardless of the platform if possible.

@nbdd0121
Copy link
Contributor Author

I would argue this isn't overly complicated. The concept is very simple and is what C libraries do already.

Benchmarking on ARM/AArch64 might not be very helpful because there are both vectorization and misaligned load/store support, but I would still expect memmove performance improvement (I could test it myself because I don't have a ARM dev environment setup). This PR is mainly intended to be a much better default for those without (a quick test on godbolt suggests this includes armv7, powerpc, mips, riscv and more).

As for the CI, it should be noted a few platforms are already being tested (including big-endian ones like PowerPC).

@adamgemmell
Copy link

adamgemmell commented Aug 20, 2021

Hi, I've carried out some benchmarks on an aarch64-unknown-linux-gnu platform (Graviton 2, Neoverse-N1) to alleviate some concerns. I rebased the branch to something more recent before carrying these out.

(all numbers ns/iter. +/- ranges were never more than 1% of the result)

Test Name Result
memcpy_builtin_104857 45562
memcpy_builtin_1048576_misalign 47097
memcpy_builtin_1048576_offset 47052
memcpy_builtin_4096 108
memcpy_builtin_4096_misalign 111
memcpy_builtin_4096_offset 111
memmove_builtin_1048576 42840
memmove_builtin_1048576_misalign 62671
memmove_builtin_4096 109
memmove_builtin_4096_misalign 161
memset_builtin_1048576 26245
memset_builtin_1048576_offset 26242
memset_builtin_4096 105
memset_builtin_4096_offset 106
Test Name New New (no-vec) Old Old (no-vec)
memcpy_rust_1048576 47317 108806 46978 709112
memcpy_rust_1048576_misalign 78310 148440 56356 708612
memcpy_rust_1048576_offset 50758 109122 57212 708791
memcpy_rust_4096 113 421 109 2766
memcpy_rust_4096_misalign 287 573 136 2768
memcpy_rust_4096_offset 121 426 136 2768
memmove_rust_1048576 105688 109360 840238 630534
memmove_rust_1048576_misalign 151630 151128 840360 630725
memmove_rust_4096 392 421 3288 2469
memmove_rust_4096_misalign 573 568 3288 2469
memset_rust_1048576 26252 104967 26246 419879
memset_rust_1048576_offset 26257 104960 26247 419857
memset_rust_4096 109 420 106 1658
memset_rust_4096_offset 114 422 106 1654

The short story is the results look similar to the ones observed above on x86_64.

  • memcpy and memset are largely the same, with nice improvements for non-vectorised code. Performance reaches that of the builtins in at least some cases.
  • memmove shows large improvments. Non-vectorised performance seems comparable at least for these tests.
  • Misaligned accesses have a performance regression, though it's smaller than the improvement in memmove.

@nbdd0121
Copy link
Contributor Author

I suppose I should conditionally select different versions of code for misaligned case. For architectures (e.g. x86/ARM) that have hardware misaligned access support, it might be better to just do misaligned load instead of trying to do bitshifts.

@nbdd0121
Copy link
Contributor Author

@adamgemmell I have added a different code path for misaligned case for x86/x64/aarch64. Can you do another bench run? Thanks!

@adamgemmell
Copy link

Same machine, same branch base, same compiler (which I forgot to mention was 1.56.0-nightly (2faabf579 2021-07-27)). Master runs looked close to what I was getting previously.

Test With misaligned case With MC (no-vec)
memcpy_rust_1048576 46949 109190
memcpy_rust_1048576_misalign 55425 104213
memcpy_rust_1048576_offset 50890 109193
memcpy_rust_4096 114 421
memcpy_rust_4096_misalign 144 398
memcpy_rust_4096_offset 123 427
memmove_rust_1048576 104361 109690
memmove_rust_1048576_misalign 106661 105638
memmove_rust_4096 393 421
memmove_rust_4096_misalign 411 415
memset_rust_1048576 26254 52526
memset_rust_1048576_offset 26262 52510
memset_rust_4096 110 220
memset_rust_4096_offset 114 222

Looks like a good change, really brings the misaligned numbers in line with the other tests.

There's a weird doubling of performance for the memset no-vec case on this machine. Not complaining but I kinda want to look more into it tomorrow. Couldn't replicate something similar on other machines I tried.

@Amanieu
Copy link
Member

Amanieu commented Aug 30, 2021

You can get even better performance with unaligned memory accesses by overlapping the accesses:

  • This is easy with memset: a 7-byte memset can be done with 2 4-byte stores at [start] and [end - 4].
  • Small memcpy can be done by reading the entire data into registers with overlapping loads and then writing back everything with overlapping stores. This also allows you to skip the direction check for small memcpy since the direction becomes irrelevant.
  • For larger memcpy you can perform the first copy as a large unaligned copy and the rest using aligned copies. The trick here to avoid overlap issues is to load the data for the first aligned copy before performing the store for the initial unaligned copy. At the end of the main aligned copy loop, do an unaligned copy for the last block of data.

I recommend reading through these optimized implementations of memcpy and memset for AArch64 which use all of these tricks:

For platforms such as RISC-V without support for unaligned accesses, I'm surprised that copy_forward_misaligned_words is faster than a simple byte-by-byte copy. But if you've run the benchmarks then I guess it's fine.

@nbdd0121
Copy link
Contributor Author

You can get even better performance with unaligned memory accesses by overlapping the accesses:

  • This is easy with memset: a 7-byte memset can be done with 2 4-byte stores at [start] and [end - 4].

Small memset/memcpy/memmove isn't particularly of concern. Sure, there might be some improvements possible, but the improvement is not as significant as improvements of large memset/memcpys. Plus, if size is known, the compiler will inline them anyway.

  • Small memcpy can be done by reading the entire data into registers with overlapping loads and then writing back everything with overlapping stores. This also allows you to skip the direction check for small memcpy since the direction becomes irrelevant.
  • you can perform the first copy as a large unaligned copy and the rest using aligned copies. The trick here to avoid overlap issues is to load the data for the first aligned copy before performing the store for the initial unaligned copy. At the end of the main aligned copy loop, do an unaligned copy for the last block of data.

I don't understand how overlapping is an issue for memcpy. Do you mean memmove? For the trick to work I believe the gap between src and dst must be small.

I recommend reading through these optimized implementations of memcpy and memset for AArch64 which use all of these tricks:

The tricks applied can slow down some processors while speed up others. For example, use of branches in small memset and memcpy will slow down those with simple branch predictors because there are many data-dependent branches. On the other hand loops are much easier to predict.

For platforms such as RISC-V without support for unaligned accesses, I'm surprised that copy_forward_misaligned_words is faster than a simple byte-by-byte copy. But if you've run the benchmarks then I guess it's fine.

The number of instructions reduce from 8 byte-size load + 8 byte-size store per word copy to 1 word-size load + 1 word-size store + 3 arithmetic ops per word copy. LLVM actually wouldn't even unroll the loop, so loop overhead would hit bytewise copy even further.

While some tricks you described are interesting, my main focus is on RISC-V which wouldn't benefit from those optimisations and might not even test those code paths. I am afraid it's outside my capacity to optimise for AArch64.

src/mem/impls.rs Outdated Show resolved Hide resolved
@nbdd0121
Copy link
Contributor Author

Failed at testcrate/tests/cmp.rs:34:9. I am not sure I touched anything related to that between recent 2 pushes. Might be a change made in recent nightly?

@nbdd0121
Copy link
Contributor Author

Tried to run the workflow with current HEAD of master on my fork. Confirmed that it failed https://github.com/nbdd0121/compiler-builtins/actions/runs/1184124369.

@Amanieu
Copy link
Member

Amanieu commented Aug 31, 2021

I had a look at the powerpc failure and it seems to be an LLVM bug (I'm still working on a minimal reproduction). In the meantime I think it's fine to merge this PR.

@Amanieu Amanieu merged commit 461b5f8 into rust-lang:master Aug 31, 2021
@Amanieu
Copy link
Member

Amanieu commented Aug 31, 2021

Published in compiler-builtins 0.1.50.

sethp added a commit to rustbox/esp-hal that referenced this pull request May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds a lot of frequent cache pressure for minimal
wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
sethp added a commit to rustbox/esp-hal that referenced this pull request May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds a lot of frequent cache pressure for minimal
wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
sethp added a commit to rustbox/esp-hal that referenced this pull request May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds a lot of frequent cache pressure for minimal
wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
@RalfJung
Copy link
Member

RalfJung commented Jan 5, 2024

Oh, one thing to note is that the misaligned code currently makes some accesses that might be UB.

They are definitely UB, and IMO this needs to be fixed. Having UB in our own runtime while we tell people "never ever have UB anywhere" is not a good look.
Tracked at #559.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants