Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memcpy and co. are really unoptimized #339

Open
CryZe opened this issue Feb 7, 2020 · 14 comments
Open

memcpy and co. are really unoptimized #339

CryZe opened this issue Feb 7, 2020 · 14 comments

Comments

@CryZe
Copy link
Contributor

CryZe commented Feb 7, 2020

I've seen the assembly of the memcpy here for both powerpc and wasm (where the one here seems to be used unless you use the WASI target) and in both cases really simplistic byte wise loops are emitted. I haven't really done any benchmarks but these likely perform worse than more optimized loops that work on 32-bit or so at a time.

@zhaofengli
Copy link

Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs

@josephlr
Copy link
Contributor

josephlr commented May 18, 2020

Redox (MIT-licensed) provides better pure-Rust implementations of those functions at https://gitlab.redox-os.org/redox-os/kernel/-/blob/master/src/externs.rs

I would caution against using Redox's (current) memcpy implementation as it relies on undefined behavior (doing an unaligned load on a usize). See: https://gitlab.redox-os.org/redox-os/kernel/issues/99. This is UB on all platforms, even those that allow for unaligned loads, and will immediately fault on platforms that ban unaligned loads (like ARM).

If you need an optimized implementation, you should probably just link against a libc that provides them. Example of how to do this for musl on bare-metal x86_64: rust-osdev/cargo-xbuild#77 (comment)

@pca006132
Copy link
Contributor

May I ask that is there a way to specify other memcpy implementation? I'm on ARM platform, and the memcpy is not very efficient for my use case as I have many [u8] slices that have to be copied around.

I've tried to modify the target triple to not include the name -none to circumvent issue #369 , and managed to link against the ARM GCC Embedded Toolchain for the required memcpy and related symbols. However, the actual output is still linked to the compiler builtins implementation. It seems that the code to copy slices is still calling the __aeabi_* functions which uses the implementation defined in mem.rs.

#[cfg(not(target_os = "ios"))]
#[cfg_attr(not(feature = "mangled-names"), no_mangle)]
#[cfg_attr(thumb, linkage = "weak")]
pub unsafe extern "aapcs" fn __aeabi_memcpy(dest: *mut u8, src: *const u8, n: usize) {
::mem::memcpy(dest, src, n);
}

Related: #253

@josephlr
Copy link
Contributor

josephlr commented Oct 2, 2020

@pca006132 I think similar to #365, we could also add optimized variants of memcpy and friends for aarch64. I'm not sure what the best implementation would be for that architecture. Alternatively, the code in arm.rs could ink to the extern symbol memcpy instead of ::mem::memcpy, but improving the implementation in this crate is probably the best bet.

EDIT: It would seem that LLVM on AArch64 emits calls to __aeabi_memcpy, so there's not much we can do about that.

@pca006132
Copy link
Contributor

@pca006132 I think similar to #365, we could also add optimized variants of memcpy and friends for aarch64. I'm not sure what the best implementation would be for that architecture. Alternatively, the code in arm.rs could ink to the extern symbol memcpy instead of ::mem::memcpy, but improving the implementation in this crate is probably the best bet.

EDIT: It would seem that LLVM on AArch64 emits calls to __aeabi_memcpy, so there's not much we can do about that.

It is good to have optimized variants for some more targets. However, I think embedded systems may have some different requirements than normal applications, and the best way would be to allow them to provide their own implementation if they have to. For example, for armv7a processors, we could use NEON optimized memcpy if we enabled the FPU, but there may be bare-metal programs running before enabling the FPU, so probably they want an unoptimized memcpy implementation which does not use NEON.

I think the easiest way is to use weak linkage, so users can replace it with their own implementation if they want, such as using the implementation from newlib. #378

@Demindiro
Copy link
Contributor

I've experimented a bit with faster mem* functions on a Ryzen 2700X and a i5-5287U and noted the following:

  • For small copies/sets it is significantly faster to do two unaligned load stores like so:
#[inline]
unsafe fn small_copy(dest: *mut u8, src: *const u8, count: usize) {
    if count < 2 {
        *dest = *src;
        return
    }
    if count <= 4 {
        let a = src.cast::<u16>().read_unaligned();
        let b = src.add(count - 2).cast::<u16>().read_unaligned();
        dest.cast::<u16>().write_unaligned(a);
        dest.add(count - 2).cast::<u16>().write_unaligned(b);
        return
    }
    if count <= 8 {
        let a = src.cast::<u32>().read_unaligned();
        let b = src.add(count - 4).cast::<u32>().read_unaligned();
        dest.cast::<u32>().write_unaligned(a);
        dest.add(count - 4).cast::<u32>().write_unaligned(b);
        return
    }
    if count <= 16 {
        let a = src.cast::<u64>().read_unaligned();
        let b = src.add(count - 8).cast::<u64>().read_unaligned();
        dest.cast::<u64>().write_unaligned(a);
        dest.add(count - 8).cast::<u64>().write_unaligned(b);
        return
    }
    if count <= 32 {
        let a = src.cast::<u128>().read_unaligned();
        let b = src.add(count - 16).cast::<u128>().read_unaligned();
        dest.cast::<u128>().write_unaligned(a);
        dest.add(count - 16).cast::<u128>().write_unaligned(b);
        return
    }
}
  • For large copies it is faster to use non-temporal stores. This is probably correlated with L2 cache size. This does not seem to apply to rep stos.
  • It's probably worth checking & caching CPUID for ERMS, FSRM et al. and use only rep movsb when applicable.

My main concern is that these optimizations can significantly bloat the mem* functions, so while the performance may be better in benchmarks it may be worse in practice since it uses much more i-cache space.

@pca006132
Copy link
Contributor

iirc LLVM already emits small copy operation as a set of load/store instead of calling memcpy. Not sure what is the threshold though.

@Demindiro
Copy link
Contributor

iirc LLVM already emits small copy operation as a set of load/store instead of calling memcpy. Not sure what is the threshold though.

AFAICT it only does if it knows the size beforehand: https://rust.godbolt.org/z/KK7xj6Goh

@pca006132
Copy link
Contributor

I think using the implementation in glibc is probably a better choice here, they are well optimized and are already using things like vectorization etc.

@bjorn3
Copy link
Member

bjorn3 commented Jul 7, 2022

For large copies it is faster to use non-temporal stores.

Non-temporal stores bypass the cache, so it is expected that they are faster. Unfortunately this also means that unless you take care to ensure that the target of the write wasn't in the cache in the first place, you can get inconsistent results as future reads may read it from the cache again and I think if the cacheline is marked as modified that flushing it would overwrite the non-temporal store.

I think using the implementation in glibc is probably a better choice here, they are well optimized and are already using things like vectorization etc.

  1. Glibs is LGPL licensed.
  2. These memory operation implementations in compiler-builtins are only used on embedded systems where size is important and on wasm where you won't be able to vectorize much anyway. (The simd feature only gives 128bit vectors and in most cases rust wasm programs are compiled without simd support.)

@josephlr
Copy link
Contributor

We could also add an optimized version using RISC-V's Vector Extension (aka the V Extension). This idea was explored in this Reddit post and this writeup. It seems fairly fast and simple.

If we do:

asm!(
  "vsetvli  {n}, {count}, e8, m8, ta, ma",
  "vle8.v   v0, ({src})",
  "vse8.v   v0, ({dest})",
  ...
);

in a loop, we get assembly (Godbolt link) that is nearly identical to the handwritten example of a memcpy implementation in the spec.

I'm not super familiar with the V extension, but I imagine we could have similar implementations for the other memory functions.

@bjorn3
Copy link
Member

bjorn3 commented Jul 21, 2022

It would be nice if the loop itself could be inside the asm block. Codegen backends which don't have native inline asm support (like my cg_clif) turn asm blocks into calls to a function defined in assembly compiled using an external assembler. Doing such calls in a loop has a decent amount of overhead compared to doing it once and having the loop inside the asm block.

@Lokathor
Copy link
Contributor

I've recently written some ARM memcpy implementations. Hopefully they can be PR'd into this repo at some point. The only catch is that they're handwritten a32 code and some of the newer ARM targets don't even support using a32 code at all, so they can't be instantly dropped in for all ARM targets.

if anyone wants to get started the code is over here (and all Apache-2.0 OR MIT just like normal for rust): https://github.com/Lokathor/arm7tdmi_aeabi/blob/b0861b2ec026eaad0142d75b138a2e02c6f38cb9/src/the_code.s

sethp added a commit to rustbox/esp-hal that referenced this issue May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds a lot of frequent cache pressure for minimal
wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
sethp added a commit to rustbox/esp-hal that referenced this issue May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds a lot of frequent cache pressure for minimal
wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
sethp added a commit to rustbox/esp-hal that referenced this issue May 14, 2023
Addresses two classes of icache thrash present in the interrupt service
path, e.g.:

```asm
            let mut prios = [0u128; 16];
40380d44:       ec840513                addi    a0,s0,-312
40380d48:       10000613                li      a2,256
40380d4c:       ec840b93                addi    s7,s0,-312
40380d50:       4581                    li      a1,0
40380d52:       01c85097                auipc   ra,0x1c85
40380d56:       11e080e7                jalr    286(ra) # 42005e70 <memset>
```

and

```asm
            prios
40380f9c:       dc840513                addi    a0,s0,-568
40380fa0:       ec840593                addi    a1,s0,-312
40380fa4:       10000613                li      a2,256
40380fa8:       dc840493                addi    s1,s0,-568
40380fac:       01c85097                auipc   ra,0x1c85
40380fb0:       eae080e7                jalr    -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit
calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. It seems like
the expectation is that the compiler-builtins versions are "reasonable
defaults," and they are [weakly linked] specifically to allow the kind
of domain-specific overrides as here.

In the context of the 'c3, this ends up producing a fairly large
implementation that adds a lot of frequent cache pressure for minimal
wins:

```readelf
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
 27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
 28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
 28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.

Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline
implementation that seems to predict all branches as "never taken."

However: there is a real danger in overtraining on the microbenchmarks here, too,
as I would expect almost no one has a program whose runtime is dominated
by these functions. Making these functions larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing e.g. dead code
elimination for always-aligned addresses or automatic loop unrolling,
etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
@ChrisDenton
Copy link
Member

For very small memcpys, could we not just use a couple of registers like llvm does? This should not much affect binary size, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants