`ptr::align_offset` generates surprisingly bad code #75579
FWIW this is something we'd like to fix (rust-lang/miri#1074).
Most of the function is constant-evaluable (with the only unknown usually being the pointer value), so there is no reason it could not have a special case (based on input types) for situations where a more optimal implementation is possible.
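As a concrete illustration of such a special case (a sketch, not the actual library code): for stride-1 pointers the required offset is simply `-p mod a`, which for a power-of-two alignment `a` reduces to a negate-and-mask:

```rust
/// Offset needed to align address `addr` to (power-of-two) alignment `a`
/// when the pointee stride is 1: computes -addr mod a.
fn align_offset_stride1(addr: usize, a: usize) -> usize {
    debug_assert!(a.is_power_of_two());
    // Wrapping negation, then keep only the low bits below `a`.
    addr.wrapping_neg() & (a - 1)
}
```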
Improve codegen for `align_offset`

In this PR the `align_offset` implementation is changed/improved to produce better code in certain scenarios, such as when the pointer type has a stride of 1 or when building at low optimisation levels. While these changes do not achieve the "ideal" codegen referenced in rust-lang#75579, they get significantly closer to it. I'm not sure the codegen can get much better while this function returns the offset rather than the aligned pointer. See the descriptions of the separate commits for further information.
The `stride == 1` case can be computed more efficiently through `-p (mod a)`. That translates to a nice and short sequence of LLVM instructions:

```llvm
%address = ptrtoint i8* %p to i64
%negptr = sub i64 0, %address
%offset = and i64 %negptr, %a_minus_one
```

and produces pretty much ideal codegen when this function is used in isolation. Typical use of this function will, however, involve using the result to offset a pointer, i.e.

```llvm
%aligned = getelementptr inbounds i8, i8* %p, i64 %offset
```

This still looks very good, but LLVM does not really translate it into what would be considered ideal machine code (on any target). For example, this is the codegen we obtain for an unknown alignment:

```asm
; x86_64
dec rsi
mov rax, rdi
neg rax
and rax, rsi
add rax, rdi
```

In particular, negating a pointer is not something that CISC architectures like x86_64 are designed to optimise; they are much better at offsetting pointers. So we'd love to utilise this ability and produce code more like this:

```asm
; x86_64
lea rax, [rsi + rdi - 1]
neg rsi
and rax, rsi
```

To achieve this we need to give LLVM an opportunity to apply the various peephole optimisations it performs during DAG selection. In particular, the `and` instruction appears to be a major inhibitor here. We cannot, sadly, get rid of this load-bearing operation, but we can reorder operations so that LLVM has more to work with around this instruction. One such ordering is proposed in rust-lang#75579 and results in LLVM IR that looks broadly like this:

```llvm
; using add enables `lea` and similar CISCisms
%offset_ptr = add i64 %address, %a_minus_one
%mask = sub i64 0, %a
%masked = and i64 %offset_ptr, %mask
; can be folded with a `gepi` that may follow
%offset = sub i64 %masked, %address
```

…and generates the intended x86_64 machine code. One might also wonder how the increased amount of code would impact a RISC target. Turns out: not much:

```asm
; aarch64 previous        ; aarch64 new
sub x8, x1, #1            add x8, x1, x0
neg x9, x0                sub x8, x8, #1
and x8, x9, x8            neg x9, x1
add x0, x0, x8            and x0, x8, x9
```

(and similarly for ppc, sparc, mips, riscv, etc.)

The only target that seems to do worse is… wasm32.

On to actual measurements – the best way to evaluate snippets like these is to use llvm-mca. As the aarch64 assembly would lead one to suspect, there isn't any performance difference to be found there: both snippets execute in the same number of cycles for the CPUs I tried. On x86_64, however, we get a throughput improvement of more than 50%!
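For concreteness, the reordered computation expressed as plain Rust (a sketch of the arithmetic only, not the actual library code; `addr` stands for the pointer value and `a` for the power-of-two alignment):

```rust
/// Offset needed to align `addr` to `a`, computed in the order proposed
/// above: round the address up, mask it down, then subtract the address.
fn align_offset_reordered(addr: usize, a: usize) -> usize {
    debug_assert!(a.is_power_of_two());
    let offset_ptr = addr.wrapping_add(a - 1); // add i64 %address, %a_minus_one
    let mask = a.wrapping_neg();               // sub i64 0, %a
    let masked = offset_ptr & mask;            // and i64 %offset_ptr, %mask
    masked.wrapping_sub(addr)                  // sub i64 %masked, %address
}
```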
I happened to look more closely at the generated assembly for some code that uses `align_offset`, and noticed… `align_offset` does not compile as efficiently as one might hope for the case of "align pointer to `size_of::<usize>()`".

For example (ignore that I omit handling of `align_offset`'s error return value):
compiles to a veritable code mountain of branches and arithmetic (see the Godbolt link at the end for the full listing).
Whereas performing the same alignment manually (forgive my convoluted way of doing this; my usual pattern is very slightly different, and I don't have a memorized idiom for this that doesn't go through `usize` – since, well, I figured I just wanted to use `align_offset`):
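Again the snippet isn't preserved above; a plausible reconstruction of such a manual version, going through `usize` as described:

```rust
use core::mem::size_of;

/// Manually round `p` up to a `size_of::<usize>()` boundary.
pub fn align_by_hand(p: *const u8) -> *const u8 {
    let addr = p as usize;
    // Classic round-up-and-mask for a power-of-two alignment.
    let aligned = (addr + (size_of::<usize>() - 1)) & !(size_of::<usize>() - 1);
    aligned as *const u8
}
```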
compiles to
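A plausible listing for the manual version at opt-level=3 (a sketch assuming x86_64, where `size_of::<usize>() == 8`; the exact output depends on target and compiler version):

```asm
; round up to the next multiple of 8, then mask off the low bits
lea rax, [rdi + 7]
and rax, -8
ret
```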
Which is substantially better along a variety of metrics, including but not limited to actual runtime.
Taking a look at the source for `align_offset` reveals that it, uh, well, it does some stuff (rust/library/core/src/ptr/mod.rs, lines 1166 to 1271 at ac48e62).
To be fair, some of that presumably handles edge cases my version doesn't (e.g. when `p + USIZE - 1` wraps around the address space but the aligned value wouldn't).

Anyway, IIUC `align_offset` is really considered the way forward for all pointer aligning, as Miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned… (I have Opinions on this, but I'll spare you the rant.)
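For context, this is the kind of manual-alignment pattern Miri objects to (a sketch; whether it is accepted depends on Miri's provenance handling – see rust-lang/miri#1074 linked above):

```rust
/// Manual alignment via a usize round-trip. The address arithmetic is
/// fine, but the int-to-pointer cast loses provenance information, so
/// Miri can flag the dereference even though the address is in bounds.
unsafe fn read_first_aligned(p: *const u8) -> usize {
    let aligned = ((p as usize + 7) & !7) as *const usize;
    *aligned
}
```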
So, for that and a lot of reasons, I think we'd like the codegen for `align_offset` to look a lot closer to what I provided at opt-level=3, even if it means special-casing when `size_of::<T>() == 1`… (I mean, I'd also love for it not to generate the whole code mountain in debug builds, but one thing at a time, I guess.)

Anyway, the function's documentation comment tells me that @nagisa has taken up the sacred burden of "keeper of `align_offset`'s secrets"… I have questions: is this fixable? And if not, is there an interface we can provide that lets us produce good code here? Am I missing something?

P.S. Godbolt link: https://godbolt.org/z/388Enf