ptr::align_offset generates surprisingly bad code #75579

thomcc · 2020-08-16T05:24:05Z

I happened to look at more closely at the generated assembly for some code that uses align_offset, and noticed... align_offset does not compile as efficiently as one might hope for the case of "align pointer to size_of::<usize>()"

For example (ignore that I omit handling of align_offset's error return value):

pub unsafe fn align_offset(p: *const u8) -> *const u8 {
    p.add(p.align_offset(core::mem::size_of::<usize>()))
}

compiles to

example::align_offset:
        movl    %edi, %ecx
        andl    $7, %ecx
        movl    $8, %eax
        subq    %rcx, %rax
        testq   %rcx, %rcx
        cmoveq  %rcx, %rax
        addq    %rdi, %rax
        retq

Whereas performing the same alignment manually (forgive my convoluted way of doing this, my usual pattern is very slightly different and I don't have a memorized idiom for this without going through usize, since, well, I figured I just wanted to use align_offset):

pub unsafe fn manual_align_offset(p: *const u8) -> *const u8 {
    // IIRC just doing `arithmetic(p as usize) as *const u8` makes LLVM sad
    let aligned = ((p as usize + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)) as isize;
    p.offset(aligned - (p as isize))
}

compiles to

example::manual_align_offset:
        leaq    7(%rdi), %rax
        andq    $-8, %rax
        retq

Which is substantially better along a variety of metrics, including but not limited to actual runtime.

Taking a look at the source for align_offset reveals that it uh, well it does some stuff.

rust/library/core/src/ptr/mod.rs

Lines 1166 to 1271 in ac48e62

    
           /// Any questions go to @nagisa. 
        
           #[lang = "align_offset"] 
        
           pub(crate) unsafe fn align_offset<T: Sized>(p: *const T, a: usize) -> usize { 
        
               /// Calculate multiplicative modular inverse of `x` modulo `m`. 
        
               /// 
        
               /// This implementation is tailored for align_offset and has following preconditions: 
        
               /// 
        
               /// * `m` is a power-of-two; 
        
               /// * `x < m`; (if `x ≥ m`, pass in `x % m` instead) 
        
               /// 
        
               /// Implementation of this function shall not panic. Ever. 
        
               #[inline] 
        
               fn mod_inv(x: usize, m: usize) -> usize { 
        
                   /// Multiplicative modular inverse table modulo 2⁴ = 16. 
        
                   /// 
        
                   /// Note, that this table does not contain values where inverse does not exist (i.e., for 
        
                   /// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.) 
        
                   const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15]; 
        
                   /// Modulo for which the `INV_TABLE_MOD_16` is intended. 
        
                   const INV_TABLE_MOD: usize = 16; 
        
                   /// INV_TABLE_MOD² 
        
                   const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD; 
        
                   let table_inverse = INV_TABLE_MOD_16[(x & (INV_TABLE_MOD - 1)) >> 1] as usize; 
        
                   if m <= INV_TABLE_MOD { 
        
                       table_inverse & (m - 1) 
        
                   } else { 
        
                       // We iterate "up" using the following formula: 
        
                       // 
        
                       // $$ xy ≡ 1 (mod 2ⁿ) → xy (2 - xy) ≡ 1 (mod 2²ⁿ) $$ 
        
                       // 
        
                       // until 2²ⁿ ≥ m. Then we can reduce to our desired `m` by taking the result `mod m`. 
        
                       let mut inverse = table_inverse; 
        
                       let mut going_mod = INV_TABLE_MOD_SQUARED; 
        
                       loop { 
        
                           // y = y * (2 - xy) mod n 
        
                           // 
        
                           // Note, that we use wrapping operations here intentionally – the original formula 
        
                           // uses e.g., subtraction `mod n`. It is entirely fine to do them `mod 
        
                           // usize::MAX` instead, because we take the result `mod n` at the end 
        
                           // anyway. 
        
                           inverse = inverse.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inverse))); 
        
                           if going_mod >= m { 
        
                               return inverse & (m - 1); 
        
                           } 
        
                           going_mod = going_mod.wrapping_mul(going_mod); 
        
                       } 
        
                   } 
        
               } 
        
               let stride = mem::size_of::<T>(); 
        
               let a_minus_one = a.wrapping_sub(1); 
        
               let pmoda = p as usize & a_minus_one; 
        
               if pmoda == 0 { 
        
                   // Already aligned. Yay! 
        
                   return 0; 
        
               } 
        
               if stride <= 1 { 
        
                   return if stride == 0 { 
        
                       // If the pointer is not aligned, and the element is zero-sized, then no amount of 
        
                       // elements will ever align the pointer. 
        
                       !0 
        
                   } else { 
        
                       a.wrapping_sub(pmoda) 
        
                   }; 
        
               } 
        
               let smoda = stride & a_minus_one; 
        
               // SAFETY: a is power-of-two so cannot be 0. stride = 0 is handled above. 
        
               let gcdpow = unsafe { intrinsics::cttz_nonzero(stride).min(intrinsics::cttz_nonzero(a)) }; 
        
               let gcd = 1usize << gcdpow; 
        
               if p as usize & (gcd.wrapping_sub(1)) == 0 { 
        
                   // This branch solves for the following linear congruence equation: 
        
                   // 
        
                   // ` p + so = 0 mod a ` 
        
                   // 
        
                   // `p` here is the pointer value, `s` - stride of `T`, `o` offset in `T`s, and `a` - the 
        
                   // requested alignment. 
        
                   // 
        
                   // With `g = gcd(a, s)`, and the above asserting that `p` is also divisible by `g`, we can 
        
                   // denote `a' = a/g`, `s' = s/g`, `p' = p/g`, then this becomes equivalent to: 
        
                   // 
        
                   // ` p' + s'o = 0 mod a' ` 
        
                   // ` o = (a' - (p' mod a')) * (s'^-1 mod a') ` 
        
                   // 
        
                   // The first term is "the relative alignment of `p` to `a`" (divided by the `g`), the second 
        
                   // term is "how does incrementing `p` by `s` bytes change the relative alignment of `p`" (again 
        
                   // divided by `g`). 
        
                   // Division by `g` is necessary to make the inverse well formed if `a` and `s` are not 
        
                   // co-prime. 
        
                   // 
        
                   // Furthermore, the result produced by this solution is not "minimal", so it is necessary 
        
                   // to take the result `o mod lcm(s, a)`. We can replace `lcm(s, a)` with just a `a'`. 
        
                   let a2 = a >> gcdpow; 
        
                   let a2minus1 = a2.wrapping_sub(1); 
        
                   let s2 = smoda >> gcdpow; 
        
                   let minusp2 = a2.wrapping_sub(pmoda >> gcdpow); 
        
                   return (minusp2.wrapping_mul(mod_inv(s2, a2))) & a2minus1; 
        
               } 
        
               // Cannot be aligned at all. 
        
               usize::MAX 
        
           }

... Anyway I'm just gonna take at face value that this all is needed since it tracks that sometimes you might need to GCD... And hell, maybe mine misses a case like "p + USIZE - 1 wraps around the address space but the aligned value wouldn't).

Anyway IIUC align_offset is really considered the way forward for all pointer aligning, as miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned... (I have Opinions on this but I'll spare you from a rant).

So, for that and a lot of reasons, I think we'd like the codegen for align_offset to look a lot closer to what I provided at opt-level=3, even if it means special-casing when size_of::<T> == 1... (I mean, I'd also love for it not to generate the whole code mountain in debug builds, but one thing at a time I guess).

Anyway, the function's documentation comment tells me that @nagisa has taken up the sacred burden of "keeper of align_offset's secrets"... I have questions: is this fixable? And if not, is there an interface that we can provide that lets us produce good code here? Am I missing something?

P.S. Godbolt link: https://godbolt.org/z/388Enf

The text was updated successfully, but these errors were encountered:

RalfJung · 2020-08-16T08:42:11Z

miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned...

FWIW this is something we'd like to fix (rust-lang/miri#1074).

nagisa · 2020-08-16T12:26:52Z

is this fixable?

Most of the function is constant-evaluatable (with the only unknown usually being the pointer value) so there is no reason it could not have a special-case (based on input types) for cases where a more optimal implementation is possible.

Improve codegen for `align_offset` In this PR the `align_offset` implementation is changed/improved to produce better code in certain scenarios such as when pointer type is has a stride of 1 or when building for low optimisation levels. While these changes do not achieve the "ideal" codegen referenced in rust-lang#75579, it gets significantly closer to it. I’m not actually sure if the codegen can actually be much better with this function returning the offset, rather than the aligned pointer. See the descriptions for separate commits for further information.

`stride == 1` case can be computed more efficiently through `-p (mod a)`. That, then translates to a nice and short sequence of LLVM instructions: %address = ptrtoint i8* %p to i64 %negptr = sub i64 0, %address %offset = and i64 %negptr, %a_minus_one And produces pretty much ideal code-gen when this function is used in isolation. Typical use of this function will, however, involve use of the result to offset a pointer, i.e. %aligned = getelementptr inbounds i8, i8* %p, i64 %offset This still looks very good, but LLVM does not really translate that to what would be considered ideal machine code (on any target). For example that's the codegen we obtain for an unknown alignment: ; x86_64 dec rsi mov rax, rdi neg rax and rax, rsi add rax, rdi In particular negating a pointer is not something that’s going to be optimised for in the design of CISC architectures like x86_64. They are much better at offsetting pointers. And so we’d love to utilize this ability and produce code that's more like this: ; x86_64 lea rax, [rsi + rdi - 1] neg rsi and rax, rsi To achieve this we need to give LLVM an opportunity to apply its various peep-hole optimisations that it does during DAG selection. In particular, the `and` instruction appears to be a major inhibitor here. We cannot, sadly, get rid of this load-bearing operation, but we can reorder operations such that LLVM has more to work with around this instruction. One such ordering is proposed in rust-lang#75579 and results in LLVM IR that looks broadly like this: ; using add enables `lea` and similar CISCisms %offset_ptr = add i64 %address, %a_minus_one %mask = sub i64 0, %a %masked = and i64 %offset_ptr, %mask ; can be folded with `gepi` that may follow %offset = sub i64 %masked, %address …and generates the intended x86_64 machine code. One might also wonder how the increased amount of code would impact a RISC target. Turns out not much: ; aarch64 previous ; aarch64 new sub x8, x1, #1 add x8, x1, x0 neg x9, x0 sub x8, x8, #1 and x8, x9, x8 neg x9, x1 add x0, x0, x8 and x0, x8, x9 (and similarly for ppc, sparc, mips, riscv, etc) The only target that seems to do worse is… wasm32. Onto actual measurements – the best way to evaluate snippets like these is to use llvm-mca. Much like Aarch64 assembly would allow to suspect, there isn’t any performance difference to be found. Both snippets execute in same number of cycles for the CPUs I tried. On x86_64, we get throughput improvement of >50%, however!

thomcc added the C-bug Category: This is a bug. label Aug 16, 2020

jonas-schievink added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Aug 16, 2020

nagisa mentioned this issue Aug 16, 2020

Improve codegen for align_offset #75600

Merged

nagisa mentioned this issue Aug 20, 2020

Optimise align_offset for stride=1 further #75728

Merged

bors closed this as completed in 69e68cf Oct 26, 2020

thomcc mentioned this issue Jul 2, 2022

align_offset seems to generate worse code than is desirable (unless size_of::<T>() == 1) #98809

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ptr::align_offset generates surprisingly bad code #75579

ptr::align_offset generates surprisingly bad code #75579

thomcc commented Aug 16, 2020 •

edited

Loading

RalfJung commented Aug 16, 2020

nagisa commented Aug 16, 2020 •

edited

Loading

ptr::align_offset generates surprisingly bad code #75579

ptr::align_offset generates surprisingly bad code #75579

Comments

thomcc commented Aug 16, 2020 • edited Loading

RalfJung commented Aug 16, 2020

nagisa commented Aug 16, 2020 • edited Loading

thomcc commented Aug 16, 2020 •

edited

Loading

nagisa commented Aug 16, 2020 •

edited

Loading