Add a special case for align_offset /w stride != 1 #98866
f495c37 to 3d98adb
    let byte_offset = wrapping_sub(aligned_address, addr);
    // SAFETY: `stride` is non-zero. This is guaranteed to divide exactly as well, because
    // addr has been verified to be aligned to the original type’s alignment requirements.
    unsafe { exact_div(byte_offset, stride) }
Thinking about it again, there may be a better way to compute the element offset in the first place rather than dividing the byte offset. I believe this won’t regress the `stride == 1` case anyway, and this division won’t actually appear in most code that immediately passes the result of this function to an `offset`, so I wouldn’t consider it a blocker for the time being.
r=me modulo perf: @bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 3d98adb558215e0f5078c33d60760bad1295481b with merge da5de824feffc6dd3116d84c8dbc8083dcd61358...
It might also make sense to add an asm or LLVM test verifying the optimization here is working, so that we don't unintentionally regress in the future.
☀️ Try build successful - checks-actions
Queued da5de824feffc6dd3116d84c8dbc8083dcd61358 with parent 87588a2, future comparison URL.
Finished benchmarking commit (da5de824feffc6dd3116d84c8dbc8083dcd61358): comparison url.
Instruction count: This benchmark run did not return any relevant results for this metric.
Max RSS (memory usage): Results
Cycles: Results
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
I'll hold off on approving in case you want to add the codegen test, but otherwise r=me. Perf looks neutral, which is not unexpected: this kind of thing is likely to have marginal impact at best for something as large as rustc.
3d98adb to 36a96bb
@bors r=Mark-Simulacrum
📌 Commit 36a96bb6f67d2c48a36988d742cba02003eaab98 has been approved. It is now in the queue for this repository.
⌛ Testing commit 36a96bb6f67d2c48a36988d742cba02003eaab98 with merge fc0ddaf7ea2f0db718d1bdb7693eabf84d197611...
💔 Test failed - checks-actions
36a96bb to fe2a05e
@bors r=Mark-Simulacrum
📌 Commit fe2a05e53ef813151f6314a8bfa57c790fafec6e has been approved. It is now in the queue for this repository.
This generalizes the previous `stride == 1` special case to apply to any situation where the requested alignment is divisible by the stride. This in turn allows the test case from rust-lang#98809 to produce ideal assembly, along the lines of:

    leaq 15(%rdi), %rax
    andq $-16, %rax

This also produces pretty high quality code for situations where the alignment of the input pointer isn’t known:

    pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 {
        slice.offset(slice.align_offset(16) as isize)
    }

    // =>

    movl %edi, %eax
    andl $3, %eax
    leaq 15(%rdi), %rcx
    andq $-16, %rcx
    subq %rdi, %rcx
    shrq $2, %rcx
    negq %rax
    sbbq %rax, %rax
    orq %rcx, %rax
    leaq (%rdi,%rax,4), %rax

Here LLVM is smart enough to replace the `usize::MAX` special case with a branch-less bitwise-OR approach, where the mask is constructed using the neg and sbb instructions. This appears to work across various architectures I’ve tried.

This change ends up introducing more branches and code in situations where there is less knowledge of the arguments. For example when the requested alignment is entirely unknown. This use-case was never really a focus of this function, so I’m not particularly worried, especially since llvm-mca is saying that the new code is still appreciably faster, despite all the new branching.

Fixes rust-lang#98809.

Sadly, this does not help with rust-lang#72356.
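For readers who want to poke at the special case outside of `core`, here is a minimal sketch of the idea, not the code from this PR. It works on plain `usize` addresses instead of pointers, and the name `align_offset_sketch` and the linear-search fallback are assumptions made for illustration only.

```rust
/// Hypothetical stand-in for the pointer-based library routine. It works on
/// plain addresses so that it can run anywhere. `align` must be a power of
/// two and `stride` (the element size in bytes) must be non-zero.
pub fn align_offset_sketch(addr: usize, stride: usize, align: usize) -> usize {
    assert!(align.is_power_of_two(), "align must be a power of two");
    assert!(stride != 0, "zero-sized elements are not handled here");

    if align % stride == 0 {
        // Special case described above: the alignment is a multiple of the stride.
        if addr % stride != 0 {
            // Stepping by `stride` never changes `addr % stride`, while every
            // multiple of `align` is 0 modulo `stride`, so no step count works.
            return usize::MAX;
        }
        // Round the address up to the next multiple of `align`.
        let aligned_address = addr.wrapping_add(align - 1) & !(align - 1);
        let byte_offset = aligned_address.wrapping_sub(addr);
        // Both addresses are multiples of `stride`, so this divides exactly.
        return byte_offset / stride;
    }

    // Assumed fallback for illustration only: a linear search over element
    // counts. This is not the algorithm the standard library uses.
    (0..align)
        .find(|&k| addr.wrapping_add(k.wrapping_mul(stride)) & (align - 1) == 0)
        .unwrap_or(usize::MAX)
}
```

With these definitions, `align_offset_sketch(0x1004, 4, 16)` is 3 (three `u32` elements reach `0x1010`), while a misaligned address such as `0x1002` yields `usize::MAX`.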
fe2a05e to 62a182c
@bors r=Mark-Simulacrum
The test is definitely turning out to be as finicky as I feared it would be.
☀️ Test successful - checks-actions
Finished benchmarking commit (db41351): comparison url.
Instruction count
Max RSS (memory usage): Results
Cycles: Results
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.
Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression
Given this is a change in library code, and it only impacts one secondary benchmark (deeply-nested-multi), I'm going to mark this as triaged.
@rustbot label: +perf-regression-triaged
Edit: it looks like it's just noise, corrected from the previous run.
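This generalizes the previous `stride == 1` special case to apply to any
situation where the requested alignment is divisible by the stride. This
in turn allows the test case from #98809 to produce ideal assembly, along
the lines of:

    leaq 15(%rdi), %rax
    andq $-16, %rax

This also produces pretty high quality code for situations where the
alignment of the input pointer isn’t known:

    pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 {
        slice.offset(slice.align_offset(16) as isize)
    }

    // =>

    movl %edi, %eax
    andl $3, %eax
    leaq 15(%rdi), %rcx
    andq $-16, %rcx
    subq %rdi, %rcx
    shrq $2, %rcx
    negq %rax
    sbbq %rax, %rax
    orq %rcx, %rax
    leaq (%rdi,%rax,4), %rax

Here LLVM is smart enough to replace the `usize::MAX` special case with
a branch-less bitwise-OR approach, where the mask is constructed using
the neg and sbb instructions. This appears to work across various
architectures I’ve tried.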
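The bitwise-OR trick can be spelled out in Rust for the `u32` with alignment 16 case. This is only a sketch of what the generated code appears to compute, written over plain addresses. The function names are hypothetical and the code is not taken from the PR.

```rust
/// Round `addr` up to a multiple of 16 and return the distance in bytes.
fn bytes_to_align16(addr: usize) -> usize {
    (addr.wrapping_add(15) & !15).wrapping_sub(addr)
}

/// Straightforward version with an explicit branch for the failure case.
pub fn u32_offset_branchy(addr: usize) -> usize {
    if addr & 3 == 0 {
        bytes_to_align16(addr) >> 2
    } else {
        usize::MAX
    }
}

/// Branch-free version: build a mask from the misalignment, then OR it in.
pub fn u32_offset_branchless(addr: usize) -> usize {
    let misalignment = addr & 3;
    // 0 when the address is u32-aligned, all ones (`usize::MAX`) otherwise.
    // Negating the 0-or-1 flag plays the role of the neg/sbb pair.
    let mask = usize::from(misalignment != 0).wrapping_neg();
    // OR-ing with all ones gives `usize::MAX`; OR-ing with 0 keeps the offset.
    mask | (bytes_to_align16(addr) >> 2)
}
```

Both functions agree on every address: an aligned input gets the element count to the next 16-byte boundary, and any misaligned input collapses to `usize::MAX` without a conditional jump in the source.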
This change ends up introducing more branches and code in situations
where there is less knowledge of the arguments. For example when the
requested alignment is entirely unknown. This use-case was never really
a focus of this function, so I’m not particularly worried, especially
since llvm-mca is saying that the new code is still appreciably faster,
despite all the new branching.
Fixes #98809.
Sadly, this does not help with #72356.