-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add vcgez, vcgtz, vclez, vcltz neon instructions #1069
Conversation
r? @Amanieu (rust-highfive has picked a reviewer for you, use r? to override) |
crates/stdarch-gen/neon.spec
Outdated
multi_fn = fixed, c:in_t | ||
multi_fn = fixed_2, d:in_t | ||
multi_fn = simd_shr, e:, a, transmute(c) | ||
multi_fn = simd_xor, transmute(e), transmute(d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use simd_ge
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to be consistent with Clang's implementation. The following is the test I did in https://godbolt.org/:
#include <arm_neon.h>
int test() {
return (int) vcgez_s32;
}
And the Output:
define dso_local i32 @test() local_unnamed_addr #0 {
ret i32 ptrtoint (<2 x i32> (<2 x i32>)* @vcgez_s32 to i32)
}
define internal <2 x i32> @vcgez_s32(<2 x i32> %0) #1 {
%2 = ashr <2 x i32> %0, <i32 31, i32 31>
%3 = xor <2 x i32> %2, <i32 -1, i32 -1>
ret <2 x i32> %3
}
attributes #0 = { norecurse nounwind readnone uwtable willreturn "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+neon" }
attributes #1 = { alwaysinline norecurse nounwind readnone uwtable willreturn "frame-pointer"="non-leaf" "min-legal-vector-width"="64" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+neon" }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you compile with -O0
you will see that Clang actually emits an icmp sge
. LLVM optimizations are then turning this into a shift + xor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The url of godbolt is from here: #148
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That usually works, but in this particular case it gives a different result because the IR is not the one generated by Clang directly: it is the IR after LLVM has run optimization passes that expand the icmp eq
into shift and xor.
You can use simd_ge
in Rust and it will produce the same IR as Clang.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm.. That's right. If we use an implementation consistent with -O0, can we ensure that LLVM achieves the same optimization? If so, we should indeed use simd_ge IMO
[Edit] OK, got it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We run the same LLVM passes as Clang (mostly) so rustc will also transform simd_ge
into a shift + xor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation!
crates/stdarch-gen/neon.spec
Outdated
/// Compare signed less than zero | ||
name = vcltz | ||
multi_fn = fixed, b:in_t | ||
multi_fn = simd_shr, c:in_t, a, transmute(b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And simd_lt
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, the following is my test in https://godbolt.org/:
#include <arm_neon.h>
int test() {
return (int) vcltz_s32;
}
And the Output:
define dso_local i32 @test() local_unnamed_addr #0 {
ret i32 ptrtoint (<2 x i32> (<2 x i32>)* @vcltz_s32 to i32)
}
define internal <2 x i32> @vcltz_s32(<2 x i32> %0) #1 {
%2 = ashr <2 x i32> %0, <i32 31, i32 31>
ret <2 x i32> %2
}
attributes #0 = { norecurse nounwind readnone uwtable willreturn "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+neon" }
attributes #1 = { alwaysinline norecurse nounwind readnone uwtable willreturn "frame-pointer"="non-leaf" "min-legal-vector-width"="64" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+neon" }
Can you add ARM versions of these functions? |
It seems that these instructions are unique to aarch64 and only accept signed parameters. I can't compile the version of arm on godbolt either. |
You are right. |
All are automatically generated single-parameter comparison instructions. In order to be consistent with the implementation in Clang, some changes have been made to stdarch-gen.