float rounding is slow #55107
Edit: reworded. If you use `(x + 0.5).floor()` as a replacement for `round`, the addition itself can round, producing incorrect results for some inputs. You can test this by enumerating all f32 values (the same holds for f64, but f64 is not as easily enumerable): https://gist.github.com/m1el/dc18a85b82357589825d9ffa4532b0a6
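The point about the addition rounding can be made concrete with a quick check. This is a sketch; the specific input value and the `naive_round` name are my own illustration, not from the comment:

```rust
// (x + 0.5).floor() is not a valid implementation of round():
// the addition itself rounds, and can carry a value across an integer
// boundary before floor() ever sees it.
fn naive_round(x: f32) -> f32 {
    (x + 0.5).floor()
}

fn main() {
    let x = f32::from_bits(0x3EFF_FFFF); // largest f32 below 0.5 (~0.49999997)
    // x + 0.5 is exactly 1 - 2^-25, a tie between 0.99999994 and 1.0;
    // ties-to-even picks 1.0, so the naive version rounds up incorrectly.
    assert_eq!(naive_round(x), 1.0); // wrong: the true value is below 0.5
    assert_eq!(x.round(), 0.0);      // correct
}
```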
@raphlinus, I'm actually surprised to discover this behavior. At the same time, people may also find this a good opportunity to implement the rest of the five rounding modes specified by IEEE 754: https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules. It would be ideal if LLVM could be the one handling all of this.
@m1el I'm afraid I don't see how that's relevant to this issue. Please file a separate bug if you think that behavior is incorrect.
@bstrie As far as I can tell, the behavior of the addition is as specified; I brought it up because it affects the obvious workaround.
@bstrie Right. I do think there are strong benefits to llvm implementing this; it's a lot easier to make auto-vectorization work, and then all of (round, floor, ceil) will be similar.

@m1el I do appreciate the heads-up; I hadn't considered the fact that the addition would round up on precision loss. I agree it's technically not relevant to this issue, but it's a good warning for people who might come across it. It's also a signal of the importance of fixing this problem, since the obvious workaround is flawed.

Personally I would be fine with changing the rounding-on-half behavior. The number of people who will get hit by the performance cost and the inconsistency with SIMD is large; the number of people who actively depend on the current round-half-away-from-zero behavior is, I suspect, much smaller.
Poking around, there may be some tentative LLVM support for this after all, though it's marked "experimental": https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics https://llvm.org/docs/LangRef.html#id1888 https://llvm.org/docs/LangRef.html#id1939 @raphlinus , I don't know a lot about how rounding modes work at a low-level, but does that look like it might be a promising start? |
@bstrie At first glance, these don't seem to be what we're looking for; these experimental intrinsics appear to affect the rounding mode for precision loss in other floating-point operations. Also, I didn't see round-half-to-even behavior specifically listed. One thing we should probably do before nailing down new semantics for rounding is research what other chips implement performantly (I'm personally mostly concerned about arm). Want me to look into that?
I'm still not convinced the experimental constrained intrinsics are the right fit. I verified with llc and hand-editing of IR code what the current lowering looks like. We already have intrinsics::nearbyintf32, which generates llvm.nearbyint.f32.

So it looks like the way forward is to push on llvm for a new intrinsic. Having started reading their stuff, this is pretty deep water; it has to be supported through many optimization and codegen components, and there needs to be a fallback implementation for all targets, because this is not something that's represented anywhere in libm. Another possibility to explore is to expose nearbyint directly.

Lastly, I checked the availability of these operations on arm64, and it's much nicer than Intel: the FRINTN instruction exists in both scalar and vector forms, with "ties to even" behavior, as do instructions for all the libm rounding functions. So I believe this is a small amount of additional support for the idea that it should be exposed as an intrinsic in llvm.
I also looked at wasm; here's what I found. The short story is that the current codegen is both wrong (for emscripten) and inefficient. The code generated for wasm32-unknown-emscripten for round is incorrect. (Note: edited, I thought I had a clever fix, but it was a testing error.)

For wasm32-unknown-unknown, the generated code seems to match japaric/libm. I haven't checked this code for correctness, but it looks pretty carefully written and very, very slow (it's something like 92 wasm instructions with lots of branches). (Edit: I did verify that it produces results identical to libm, with the exception that std produces negative zero for negative inputs near zero while japaric/libm produces positive zero.)

In any case, that's the current situation. The designers of wasm are also fans of round-half-to-even behavior, and so wasm includes a nearest instruction with exactly that semantics. Lastly, I found that there is already an architecture-specific intrinsic for this in llvm.
I branched bugs to Emscripten and japaric/libm, referenced above. I have also developed a workaround. I believe this is the best implementation of round on x86 and wasm (on aarch64, the FRINTA instruction is better):

```rust
fn round(x: f32) -> f32 {
    ((x.abs() + (0.25 - 0.5 * f32::EPSILON)) + (0.25 + 0.5 * f32::EPSILON)).floor().copysign(x)
}
```

However, it requires exposing the `copysign` operation, which Rust doesn't currently do. A significant advantage of the above implementation (with copysign) is that it should be auto-vectorizable, which the current llvm.round.f32 intrinsic is not. Arguably this codegen choice should be made in llvm rather than rust, considering that arm needs to be special-cased. I also plan to file an issue against llvm asking that the FRINTN (aarch64) / nearby (wasm) intrinsic be added.

Lastly, I'd like to nip the naming bikeshed in the bud by declaring that the name for this function in the Rust standard library be "propinquitous".
Also note that my idea (propagated by Twitter) is being reviewed in Julia: JuliaLang/julia#29700 |
In case you're interested, we had a long argument in Julia about the specific behaviour of round. As you've discussed, on x86 SSE4.1 hardware (basically anything since 2011) there is actually an instruction for this, but LLVM doesn't expose an intrinsic. We settled on just using the hardware round-to-even behaviour. If you really want to keep C behaviour, then some options (in Julia notation, but hopefully it should be clear what they mean) are:
or
I think we settled on the latter because it sets the floating point status flags correctly, but if you're not worried about that, I suspect the first will be faster. |
@simonbyrne Thanks, that is interesting for sure, as is the pointer to the C proposal. I think that makes the case for LLVM implementing this even stronger. The Rust translation of the first of the two code snippets is: `(x + (0.5 - 0.25 * f32::EPSILON).copysign(x)).trunc()`

This is, I think, the best code for C-semantics round compatibility. I was wondering how I missed this, because I do remember testing this. I think I've reconstructed my thinking: I was originally looking for an implementation that rounds up on tie break, and this isn't quite that. My personal preference is still changing the semantics of round to round-to-even, because I think "round" is much more discoverable than whatever other name we come up with. But we'd want to get a good handle on how much breakage that would cause before making such a change.
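Wrapped into a function for testing, the one-liner above behaves like C's `round`. This is a sketch using the `copysign` method that has since been stabilized; the `round_c` name is mine:

```rust
// C-semantics round: add the largest float strictly below 0.5, with the
// sign of x, then truncate toward zero. 0.5 - 0.25 * f32::EPSILON is
// exactly the predecessor of 0.5 in f32.
fn round_c(x: f32) -> f32 {
    (x + (0.5 - 0.25 * f32::EPSILON).copysign(x)).trunc()
}

fn main() {
    assert_eq!(round_c(0.5), 1.0);   // ties away from zero, like C round
    assert_eq!(round_c(-0.5), -1.0);
    assert_eq!(round_c(2.5), 3.0);
    assert_eq!(round_c(1.4), 1.0);
}
```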
For what it’s worth, Python changed from C behavior in Python 2 to round-to-even in Python 3, and it doesn’t seem to have caused many issues.
I decided to bench and test different implementations of round. Here's the program I used: https://gist.github.com/m1el/873de570c1af6131f707a850108c6ced The results on my machine (Core i7-5820K @ 3.0 GHz):
All functions were tested to produce bit-equivalent results with the standard library.
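For readers who want a quick, self-contained comparison without the gist, here is a rough timing sketch. This is not m1el's harness, the `bench` helper is my own, and absolute numbers will vary by machine and codegen flags:

```rust
use std::time::Instant;

// Time a rounding implementation over a block of values. The accumulated
// sum is returned and printed so the loop isn't optimized away entirely;
// treat the output as a rough indication only, not a rigorous benchmark.
fn bench(name: &str, f: impl Fn(f32) -> f32) -> f32 {
    let start = Instant::now();
    let mut acc = 0.0f32;
    let mut x = -1000.0f32;
    while x < 1000.0 {
        acc += f(x);
        x += 0.01;
    }
    println!("{}: {:?}", name, start.elapsed());
    acc
}

fn main() {
    let a = bench("std round", |x| x.round());
    let b = bench("add-floor", |x| (x + 0.5).floor());
    // The two implementations agree except on ties/precision-loss cases.
    println!("sums: {} {}", a, b);
}
```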
@m1el brings up a good point which I didn't clarify above. I don't have time today to dig very deeply, but there are three cases we have to consider, depending on the target's available rounding instructions.
Whose responsibility is this, ours or llvm? A case can be made for either, but a problem with the latter is that they're constrained to respect C semantics for things like status flags for inexact rounding, and (I believe) we're not. So we might be able to generate better code based on explicit casing of target_cpu. |
LLVM is not constrained by floating-point exception flags, or even by settings for dynamic rounding modes. In fact it's the opposite: LLVM has always gladly ignored these issues and is only now slowly growing an additional set of intrinsics that do respect those things. The long-existing LLVM math intrinsics (e.g. `llvm.floor.f32`) simply ignore them.
I'm surprised this wasn't exposed already. One way to get round-to-even without native intrinsics (again, apologies for Julia code, but hopefully it's reasonably obvious):
Of course, this assumes no x87 excess-precision business, but in that case you should just use the hardware's own rounding instruction.
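The Julia snippet referred to above was lost in extraction. Based on the x87 caveat, what it shows is presumably the classic "magic number" trick; a Rust sketch of that technique (my reconstruction, not the original code):

```rust
// Magic-number rounding: adding 1.5 * 2^23 forces the sum's ulp to 1,
// so the FPU rounds to an integer using the current rounding mode
// (ties to even by default); subtracting the constant recovers the result.
// Only valid for |x| < 2^22, and it assumes no x87 excess precision.
// Note: the sign of -0.0 is lost for inputs in -0.5..=0.0.
fn round_ties_even(x: f32) -> f32 {
    const MAGIC: f32 = 12_582_912.0; // 1.5 * 2^23
    debug_assert!(x.abs() < 4_194_304.0); // 2^22
    (x + MAGIC) - MAGIC
}

fn main() {
    assert_eq!(round_ties_even(0.5), 0.0); // tie goes to the even integer
    assert_eq!(round_ties_even(1.5), 2.0);
    assert_eq!(round_ties_even(2.5), 2.0);
    assert_eq!(round_ties_even(-1.5), -2.0);
}
```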
Add a `copysign` function to f32 and f64 This patch adds a `copysign` function to the float primitive types. It is an exceptionally useful function for writing efficient numeric code, as it often avoids branches, is auto-vectorizable, and there are efficient intrinsics for most platforms. I think this might work as-is, as the relevant `copysign` intrinsic is already used internally for the implementation of `signum`. It's possible that an implementation might be needed in japaric/libm for portability across all platforms, in which case I'll do that also. Part of the work towards rust-lang#55107
@simonbyrne It doesn’t seem like this can overflow, and as far as I can tell it keeps giving correct results for arbitrarily large floats, so you can skip your conditional check and just do
https://observablehq.com/d/db47dc21f691d5e9 The only slight subtlety is deciding what the correct behavior should be for values in the range −0.5..−0. Should these round to −0 or +0? |
That doesn't work for values in the binade where the spacing between adjacent floats is exactly 0.5.
I think the convention is the sign should always match. |
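A tiny sketch of that sign-matching convention (the `sign_matching_round` name is mine): rounding the magnitude and copying the input's sign back preserves −0 for inputs in −0.5..−0:

```rust
// Rounding the absolute value and copying x's sign back preserves -0.0
// for inputs in -0.5..-0.0, so the result's sign always matches x.
fn sign_matching_round(x: f32) -> f32 {
    x.abs().round().copysign(x)
}

fn main() {
    let r = sign_matching_round(-0.2);
    assert_eq!(r, 0.0);            // numerically zero (-0.0 == 0.0)
    assert!(r.is_sign_negative()); // but the sign matches the input
}
```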
@simonbyrne How about this one?
The great thing about
(I apologize again for spamming this thread with Julia code...) |
I’m a little confused about what that code is doing; it doesn’t look like what I expected. But okay, darn. I wonder if there’s anything else we can do. (Still, if you need the conditional, the subtractive version is probably still better to use, since the other branch should be pretty rare.)
Good point, now updated. |
…=pnkfelix,m-ou-se,scottmcm Add `round_ties_even` to `f32` and `f64` Tracking issue: rust-lang#96710 Redux of rust-lang#82273. See also rust-lang#55107 Adds a new method, `round_ties_even`, to `f32` and `f64`, that rounds the float to the nearest integer, rounding halfway cases to the number with an even least significant bit. Uses the `roundeven` LLVM intrinsic to do this. Of the five IEEE 754 rounding modes, this is the only one that doesn't already have a round-to-integer function exposed by Rust (the others are `round`, `floor`, `ceil`, and `trunc`). Ties-to-even is also the rounding mode used for int-to-float and float-to-float `as` casts, as well as float arithmetic operations. So not having an explicit rounding method for it seems like an oversight. Bikeshed: this PR currently uses `round_ties_even` for the name of the method. But maybe `round_ties_to_even` is better, or `round_even`, or `round_to_even`?
#96710 seems to largely close this issue; changing the semantics of the existing `round` now seems unlikely to happen.
The scalar fallback for the sinewave benchmark in fearless_simd is very slow as of the current commit, and the reason is the f32::round() operation. When that's changed to (x + 0.5).floor() it goes from 1622ns to 347ns, and 205ns with target_cpu=haswell. With default x86_64 cpu, floorf() is a function call, but it's an efficient one. The asm of roundf() that I looked at was very unoptimized (it moved the float value into int registers and did bit fiddling there). In addition, round() doesn't get auto-vectorized, but floor() does.
I think there's a rich and sordid history behind this. The C standard library has 3 different functions for rounding: `round`, `rint`, and `nearbyint`. Of these, the first rounds values with a 0.5 fraction away from zero, and the other two use the stateful rounding-direction mode. This last is arguably a wart on C, and it's a good thing the idea doesn't exist in Rust. In any case, the default mode is FE_TONEAREST, which rounds these values to the nearest even integer (see the GNU libc documentation and Wikipedia; the latter does a reasonably good job of motivating why you'd want to do this; the tl;dr is that it avoids some biases).

The implementation of f32::floor is usually intrinsics::floorf32 (but it's intrinsics::floorf64 on msvc, for reasons described there). That in turn is llvm.floor.f32. Generally the other rounding functions are similar, until it gets to llvm. Inside llvm, one piece of evidence that "round" is special is that it's not in the list of intrinsics that get auto-vectorized.

Neither the C standard library nor llvm intrinsics have a function that rounds with "round half to even" behavior. This is arguably a misfeature. A case can be made that Rust should have this function; in cases where a recent Intel CPU is set as target_cpu or target_feature, it compiles to `roundps $8` (analogous to `$9` and `$a` for floor and ceil, respectively), and in compatibility mode the asm shouldn't be any slower than the existing code. I haven't investigated non-x86 architectures, though.

For signal processing (the main use case of fearless_simd) I don't care much about the details of rounding of exact 0.5 fraction values; I just want rounding to be fast. Thus, I think I'll use the _mm_round intrinsics in simd mode (with round-half-to-even behavior) and (x + 0.5).floor() in fallback mode (with round-half-up behavior). It's not the case now (where I call f32::round) that the rounding behavior matches the SIMD case anyway. If there were a function with "round half to even" behavior, it would match the SIMD, would auto-vectorize well, and would have dramatically better performance with a modern target_cpu.
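The plan described here can be sketched as follows. This is my own illustration, not code from the issue: use the SSE4.1 `_mm_round_ps` intrinsic when available (the `roundps $8` encoding mentioned above), with a scalar ties-to-even fallback based on the magic-number trick (valid for |x| < 2^22). The `round_half_even_x4` name is hypothetical:

```rust
#[cfg(target_arch = "x86_64")]
fn round_half_even_x4(xs: [f32; 4]) -> [f32; 4] {
    #[target_feature(enable = "sse4.1")]
    unsafe fn simd(xs: [f32; 4]) -> [f32; 4] {
        use std::arch::x86_64::*;
        let v = _mm_loadu_ps(xs.as_ptr());
        // Immediate 8 = round to nearest (ties to even), suppress exceptions.
        let r = _mm_round_ps::<{ _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC }>(v);
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), r);
        out
    }
    if is_x86_feature_detected!("sse4.1") {
        unsafe { simd(xs) }
    } else {
        // Scalar fallback with matching ties-to-even semantics, |x| < 2^22.
        xs.map(|x| (x + 12_582_912.0_f32) - 12_582_912.0_f32)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn round_half_even_x4(xs: [f32; 4]) -> [f32; 4] {
    xs.map(|x| (x + 12_582_912.0_f32) - 12_582_912.0_f32)
}

fn main() {
    // Halfway cases go to the even integer on both code paths.
    assert_eq!(round_half_even_x4([0.5, 1.5, 2.5, -1.5]), [0.0, 2.0, 2.0, -2.0]);
}
```

Both paths produce identical results on ties, which is exactly the scalar/SIMD consistency the issue asks for.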