Document guarantees (or lack thereof) regarding sign, quietness, and payload of `NaN`s #73328
This also affects the documentation for the methods in #72568. |
Related LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=45152 |
…r=Mark-Simulacrum

Run standard library unit tests without optimizations in `nopt` CI jobs

This was discussed in rust-lang#73288 as a way to catch similar issues in the future. This builds an unoptimized standard library with the bootstrap compiler and runs the unit tests. This takes about 2 minutes on my laptop. I confirmed that this method works locally, although there may be a better way of implementing it. It would be better to use the stage 2 compiler instead of the bootstrap one.

Notably, there are currently four `libstd` unit tests that fail in debug mode on `i686-unknown-linux-gnu` (a tier one target):

```
failures:
    f32::tests::test_float_bits_conv
    f32::tests::test_total_cmp
    f64::tests::test_float_bits_conv
    f64::tests::test_total_cmp
```

These are the tests that prompted rust-lang#73288 as well as the ones added in rust-lang#72568, which is currently broken due to rust-lang#73328.
This seems... way too conservative. I know it's trying to make the best of a bad situation, and I'm sympathetic here, but please realize how hard overly broad unspecified behavior like this makes it to write robust code. (As a user of Rust who came to it from C, this feels like the same kind of undefined behavior you see in the C standard in cases where all supported platforms disagree.)

So, my biggest concern is non-Wasm platforms. I think it would really be a huge blow to working with floats in Rust to have effectively zero guarantees around NaN. I don't really know a good solution here, but even just marking it as an LLVM bug on the problematic platforms (rather than deciding that this isn't a thing that Rust code gets to rely on ever) would be much better.

Just as an example, if the NaN payload is totally unspecified and may change at any point, implementing any ordering stronger than PartialEq for floats is impossible (including #72599), as you cannot count on NaN bitwise values to be stable across two calls of to_bits() on the same float.

The same goes for things that stash an f32 in a u32 and then expect to get it out again unchanged (for example, I implemented an AtomicF32 at one point on top of AtomicU32 + from_bits/to_bits). If I can't rely on stable bit values from float => u32, though, things like compare_exchange loops are not guaranteed to ever terminate.

That said, I also think "totally unspecified behavior" is too conservative on Wasm too. I've done a bit of poking and it seems like the behavior is a lot more sane than suggested, although it does violate IEEE754 and is probably not 100% intentional.

Basically: LLVM's behavior here is inherited from the wasm/js runtime, which canonicalizes NaNs whenever going from bits => float, as it wants to be able to guarantee certain things about which bit patterns are possibly in the float (certain NaNs are off limits). That means:
This is non-ideal but is still way easier to reason about and build on top of than arbitrary unspecified behavior.

Yeah, that's the basic gist of my thoughts. Changing the documented guarantees of from_bits/to_bits globally like that would totally neuter those APIs. I'm sympathetic to the position you're in and not having great choices, but that kind of change feels like very much the wrong call, and making the call be this kind of unspecified behavior feels really bad on any platform...

P.S. I accidentally posted an incomplete version of this comment by hitting ctrl+enter in the GitHub text box, sorry if you saw that; I really should just write these in a text editor first. |
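To make the `AtomicF32` concern above concrete, here is a minimal sketch of such a type built on `AtomicU32` plus `to_bits`/`from_bits`. The type name and the `fetch_add` method are hypothetical illustrations, not an existing standard library API. The loop keeps the expected value as raw bits, so it only relies on the integer comparison inside `compare_exchange_weak`; a variant that re-derived the expected bits from a float on every iteration would depend on the bit-stability guarantee under discussion.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical `AtomicF32`: an `f32` stored as its bit pattern in an `AtomicU32`.
pub struct AtomicF32 {
    bits: AtomicU32,
}

impl AtomicF32 {
    pub fn new(value: f32) -> Self {
        Self { bits: AtomicU32::new(value.to_bits()) }
    }

    pub fn load(&self, order: Ordering) -> f32 {
        f32::from_bits(self.bits.load(order))
    }

    /// Atomically adds `delta` and returns the previous value.
    pub fn fetch_add(&self, delta: f32, order: Ordering) -> f32 {
        // Track the expected value as raw bits, never as a float, so the
        // comparison does not depend on a float => bits conversion.
        let mut current = self.bits.load(Ordering::Relaxed);
        loop {
            let new = (f32::from_bits(current) + delta).to_bits();
            match self
                .bits
                .compare_exchange_weak(current, new, order, Ordering::Relaxed)
            {
                Ok(previous) => return f32::from_bits(previous),
                // Another thread changed the value (or the weak CAS failed
                // spuriously): retry with the bits that were actually observed.
                Err(observed) => current = observed,
            }
        }
    }
}

fn main() {
    let a = AtomicF32::new(1.5);
    let old = a.fetch_add(2.0, Ordering::SeqCst);
    println!("{} -> {}", old, a.load(Ordering::SeqCst));
}
```

Storing the bits rather than the float is the design choice that keeps the retry loop well-defined even if NaN payloads are not preserved by arithmetic.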
I am open to better suggestions. I know hardly anything about floating point semantics, so "totally unspecified" is an easy and obviously "correct" choice for me to reach for. If someone with more in-depth knowledge can produce a spec that is consistent with LLVM behavior, I am sure this can be improved upon.

However, the core spec of Rust must be platform-independent, so unless we consider this a platform bug (which I think is what we do with the x87-induced issues on i686), whatever the spec is has to encompass all platforms. In principle, certain platforms can decide to guarantee more than others, but that is a dangerous game as it risks code inadvertently becoming non-portable in the worst possible way -- usually "non-portable" means "fails to build on other platforms", now it would silently change behavior. Maybe we can handle this in a way similar to endianness, although the situation feels different.

And all of this is assuming that we can get LLVM to commit to preserving NaN payloads on these platforms. You are saying that this issue only affects wasm(-like) targets, but is there a document where LLVM otherwise makes stronger guarantees? The fact that issues have only been observed on these platforms does not help; we need an explicit statement by LLVM to establish and maintain this guarantee in the future.
So if I understand correctly, on wasm, the float => bit cast that is inherent in such a total order would canonicalize NaNs. This on its own is not a problem as this is a stable canonicalization, and that's why you think "unstable NaNs" are too broad. Is that accurate? However, when you combine that with LLVM optimizing away "bit => float => bit" roundtrips (does it do that?), then this already brings us into an unstable situation. Some of the comparisons might have that optimization applied to them, and others not, so suddenly the same float (obtained via a bit => float cast) can compare in two different ways. It is easy to make a target language spec such as wasm self-consistent, but to do the same on a heavily optimized IR like LLVM's or surface language like Rust is much harder. |
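The dependence of a total order on the float => bit cast can be seen in a short sketch. The function names below are hypothetical; the bit trick is the usual way to obtain an IEEE 754 totalOrder-style comparison from the raw bits, and is offered as an illustration rather than as the exact implementation of `f32::total_cmp`. Because the key is computed from `to_bits()`, the position of a NaN in the order is only as stable as its bit pattern.

```rust
use std::cmp::Ordering;

/// Maps an `f32` to an `i32` key whose integer order follows the
/// IEEE 754 totalOrder predicate (hypothetical helper for illustration).
fn total_order_key(value: f32) -> i32 {
    let bits = value.to_bits() as i32;
    // For negative floats (sign bit set), flip every bit except the sign bit
    // so that larger magnitudes compare as smaller keys.
    bits ^ ((((bits >> 31) as u32) >> 1) as i32)
}

/// Total comparison built purely on the bit patterns of `a` and `b`.
fn total_cmp_sketch(a: f32, b: f32) -> Ordering {
    total_order_key(a).cmp(&total_order_key(b))
}

fn main() {
    let mut values = [1.0_f32, -0.0, 0.0, f32::NEG_INFINITY, -2.5, f32::INFINITY];
    values.sort_by(|a, b| total_cmp_sketch(*a, *b));
    println!("{:?}", values);
}
```

If two evaluations of `to_bits()` on the same NaN could return different bits, two calls to `total_cmp_sketch` with the same arguments could disagree, which is the instability described above.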
No, my point with that paragraph was not that the LLVM behavior is bad (although I am not a fan), but that changing Rust's guarantees to: "the bitwise value of a NaN is unspecified and may change at any point during program execution" is both
* (always... except for what I say in my next response)
I don't know if it does it on Wasm, but it's obviously free to do this on non-Wasm platforms (and I think I've seen it there, but it's hard to say and I don't have code I'm thinking of on hand). I'd hope it wouldn't do this on Wasm, and would argue that if it does optimize that away it's an LLVM bug for that platform, but... yeah. Possible.
Honestly that seems like the sanest decision to me, since the alternative is essentially saying that Rust code can't expect IEEE754-compliant floats anymore. And so, I think x87 is a good example because it's also an example of non-IEEE754 compliance, although probably a less annoying one in practice. Concretely, I wouldn't have complained about this at all if it were listed as a platform bug. Instead, my issue is entirely with all compliant Rust code losing the ability to reason about float binary layout, which has been extremely useful in stuff like scientific computing, game development, programming language runtimes, math libraries, ... All things Rust is well suited to do, by design. This wouldn't cripple those by any means, but it would make things worse for several of them. Admittedly, in practice, unless it's flat out UB, I suspect people will just code to their target and not to the spec, which isn't great either, but honestly to me it feels like it might be better than Rust genuinely inheriting this limitation from the web platform. (Ironically, this would also prevent writing a runtime in Rust that does the optimization which is the reason Wasm and JS runtimes want to canonicalize their NaNs. Although that optimization was already fairly unportable anyway.) |
Oh I see... but that is not observable until you cast back? Or does wasm permit transmutation, like writing a float into memory and reading it back as an int without doing an explicit cast? (IIRC their memory is int-only so you'd have to cast before writing, but I might misremember.)
Whether it can do that or not depends solely on the semantics of LLVM IR, which (as far as I know) are not affected by whether you are compiling to Wasm or not. That is the entire point of having a single uniform IR. There is no good way to make optimizations in a highly optimized language like Rust or LLVM IR depend on target behavior -- given how they interact with all the other optimizations, that is basically guaranteed to introduce contradicting assumptions. Also, I don't think there is much point in discussing what we wish LLVM would do. We first need to figure out what it is doing.
Ah, but this is getting to the heart of the problem -- what if you implement a wasm runtime in Rust which uses this optimization, and compile that to wasm? Clearly that cannot work as the host wasm is already "using those bits". So, it is fundamentally impossible to have a semantics that achieves all of
I do feel like it is slightly exaggerated to say that all these use cases rely on stable NaN payloads. That said, there seems to be a fundamental conflict here between having a good cross-platform story (consistent semantics everywhere) and supporting low-level floating point manipulation. FP behavior is just not consistent enough across platforms. |
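For readers unfamiliar with the "NaN optimization" that runtimes use, the following is a minimal NaN-boxing sketch. The type name, the tag layout, and the canonical NaN constant are all assumptions made for illustration and do not describe any particular runtime. The value is stored as a `u64` so the tagged bits never pass through a floating-point operation, and genuine float NaNs are canonicalized on entry, which is the reason such runtimes want certain NaN bit patterns to be off limits.

```rust
/// Hypothetical NaN-boxed value: either an `f64` or an `i32` hidden in the
/// payload of a quiet NaN. Stored as `u64`, never as `f64`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Boxed(u64);

// Illustrative layout: sign bit, all exponent bits, the quiet bit and one
// extra tag bit mark "this is an integer, not a float".
const INT_TAG: u64 = 0xFFFC_0000_0000_0000;
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000;
// Single quiet NaN pattern used for every genuine float NaN.
const CANONICAL_NAN: u64 = 0x7FF8_0000_0000_0000;

impl Boxed {
    fn from_f64(value: f64) -> Self {
        // Canonicalize so a real NaN can never collide with the tag space.
        if value.is_nan() {
            Boxed(CANONICAL_NAN)
        } else {
            Boxed(value.to_bits())
        }
    }

    fn from_i32(value: i32) -> Self {
        Boxed(INT_TAG | (value as u32 as u64))
    }

    fn as_i32(self) -> Option<i32> {
        if (self.0 & TAG_MASK) == INT_TAG {
            // The low 32 bits hold the integer.
            Some(self.0 as u32 as i32)
        } else {
            None
        }
    }

    fn as_f64(self) -> Option<f64> {
        if self.as_i32().is_some() {
            None
        } else {
            Some(f64::from_bits(self.0))
        }
    }
}

fn main() {
    println!("{:?}", Boxed::from_i32(-42).as_i32());
    println!("{:?}", Boxed::from_f64(1.5).as_f64());
}
```

A host that canonicalizes NaNs itself would destroy the tag if the boxed value were ever held in an `f64`, which is the conflict described in the comments above.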
However, note that not just wasm has strange NaN behavior. We also have some bugs affecting x86_64: #55131, #69532. Both (I think) stem from the LLVM constant propagator (in one case its port to Rust) producing different NaN payloads than real CPUs. This means that if we guarantee stable NaN payloads in x86_64, we have to stop const-propagating unless all CPUs have consistent NaN payload (and then the const propagator needs to be fixed to match that). So until LLVM commits to preserving NaN payloads on some targets, there is little we can do. It seems people already rely on that when compiling wasm runtimes in LLVM that use the NaN optimization, so maybe it would not be too hard to convince LLVM to commit to that? |
This isn't really right tho is it? LLVM-IR includes tons of platform specific information. The fact that making LLVM-IR cross platform was non-viable was part of the motivation behind Wasm's current design even. From the other issue:
This would be totally fine with me FWIW — as soon as you do arithmetic on NaN all portability is out the window in practice and in theory. My concern is largely with stuff like:
So, while I just gave you two examples of very much non-portable code...
My big concern still comes back to the notion that these payloads are "unspecified values which may change at any time" according to Rust. The way I interpret that, and the general feeling of this conversation, means that there's no guarantee that target-specific things like these are even guaranteed to work reliably on the target in question.
That's why I said "This wouldn't cripple those by any means", although honestly the SIMD stuff would be pretty bad if it were actually broken. I also fully expect those cases to blindly continue doing things to NaN non-portably (and possibly non-deterministically).
This is surprising, because I thought it was the whole point of LLVM's APFloat code (which even goes as far as to support like the horrible PowerPC long double type...). That said, it's not like I can argue with facts, if those bugs are happening, then they're happening... But are we sure those aren't just normal bugs in LLVM? That said the only reason I wouldn't be willing to say "I don't care that much about what happens to NaN during const prop" is that you can't know when LLVM will happen to see enough to do more const prop. That said, it seems totally unreasonable and very fragile to me to rely on things like:
That stuff is totally nonportable (IEEE754 recommends but doesn't require any of it) and unreliable both at compile time and at runtime. Again, my concern is more unexpected fallout here in stuff that expects NaN to go through smoothly.

Just took a peek at https://webassembly.github.io/spec/core/exec/numerics.html (and elsewhere in the spec) and regret not doing so sooner. In particular, there's a lot of mention of when canonicalization can happen, but none of the places are on load/reinterpret. And so what's in there is pretty close to the suggestion you had earlier (the "less drastic alternative")... and what I suggested as the things that are totally nonportable. And it also definitely contradicts what I said before about when canonicalization happens (which mirrored what happened in ASM.js, what I seemed to see in my testing earlier, and would have explained
Which would also (maybe?) explain why the bugs happen on all platforms, maybe?

... Ugh, this is still a bit jumbled, sorry; some of this needs to be unified and reordered, and it needs more digging into what the deal with the discrepancy is, but I have to run, unfortunately. |
It makes many platform-specific things such as pointer sizes etc explicit. But that is very different from an implicit change in behavior. Your proposal would basically require many optimizations to have code like
And this would affect many optimizations as it makes floating-point operations and/or casts non-deterministic, which is a side-effect! So everything that treats them as pure operations needs to be adjusted.
There's like 5 other issues, which one do you mean?^^ You are quoting this comment I think.
(This was for making FP operations pick arbitrary NaNs.)
then you are no longer allowed to "inline" the definition of However, maybe we can make it deterministic but unspecified? As in, after each floating-point operation, if the result is NaN, something unspecified happens with the NaN bits, but given the same inputs there will definitely always be the same output? The main issue with this is that it means that const-prop must exactly reproduce those NaN patterns (or refuse to const-prop if the result is a NaN).
So is it the case that all that code would be okay with FP operations clobbering NaN bits?
Rust will probably just do whatever LLVM does, once they make up their mind and commit to a fixed and precise semantics. I think you are barking up the wrong tree here, I don't like unspecified values any more than you do. ;) I am just trying to come up with a consistent way to describe LLVM's behavior. I'm a theoretical PL researcher, so that's something I have experience with that I am happy to lend here -- define a semantics that is consistent with optimizations and compilation to lower-level targets. However, not knowing much about floating-point makes this harder for me than it is for other topics. So I am relying on people like you to gather up the constraints to make sure the resulting semantics is not just consistent with LLVM but also useful. ;) It might turn out that that's impossible, in which case we can hopefully convince LLVM to change.
They might well be bugs! Since you seem to know a lot about floating-point, it would be great if you could help figure that out. :)
Right, that's exactly the point -- const-prop must not change what the program does. So either it must produce the exact same results as hardware, or else we have to say that the involved operation is non-deterministic.
So what is the executive summary? A quick glance shows that these operations are definitely non-deterministic. So scratch all I said about this above, this basically forces LLVM to never ever duplicate floating-point instructions. Any proposals for (a) figuring out if they are doing this right and (b) documenting this in the LLVM LangRef to make sure they are aware of the problem? |
@ecstatic-morse you listed #73288 in the original issue here, but isn't that a different problem? Namely, this issue here is about NaN bits in general, whereas #73288 is specific to i686 and thus seems more related to #72327. (I don't think we have a meta-issue for "x87 floating point problems", but maybe we should.) |
#72327 affects only i586 targets (x86 without SSE2). This is a tier 2 platform, and the last x86 processor without SSE2 left the plant about 20 years ago, so I would have no problem exempting it from whatever guarantees around NaN payloads we wish to make. However, #73288 affects i686 (the latest 32-bit x86 target) as well, which is tier 1. Obviously, we could (and maybe should) exempt all 32-bit x86 targets from the NaN payload guarantees, but I consider #73288 to be of greater importance than issues only affecting i586. As an aside, I will note that "Unless we are prepared to guarantee more" was doing a lot of work in the OP. I'd be very happy if we came up with a stricter set of semantics that we can support across tier 1 platforms (possibly exempting 32-bit x86) and implemented them. However, doing so will require a non-trivial amount of work, much of it on the LLVM side. I think that, in the meantime, we should explicitly state where we currently fall short in the documentation of affected APIs, similar to #10184. That's what this issue is about. |
Also, look out for my latest crate, |
The only way this is not a bug is if evaluation is non-deterministic. Rust has the same evaluation rules for compile-time and run-time. Otherwise there'd be two Rust languages and we'd have a horrible mess...
Of course, the actual implementation is never non-deterministic. But the specification of Rust has to be non-deterministic here, or we have to change either compile-time or run-time behavior. |
IMO it is a bug.
I mean, it's really easy for me to argue that changing the compile-time behavior is right. Unfortunately, that's difficult because it requires changing how APFloat works in LLVM, and it's not a trivial change either. That said, IMO the solution to hard, low-impact bugs shouldn't be to rework the language so that they're not bugs. Eventually they should be fixed, even if it's not a high priority. Additionally, a different Rust compiler probably wouldn't have the same difficulty here. |
Also, that's not even true. The original code sample in that issue shows two different behaviors at runtime:

    use std::ops::Mul;

    fn main() {
        assert_eq!(1.0f64.copysign(f64::NEG_INFINITY.mul(0.0)), -1.0f64);
        assert_eq!(1.0f64.copysign(f64::NEG_INFINITY * 0.0), -1.0f64);
    }
I've been meaning to say this, but the reproducibility rules are probably a bit of a red herring. They're only really meant to apply to programs that opt into a subset of floating point semantics.
I believe this is due to one of these being impacted by LLVM's constant propagation and the other not. |
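One way to probe the constant-propagation explanation is to compute the same product twice, once with operands the optimizer can see and once with operands hidden behind `std::hint::black_box`. This is only a sketch: the function names are hypothetical, `black_box` is a best-effort hint rather than a guarantee, and whether the two bit patterns differ depends on the target, the optimization level, and the compiler version, so no particular output is claimed.

```rust
use std::hint::black_box;

/// Operands are visible to the optimizer, so this may be folded at compile time.
fn folded() -> f64 {
    f64::NEG_INFINITY * 0.0
}

/// Operands are hidden behind `black_box`, so the multiplication is more
/// likely to be performed by the hardware at run time.
fn at_runtime() -> f64 {
    black_box(f64::NEG_INFINITY) * black_box(0.0)
}

fn main() {
    let a = folded().to_bits();
    let b = at_runtime().to_bits();
    // Both results are NaN; only the sign and payload bits may differ.
    println!("folded : {:#018x}", a);
    println!("runtime: {:#018x}", b);
    println!("same bits: {}", a == b);
}
```

Comparing the output of a debug build with that of a release build is a simple way to see whether folding affects the sign bit on a given target.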
Sure. But that doesn't change the fact that this is runtime code. And to my knowledge, LLVM doesn't consider this optimization a bug, since the result produced by LLVM is legal according to the IEEE floating-point spec. There isn't even an LLVM bugreport for the
It is my understanding that some aspects of the bitwise results of floating-point operations (in particular for NaNs) are inherently not defined in the LLVM IR semantics (or in the IEEE semantics, which LLVM [mostly?] follows). This is not a bug, it is part of their spec. So if we want to use LLVM as the backend, we have no choice but to also incorporate a similar kind of non-determinism into the Rust semantics (or lobby for LLVM to change their spec). This is not reworking the language, it is properly understanding the consequences of what it means to say that Rust uses IEEE floating-point semantics. I agree that it would be nice to have deterministic floating-point operations, but that's just not realistic when LLVM (and WebAssembly) made a different choice. |
Put differently: a bug usually means that something is not working according to spec. I don't see that happen here (but I keep getting lost in the details of FP semantics). My understanding is that this issue is about better documenting the Rust spec, not about changing the behavior of
One could argue that the spec has a bug due to being too liberal, but given that the spec we are talking about here is the LLVM IR spec and by extension the IEEE FP spec, that does not seem like a particularly useful or constructive approach. (Specs can certainly have bugs when they fail to be self-consistent or when they do not adequately reflect intended behavior, but that does not seem to be the case here.) |
I do not believe that lobbying LLVM for hardware-respecting behavior is that unlikely to succeed. It may make some proofs regarding optimizations easier, for one. |
I don't see how that would be the case.
Fair. But this is the wrong forum to do so. ;) |
What about refusing to constant-evaluate any operation that is non-reproducible? |
By const-evaluate I assume you mean constant propagation / constant folding, i.e., the optimization pass that tries to avoid redundant computations at runtime? That is distinct from CTFE (compile-time function evaluation, also sometimes called const evaluation), which is about computations that the spec says happen at compile-time (such as the initial values of a We could do that in rustc, but can we convince LLVM to stop folding |
File a bug against LLVM? I don’t know 🙂 |
…shtriplett

Improve floating point documentation

This is my attempt to improve/solve rust-lang#95468 and rust-lang#73328.

Added/refined explanations:
- Refine the "NaN as a special value" top level explanation of f32
- Refine `const NAN` docstring: add an explanation about there being multitude of NaN bitpatterns and disclaimer about the portability/stability guarantees.
- Refine `fn is_sign_positive` and `fn is_sign_negative` docstrings: add disclaimer about the sign bit of NaNs.
- Refine `fn min` and `fn max` docstrings: explain the semantics and their relationship to the standard and libm better.
- Refine `fn trunc` docstrings: explain the semantics slightly more.
- Refine `fn powi` docstrings: add disclaimer that the rounding behaviour might be different from `powf`.
- Refine `fn copysign` docstrings: add disclaimer about payloads of NaNs.
- Refine `minimum` and `maximum`: add disclaimer that "propagating NaN" doesn't mean that propagating the NaN bit patterns is guaranteed.
- Refine `max` and `min` docstrings: add "ignoring NaN" to bring the one-row explanation to parity with `minimum` and `maximum`.

Cosmetic changes:
- Reword `NaN` and `NAN` as plain "NaN", unless they refer to the specific `const NAN`.
- Reword "a number" to `self` in function docstrings to clarify.
- Remove "Returns NAN if the number is NAN" from `abs`, as this is told to be the default behavior in the top explanation.
I have written a Pre-RFC on our floating-point guarantees, which is almost exclusively about NaNs. That document describes what are currently the best possible guarantees we can provide, given LLVM's documentation. However, LLVM also seems to be open to providing stronger guarantees. |
To be pedantic, the Vortex86DX3 is still being made and only supports SSE.
Edit: #35045 (comment) mentioned in 2016 that he's using a VortexX86 |
I'm more concerned about someone using |
The RFC rust-lang/rfcs#3514 makes a concrete proposal for our guarantees for the bits of NaNs. |
I think this was resolved by #129559. |
NaNs can behave in surprising ways. On top of that, a very common target is inherently buggy in more than one way. But on all other targets we actually follow fairly clear, if improperly documented, rules. See here for the current status.
Original issue
Several issues have been filed about surprising behavior of NaNs.
- `0.0 / 0.0` changed depending on whether the right-hand side came from a function argument or a literal.
- `f32::from_bits(x).to_bits()` was not always equal to `x`.

The root cause of these issues is that LLVM does not guarantee that NaN payload bits are preserved. Empirically, this applies to the signaling/quiet bit as well as (surprisingly) the sign bit. At least one LLVM developer seems open to changing this, although doing so may not be easy.

Unless we are prepared to guarantee more, we should do a better job of documenting that, besides having all 1s in the exponent and a non-zero significand, the bitwise value of a NaN is unspecified and may change at any point during program execution. In particular, the `from_bits` method on `f32` and `f64` types currently states:

and
These statements are misleading and should be changed.
We may also want to add documentation to `{f32,f64}::NAN` to this effect, see #52897 (comment).

cc #10186?
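As an illustration of the three properties named in the title (sign, quietness, payload), here is a small sketch that decodes them from an `f32` bit pattern. The struct and function names are hypothetical, and the code assumes the common convention in which the most significant significand bit is the quiet bit. It works on raw `u32` bits only, so it shows where the unspecified bits live without relying on any float operation preserving them.

```rust
/// Sign, quiet bit, and remaining payload of an `f32` NaN bit pattern
/// (hypothetical helper type for illustration).
#[derive(Debug, PartialEq)]
struct NanParts {
    negative: bool,
    quiet: bool,
    payload: u32,
}

/// Returns `None` unless `bits` encodes a NaN, i.e. all 1s in the exponent
/// and a non-zero significand.
fn nan_parts(bits: u32) -> Option<NanParts> {
    const EXPONENT: u32 = 0x7F80_0000;
    const SIGNIFICAND: u32 = 0x007F_FFFF;
    // Assumed convention: the top significand bit distinguishes quiet NaNs.
    const QUIET_BIT: u32 = 0x0040_0000;

    let significand = bits & SIGNIFICAND;
    if (bits & EXPONENT) != EXPONENT || significand == 0 {
        return None;
    }
    Some(NanParts {
        negative: (bits >> 31) == 1,
        quiet: (significand & QUIET_BIT) != 0,
        payload: significand & !QUIET_BIT,
    })
}

fn main() {
    // The exact bits of `f32::NAN` and of `0.0 / 0.0` are what this issue is about.
    println!("{:?}", nan_parts(f32::NAN.to_bits()));
    println!("{:?}", nan_parts((0.0_f32 / 0.0).to_bits()));
}
```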