Stabilize having the concept of "validity invariant" and "safety invariant"? Under which name? #539
I don't like these. Many library invariants are not safety invariants (e.g., if you have a non-deterministic …). So there's at least: …
I think that "safety invariant" is a good name, but "validity invariant" is not a particularly good name for "if you violate this, demons will pop out of your nose". Tho maybe "language invariant" is a good name (so language invariant/safety invariant).
I think any two names that do not share the first letter are better than two names that do share the same first letter. So I vote for validity/safety over lang/library. But also any two other words of differing first letter would be fine too: lang and safety, for example.
Well, you've made your own choices here by calling these "library invariants" and "language invariants"; choices I would not agree with. ;)
So do you propose we also call it language UB and safety UB? That doesn't make a lot of sense, and IMO it'd be good to align those terms with the invariants that cause the respective UB when violated. I don't recall us ever abbreviating validity/safety invariant with the first letters, so I don't think it's very important that they have different first letters.
But there isn't actually safety UB, is there? All safety invariants are there because when violated they can allow otherwise fine code to end up breaking a language rule somewhere later. As far as I'm aware, there's just the one kind of UB.
Library UB is real and not the same as language UB. For instance, it is library UB to call any …
(From the OP) The UB you are referring to is also called "language UB", and "library UB" is used in contrast to mean basically "a violation of library invariants/preconditions/etc. that could lead to (language) UB depending on usage/library implementation details/etc.", which is useful to have a succinct term for.
I would call them "language UB" (or just UB, as I think there is only one thing that should get that name) and "safety violation" respectively.
Yeah, see, I'm in agreement with @digama0. If you have a `String` holding invalid bytes, I don't think we should call that state of having invalid bytes "library UB". I don't think we should call any other similar situations "library UB" either. I do like the term "safety violation". Otherwise you have to start explaining to people that sometimes there's something called UB that they're allowed to do anyway, as long as they fix the situation before anything "really bad" happens. And that's a very easy thing for people to misunderstand, or only halfway remember months after the fact.
Violating such library preconditions is often called UB, and I think it makes sense. UB basically means "you violated the contract of this API and now all guarantees are void". Language UB is about the contract between the programmer and the compiler (where the "API" is the language itself); library UB is about the contract between the library author and the client author.
Right, except I'm telling you that what people often end up hearing is, "I'm allowed to do some types of UB". I think that's bad, and if we're going to deliberately pick terms, we should deliberately pick terms that don't let people think that "sometimes UB is okay". Because that's a real thing people have said to me, and I've had to stop and explain to them what's "actually UB", and that they really can't do it, not even sneakily. So call them lang invariants and lib invariants if you want, but I strongly suggest that you don't call them that if you want to draw a connection to "lang UB" and "lib UB", because those are not good terms; I have seen them do harm to people's understanding of Rust on multiple occasions over the years.
You're not allowed to do library UB with other people's libraries either. I have seen this analogy work pretty well, too, so I wouldn't be so quick to discard it.
I know it does work for some people, but if it works for some people and not for others, then I don't think it entirely works. Of course you should not do any library UB, but the fact that you can actually do it, without complaints from Miri (e.g. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=919353ecbc1b830d3f8daa6f3180888e), gives some people wrong impressions about what terms even mean. I don't think we should design our terms just for experts; we should also design them for the uninformed and forgetful. There are a lot more uninformed and forgetful programmers than expert ones.
I agree with @Lokathor that people do library UB, and the reason is what I call "robust functions" in the unsafe mental model and what you call "super safe functions". As a library author you can document that you accept more values than what the safety invariant says. Typical examples are non-UTF-8 `String`s.
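To make that example concrete, here is a minimal sketch (mine, not from the thread) of a safety-invariant violation that Miri accepts, because no language UB has occurred yet:

```rust
fn main() {
    // Library UB: `String`'s safety invariant requires valid UTF-8, but
    // merely holding these bytes is not language UB, so Miri stays silent.
    let s = unsafe { String::from_utf8_unchecked(vec![0xFF]) };

    // A "robust" operation that never inspects the bytes as UTF-8
    // remains well-defined:
    assert_eq!(s.len(), 1);

    // But safe code may assume the invariant: e.g. `s.chars()` could
    // produce an invalid `char`, which would be language UB.
}
```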
@Lokathor Nothing works for all people, it's a matter of tradeoffs. 🤷 I have registered your opinion, let's see what other people say. @ia0 Library UB is a violation of the contract between the library author and the client. If the library author documents that it's okay to call some function with values that violate the safety invariant, then doing so is not library UB.
I think we all agree that if we use the term UB, it should be for situations that always indicate incorrect code, i.e. it's never OK for a program to exhibit something we have decided to call UB. It seems the argument is more over whether violating library invariants should always indicate a problem with the program, or whether there are some situations where violating library invariants is acceptable; if there are such cases then we shouldn't call it UB. I tend to agree with @RalfJung that it should be considered UB, as long as we: …
There are still some problems though: …
Ah, OK, my bad. I thought the analogy between "validity invariant" and "language UB" on one side, and "safety invariant" and "library UB" on the other, was stronger. I fear this will bring additional confusion on top of the confusion @Lokathor is mentioning.
This is not a real distinction.
The parallel between language UB and library UB is actually quite strong.
I fear this is a separate discussion (so probably better to open a thread on Zulip). You're always allowed to look at the implementation of a crate. It's just that if you rely on it for soundness (or correctness, if you care about that too), then you need to pin the exact version you audited (e.g. with a `=` version requirement in `Cargo.toml`).
If you do this, you're breaking the contract of the compiler. If you have language UB, your program is undefined (literally, you can't say what it does) and there's no guarantee on the output of the compiler (it may not terminate, it may crash, it may produce a buggy program, or even a program that seems to "work" until an attacker figures out it doesn't).
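As a contrast with the library UB sketch above, here is a minimal sketch (my example) of language UB, after which the Abstract Machine gives no guarantees at all:

```rust
fn main() {
    let b = Box::new(42);
    let p: *const i32 = &*b;
    drop(b);
    // Use-after-free: language UB. The compiled program may print 42,
    // print garbage, crash, or misbehave arbitrarily; Miri rejects it.
    let v = unsafe { *p };
    println!("{v}");
}
```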
Am I also allowed to look at the implementation of the compiler? What about libstd?
And if you break a library invariant you're breaking the contract of the library. Literally you can't say what the library will do. And there's no guarantee on the output of the library. Both the language and the library provide a contract with the user. There's an exact parallel between them. If I pin the version of the compiler and exploit its implementation details, I can be confident in what it will output. I won't be writing "valid" Rust code anymore because I broke the contract of the language. In the same way, if I pin a version of a library and exploit its implementation details, I can be confident in what it will output. I won't be a "valid" user of that library anymore because I broke the contract of the library.
You are allowed to look at both, and theoretically could pin the (rustc, cg backend, llvm/cranelift/gcc) triple and do whatever the heck you want (including language UB) if you know what the result will be (and that result is desirable), yes. But as you alluded to, you would no longer be a valid user (of either Rust as a language or rustc as a compiler).
But you never were. Your initial question was about when the same user writes 2 crates: … You're not writing the compiler, so that's a different situation. To push the example further, the standard library authors are somehow the same as those of the compiler (OK, not really, but close enough; in particular they have an out-of-band contract), which is why the standard library is allowed to use language UB that normal Rust users can't. (This is probably only tolerated because the exact rules for language UB are not stable yet.)
Actually I think std doesn't use UB (or at least we try not to); std fixes some underspecified language nondeterminism like struct layout (which is not UB if you "guess it right"). (Cue Jubilee telling me about unspeakable horrors in std or compiler-builtins.)
It doesn't, though it 100% could from the perspective of an implementor. It's called implementor's privilege. Whether or not it should from a design perspective of course is a separate thing. (Also, nit: struct layout isn't really underspecified, it's unspecified, intentionally.)
No, that was one example of many. My point is only that there is a direct parallel between library UB and language UB. The parallel for … The parallel between … The parallel between … Rust does not support this last use-case. The author of … There is no qualitative difference between these two types of UB, only that it is more common to rely on library implementation details than it is to rely on language implementation details.
(Also, there's a fourth case: …)
The difference here is that … I think one way of putting it is that undefined behavior is always considered to be a program error, and … Someone other than …
Actually, I found a world where library UB is the same as language UB. In this world, documentation is part of the program and participates in its semantics. Compilers are thus allowed to look at it and optimize based on their precondition that the program does not have library UB. The only difference between library UB and language UB at this point is whether the semantics was written by the language or the library, reflecting exactly the idea behind those names.
And I found a point of view in the current world that would make (rigid) library UB the same as language UB, by moving the distinction between library UB and language UB to malleable library UB and rigid library UB: …
By defining library UB as only rigid library UB (so malleable library UB is not library UB), we have equivalence between library UB and language UB.
I'm not sure if this has been addressed, but the term "language UB" seems to imply that UB caused by other languages would not be (Rust) language UB. For instance, if you use FFI to call into some C++ code, which then hits UB, is that language UB from the Rust point of view? Any library that exposes this FFI-based code would have to document it such that it'd be library UB to hit that, so it would definitely be library UB at least.
UB through FFI is UB (aka language UB): https://doc.rust-lang.org/reference/behavior-considered-undefined.html (search for the first occurrence of FFI). It's a good point that this could be confusing, but I'm not sure it's confusing enough in practice to justify renaming language UB (which should really just be called UB).
I still don't see how this distinguishes between language and library. As people have said already, you can audit and pin the compiler, or have an out-of-band contract with the compiler. I'll add on that this is not theoretical, but a fairly common occurrence. See @RalfJung's list of crates deliberately using UB. Or this random example from the standard library. Or as a broader example, allocator implementations and how they interact with the aliasing model; every allocator is committing at least LLVM-level UB. Well, in practice it's less "pin the compiler" and more "hope upgrades don't break anything". Less "out-of-band contract" and more "vibes on the unsafe-code-guidelines repo". But empirically, that's been good enough: many of these deliberate uses of UB have been around for a long time without being broken by compiler upgrades. In that sense, language UB too can be "malleable". To be clear, having deliberate uses of language UB is an undesirable situation, and I greatly appreciate the work being done to find better alternatives. But you could say the same about deliberate uses of library UB.
Yes, I'm among them: #539 (comment).
Thanks for the link, you're fixing #539 (comment) and confirming what I'm saying.
I was supposed to have addressed this in #539 (comment) with the … My problem was that I was assuming "UB" to mean "language UB" in "library UB" (i.e. library language UB, or language UB at library level), while it's actually a different concept. So let me try to add 2 more definitions to the definitions of safety invariant, validity invariant, safety requirement, and soundness invariant I gave in #539 (comment): …
I believe those names are highly confusing and should change. They are more confusing than validity and safety invariant, which I don't find that problematic. My suggestions would be UB for language UB, like every other language, and safety violation for library UB (or something along those lines; no strong opinion except that it should not mention UB).
This actually hints at another distinction between UB and safety violations: formalism. We have 2 dimensions going from completely informal (0) to fully formal (1): X, how formal the language specification is, and Y, how formal library specifications are.
Although it doesn't matter for what follows, I'm going to assume that Y ≤ X and thus focus on the bottom right triangle instead of the whole square. My impression is that today we have X=0.5 (there is a concrete ambition to reach a formal specification of the language, and there are already progress artifacts like MiniRust, Miri, and subset languages in research) and Y=0.1 (English prose only, but trying to use a common vocabulary). In particular we have Y << X. The utopia world I described in #539 (comment) is at coordinates X=1 and Y=1. In this world, https://rust-lang.github.io/rust-project-goals/2024h2/Contracts-and-invariants.html is a thing and library authors can write attributes to specify their library. We have the same level of formalism for both UBs and safety violations. I believe that saying that UB and safety violations are the same may give the impression that Y = X, i.e. that they have the same level of formalism, which is currently not true.
To complete the naming suggestions (using the definitions I gave for the current names): …
I think that "library UB" is an OK name for what happens when calling a library function in a way that there are no requirements placed on the library's behavior [which means the library is allowed to invoke language UB] (for safe functions, this can only happen if you pass a parameter for which the type safety invariant does not hold - tho the converse does not hold - it is possible to document a function so that it has defined behavior even if there are some parameters for which the type safety invariant does not hold). [As opposed to the state where you merely have a value that violates some library's safety invariants, but are not passing that value to that library].
The relationship is that the safety invariant of a type is the default library invariant of functions taking that type as a parameter. If you have a type like …
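A sketch of a function that documents defined behavior beyond the type's safety invariant (hypothetical API and documentation, my example):

```rust
/// Returns the length of `s` in bytes.
///
/// # Robustness (hypothetical documentation)
///
/// Defined behavior even if `s` is not valid UTF-8, i.e. even if `str`'s
/// safety invariant is violated: the bytes are never inspected as UTF-8.
pub fn byte_len(s: &str) -> usize {
    s.as_bytes().len()
}

fn main() {
    assert_eq!(byte_len("héllo"), 6);
}
```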
Actually found a way to make language UB and library UB fully parallel as advertised by #539 (comment). One needs the following definitions: …
By this definition, language UB is equivalent to UB (undefined behavior) as before. But we can now give a proper meaning to "may" in "language UB may cause an AM fault" from #539 (comment), which is "language UB may cause unexpected behavior" (or equivalently, "undefined behavior may cause unexpected behavior"). Similarly, library UB may cause unexpected behavior too. But it may also cause undefined behavior, while language UB will always cause undefined behavior, because that's its definition. To sum up: …
So language UB and library UB are fully parallel only if we interpret UB as (possible) unexpected behavior instead of undefined behavior in their names. If we interpret it as undefined behavior (as I naively did before), then those are different concepts.
Now that things are clearer to me, and given how I polluted the thread, I feel responsible for summarizing the current status of definitions and names.

**Language UB and library UB.** The definitions are given by @CAD97 in #539 (comment). They are "violating a safety condition documented by the language (resp. library)". I believe there is agreement on those definitions. The names are still in discussion: …

**Validity invariant, safety invariant, and soundness invariant.** The definitions of validity and safety invariant are given by @RalfJung in https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html: …

Those definitions are only "at type T", regardless of where the type occurs in the program. I think a possibly missing definition is what I call anonymous types in the unsafe mental model and what I called soundness invariant in this thread. That's needed when proving soundness with a type system rather than a verification tool, although that's mostly a style concern. I give a definition in #539 (comment) along with the 2 other invariants: …

The names are still in discussion: …

I hope I didn't miss any of the naming discussions.
Oops, I should have noticed that.
True!
If you assume a specific implementation, then there are more layers below that. As you say, library UB sits on top of language UB, but below that there is MIR UB, then LLVM IR UB, then operating system UB and/or CPU architecture UB. At each layer, UB can result in either UB in the next layer, or defined-but-undesired behavior at that layer. On some older CPUs it might go all the way down to transistor-level UB.
Yeah, I don't think it's useful to distinguish between library UB and language UB. You can pin libraries that aren't the standard library to do on-paper UB things that are fine in practice, but you can also pin your compiler with the same objective. The library doesn't even need to upgrade the library UB to language UB to get UB-esque outcomes either; it can do its own arcane invocations to summon nasal demons. The reason it can be useful to distinguish between safety invariant and validity invariant is that a safety invariant can be reliably violated without consequence, provided you don't use the value with a broken invariant as that type (including dropping it). You can't violate a validity invariant, even transiently, without causing UB.
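A sketch of that asymmetry (my example; `Vec`'s exact invariants are as documented by std):

```rust
fn main() {
    let mut v: Vec<u8> = Vec::with_capacity(4);
    unsafe {
        // Transiently violates `Vec`'s safety invariant (and `set_len`'s
        // documented requirement, i.e. library UB): `len` now claims 4
        // initialized elements that don't exist. Not language UB by itself.
        v.set_len(4);
        // Restore the invariant before the value is used at its type.
        for i in 0..4 {
            v.as_mut_ptr().add(i).write(i as u8);
        }
    }
    assert_eq!(v, [0, 1, 2, 3]);

    // By contrast, even transiently producing `3u8` at type `bool`
    // violates the validity invariant and is immediate language UB:
    // let b: bool = unsafe { std::mem::transmute(3u8) };
}
```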
One of the benefits of the terms "library UB" / "language UB" is that lib UB is not lang UB. Lib UB may lead to lang UB, but they are disjoint concepts. I'm not particularly fond of "safety violation" as the term for communicating lib UB, as causing lang UB is also a safety violation. The split between "safety" and "soundness" is also relevant. The safety condition (whether invariant, precondition, or otherwise) is the documented condition of a type or operation that must be upheld by … So I think we can/should use all four terms, actually, at least when including your mental model documentation. The definitions I'd use: …

(Although note that, to be fully formally accurate, these definitions rely on the technically informal assumption that despite UB time traveling, it cannot time travel over an observable sequence point (side effect / IO) of the program. A compiler would likely need to violate actual causality to violate this property. This still preserves all other "time travel" effects of UB.) "Validity" isn't an ideal term because of how overloaded "validity" already is; e.g. the current stable std documentation talks about pointer validity. It's also important to note that "soundness" isn't a useful term without being clear about the scope: is it parametrized by a single execution or all possible executions; with a specific version of implementation details or all valid choices of implementation details? Does this phrasing match your expectations/needs, @ia0? The one thing I think you brought up that this framing doesn't acknowledge is the concept of "safety conditions" which are expected to be opened temporarily. But I think that this distinction doesn't come from the term used to classify the requirement, but from whether it is the requirement of a type or of a function; the type's invariants are not enforced (cannot be violated) while the type is at rest (untyped memory), but only upon execution of a function (or, for language invariants, a typed copy). That leaves space for e.g. robust functions which syntactically take … The one wrinkle is that the language potentially defines a "library invariant" in that the vtable part of a raw dyn pointer needs to match the expected pointee type for pointer upcasting, which is a safe operation. (Current leaning is, IIRC, that the vtable part not be required to be valid for access until a dynamic call is made.) More precisely, it would be a language-defined precondition on the operation, rather than an invariant attached to the type (thus all typed copies), but because the operation is safe, it's a safety condition relevant to safe code as well as `unsafe` code.
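A sketch of that wrinkle (my example; it assumes trait upcasting as stabilized in recent Rust, and whether the vtable must actually be valid here is exactly the open question):

```rust
trait Super {}
trait Sub: Super {}

// A safe operation on a raw pointer: nothing is dereferenced, yet the
// upcast needs the vtable metadata of `p` to be adequate for locating
// the `Super` vtable -- a condition the language imposes on a safe op.
fn upcast(p: *const dyn Sub) -> *const dyn Super {
    p // upcasting coercion: safe, no dereference
}

fn main() {
    struct S;
    impl Super for S {}
    impl Sub for S {}
    let s = S;
    let _q = upcast(&s as *const S as *const dyn Sub);
}
```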
It actually does not, for the simple reason that if such a … is never sent, nothing with undefined behaviour ever occurs. However, if you did …
I very often say "specification" to mean "soundness-related specification". I'm clarifying that upfront because it may be confusing if one understands "specification" to mean "correctness specification".
In that case, what about: …
The problem I have (and I believe @digama0 does too) is that UB is a language specification concept, not a program specification concept. Talking about "library UB" is mixing program specification and language specification. So we have to choose between: …
How can an API be executed? I don't think extending the notion of soundness to APIs or specifications is useful. You mention later that soundness is abused; I agree. You can see my reply there.
I'm assuming you meant "without having one of its safety conditions violated first". And I would tend to push back against this definition, because soundness should only be about UB, and here you extend the notion of soundness to library safety conditions, not just language safety conditions. That's the same issue as I presented above. I think this is the crux of the problem. Do we want to extend UB and soundness to something other than just the language semantics? I'm quite strongly opposed, as this would deviate from their usual meaning. The way I would formulate your sentence would be: an implementation is sound if it doesn't do any of the following without having its safety condition (as defined by its specification, aka its API) violated: …
I don't think this is a useful concept because it suggests that it is fine to violate library safety conditions. Also note that this is a different concept than what I mean by soundness invariant (more precisely soundness type occurrence invariant). Soundness invariant is simply the safety condition of a library encoded into the types occurring in the library API. This is just a matter of style. What matters is to understand that safety condition is a different concept than the safety invariants of the types occurring in the library API.
This assumption can be made formal. It simply follows from the fact that the program specification is not part of the program; in particular, the compiler doesn't know it, so it must assume that it can be arbitrarily specific. To take your example, if the program specification is that standard input should never contain a newline character (or whatever …)
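A sketch of that point (my reconstruction of the stdin example; the "no newline" specification is hypothetical):

```rust
use std::io::Read;

fn main() {
    let mut input = String::new();
    std::io::stdin().read_to_string(&mut input).unwrap();

    // Suppose the program's out-of-band specification says stdin never
    // contains '\n'. The compiler cannot see that spec, so it must still
    // compile this branch; reaching it on spec-violating input is not UB,
    // merely an execution outside the specification.
    if input.contains('\n') {
        eprintln!("input violated the program specification");
    }
}
```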
I agree with this, and it's indeed a good argument for trying to change that name. The thing is that it's very convenient to talk about valid and invalid values, which I do in the unsafe mental model. But as you pointed out, in practice we usually only talk about pointer reads or writes being valid, not really about a value being valid. I checked the documentation of …
I agree that soundness is a bit abused. Technically it is a property of a predicate over specified programs, such that programs satisfying this predicate don't have undefined behavior for any of their executions satisfying their specification. One way to define such a predicate is with a type system. But you can also use static analyzers. Such static tools are sound if the predicate they define has the soundness property. The completeness property is the opposite direction: if a program doesn't have undefined behavior for all its executions satisfying its specification, then the predicate should hold on this specified program. One way the word is extended is to talk about sound programs, meaning that they don't have undefined behavior for any of their executions satisfying their specification. Another is sound libraries, meaning that they implement their specification, don't have undefined behavior, and don't violate the specification of other libraries as long as their own specification is not violated. So to answer your 2 questions: …
I'm not sure what you mean by that. If it's about the example you gave at the end of the paragraph: …
Then this is not about a "safety condition" being opened/broken but a "safety invariant" being opened/broken. A robust function taking …
It's a matter of style, so some may find it more convoluted and others may find it more intuitive. That's fine. I also don't think it's a necessary step for more formal reasoning. This is just a matter of style, whether one prefers to prove a program sound through a sound type system (where properties are in the types), a sound verification tool (where properties are on the side), or a combination of both.
I think we basically agree at this point, except for, IIUC, this big mismatch in application of terms: …
AIUI: …
As a particular example: that the … I hold that my use of the terms (that an invariant is a condition) is the common understanding, and you're the odd one out in defining «safety condition» as a looser constraint than «safety invariant». Everything else at this point is us talking past each other due to that terminology mismatch. I think I need to let @RalfJung or someone else say their part now. But some specific points: …
This is very much an uphill battle, given how std docs claim any violation of its documented safety requirements to be UB, despite many only being library UB.
Then what does it mean that "…"?
If you include the implicit safety condition of "all of the safety invariants of the types I take hold" as one of its safety conditions, and not a safety condition of whatever library defined that type, then yes.
Because I think being sound in the face of … But in retrospect I should've also added the implicit second clause of "and it cannot violate a soundness condition without having another soundness condition violated first," which may have made the difference between the two clauses more apparent.
It is acceptable for a library to open/violate its own safety invariants, precisely because being within the scope of the same soundness proof enables it to separate safety requirements from soundness requirements. This merely puts the program state into an … It is not acceptable for code to violate its upstreams' safety conditions, because the library guarantee only extends to the safety conditions being sufficient for soundness. That soundness conditions may be weaker than the documented safety conditions is a private implementation detail that may change in an update. But distinguishing between library safety violations and language UB is still useful, because automated tooling can only ever diagnose the latter, unless assertions are manually added.
And I expect we still will. It will just have a contextual meaning of satisfying the relevant requirements, rather than specifically whether bytes are shallowly a valid instance of the syntactical language type (validity invariant). I.e. "accepts any valid …"
I was talking about implementation details that are encapsulated away from the analyzed code, thus "don't violate the specification of other libraries" means soundness is parametric over all encapsulated implementation details (or more precisely, it is based on its upstreams' soundness specification only, which must not be loosened to be sound in fewer cases by an update).
Exactly, so I'm putting the 2 main contention points below. I'm replying to your comment at the end of this comment.

**Contention points that need agreement**

**Agree on what UB should mean.** There are 3 competing definitions: …

How can we come to an agreement here? EDIT: adding names to those options so we can refer to them: …
**Agree that library UB is not about the safety invariant.** The safety condition of a library is the safety invariant of the types of its API, modified by its safety documentation. The soundness invariant is the name of the modified invariants. Each type occurrence may be modified differently, so the soundness invariant does not apply to a type but to a specific occurrence of a type (in the library API). Note that modifying the safety invariant is not necessarily a refinement (strengthening); it can also weaken the invariant, up to the validity invariant. This is why in the unsafe mental model I start from the validity invariant, such that only refinements are used.

**Replies to @CAD97**
I'm actually saying something else. The soundness invariant (which I'm fine to call safety condition, although I would prefer safety and robustness condition) is not comparable with the safety invariant. This is because it refines the validity invariant, not the safety invariant. There are 4 options: …
Yes, because I don't have the same definition as you for soundness requirement (I would prefer the term "soundness-related specification" or "specification for the purpose of soundness"). With your definition (looking at the implementation to come up with a specification, instead of looking at the specification) I agree on the subtyping relation (assuming we only care about the requirements part and not the guarantees part). So I think we agree here. And I'm saying that we shouldn't talk about your notion of soundness requirement, because it legitimizes library UB by saying: who cares about the safety condition the library author wrote, just read the implementation.
No, it's more: …
I agree with this statement, but I don't think it's a useful observation. This is the notion of malleability which I don't think is good to mention too much and in particular we shouldn't name it publicly. People should simply not violate contracts. If they do, they should come up with a story themselves. We shouldn't provide them the tools and make their life easier. We'll end up in a very bad ecosystem. Such behavior should be exceptional.
That's correct. I'm distinguishing safety condition from safety invariant (there's no real subtyping relation between them, it can go both ways when we consider robustness, which is necessary for composable proofs). I'm adding this to the list of contention points that need agreement below.
It's because this API also has a correctness specification. It's not enough to implement the type and safety documentation; you also care about the non-safety documentation. Otherwise the implementation that does nothing is trivially sound. I guess we could say an API cannot be sound if there is no sound implementation satisfying both the correctness and the soundness specification of the API.
Yes, for me the safety condition covers both the type invariants and the safety documentation. In particular I distinguish the safety documentation from the safety condition. That's probably why I prefer to call the safety condition the soundness-related specification or the soundness invariant.
I guess that's where there is a misunderstanding. For me, violating the safety condition of a library (called library UB) is forbidden, the same way language UB is forbidden. And it's not forbidden because library UB may result in language UB which is forbidden, it's because violating a safety contract is forbidden full stop. It doesn't matter if it's immediate language UB or not. That's exactly why I don't like your notion of soundness requirement which opens the door to such behavior. If you want a healthy ecosystem, people should not violate contracts (in particular safety contracts). Library UB is how you prove soundness in a compositional way. It's important to understand that soundness is not completeness. It's perfectly fine to be over-restrictive and reject programs without undefined behavior because some contract is too strict.
I agree with this statement and the other ones in the same block. But I would formulate it differently. You're not really doing library UB in that case, because you have access to a different safety contract, an internal one. I wouldn't call this library UB, otherwise it might suggest to other people that it's fine to have library UB.
The UB confusion can actually be summed up in a picture.

**Picture in the syntactic world.** …

**Picture in the semantic world.** …

**Parallel lines.** We have 2 pairs of parallel lines: those refining SV and those possibly causing IB. That's why we can say that library SV and language SV are fully parallel. However, UB distinguishes between them. In particular: …

**Confusion.** The confusion is the result of the interpretation function from syntax to semantics not being well-defined. The best I can come up with is: …
There is ample precedent for calling things "UB" that are library UB; the entire C++ standard library is specified that way, and as mentioned above the Rust standard library docs also use "UB" to refer to more than just language UB. So I don't agree with your framing here implying that I would not be preserving the usual meaning of UB. I think I am in fact preserving its existing meaning. Or rather, I am disentangling what is usually just called "UB" into two subclasses: "language UB" and "library UB". Replying to one point by @CAD97:
I don't think this is a special case in any way. Yes, the language invariant is not sufficient to justify soundness of a primitive operation here. But that's completely fine, and happens elsewhere as well: …
The language invariant is not the invariant relevant for a type safety proof. That would be the safety invariant / library invariant. (There's a reason I used "safety" here in my original blog post. ;) So I don't think this observation is surprising or concerning.
Thanks! This answers the first point and gives a meaning to the notion of "program UB". For example: …
This leaves us with the second point and the topic of this issue. There is no parallel between language/library UB and validity/safety invariant. More precisely, this parallel only holds outside the scope of unsafe. So if we use "language invariant", we shouldn't use "library invariant", to avoid confusion. Assuming everybody prefers "language invariant" for validity invariant, I can see at least 3 options for safety invariant: …
By the way, isn't the C++ standard library part of the C++ language? Aren't implementations allowed to look at symbol names and, if they recognize something from the standard library, optimize based on the language's specification of that symbol? If yes, I'm not convinced by this part of your argument. However, I'm convinced by the other part, which was also mentioned earlier: "This ship has unfortunately sailed long ago" (source).
I don't agree; I still think there is a completely fine parallel here: generally, violating the library invariant without violating the language invariant risks library UB but does not immediately risk language UB.
I guess since they don't make their AM explicit one can't really say... but I don't think anyone knows how to make all those constraints operational, so I don't think there is a reasonable way to make all of this AM-level UB.
The reason I believe the distinction between safety invariant and "soundness invariant" matters is that type invariants only matter when writing unsafe code (otherwise the type system has you covered), and unsafe (or robust) public APIs are where those invariants may differ, so it's important to distinguish them. But I agree that: …
That's not an invariant, that's just the precondition of a particular function. I don't think we need to come up with a new term for this very well-established concept.
They're both part of the standard, but it really depends on what you mean by "part". For C++23, the standard library is defined in section 17. It's roughly analogous to "is libcore part of the Rust language?". Clause 17 standardizes the "language support library", which does the rough equivalent of defining what we refer to as "lang items" here in Rust.
Yes. A famous example here is how memory allocation isn't considered observable behavior, so this code:

```cpp
#include <new>

int foo() {
    int *five = new int;
    return 42;
}
```

will compile to:

```asm
foo():
        mov eax, 42
        ret
```

(This is also the case if you use `malloc` instead of `new`.) I don't have any strong opinion about whether this has any implications on how Rust defines UB, just figured I'd add some context here.
Those can be seen as the same thing; it's a matter of style. One is "contracts as types" and the other is "contracts as pre-posts". For example, RefinedRust has annotations in the "contracts as pre-posts" style but is actually implemented in a "contracts as types" style. A concrete example would be:

```rust
/// Randomizes the content of a slice.
///
/// # Safety
///
/// `ptr` must be valid to write for `len` bytes.
pub unsafe fn randomize(ptr: *mut u8, len: usize);
```

In this example, I'm considering … In the "contracts as pre-posts" style, we have a lemma to prove by the implementer and to use by the callers: …
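A sketch of what such a lemma could look like (my formulation, in informal Hoare-triple notation):

```text
{ ptr is valid for writing len bytes }
    randomize(ptr, len)
{ the len bytes starting at ptr are initialized (with random content) }
```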
In the "contracts as types" style, we have a type in a richer type system: …
In particular, the invariant of the occurrence of … It's important to note that type occurrences constrained by the safety documentation refine the validity invariant, not the safety invariant (in particular, the safety invariant is the default refinement of the validity invariant). This matters to express the type of …
We can unfold the definitions of …
If …
Indeed we don't. It's just a matter of style, which is why I said "this is somehow a matter of style, so not that important". Some people like to prove properties of their programs using pre-posts, others using types. We don't need a new term, but we need to understand the concept to understand when the parallel between language/library UB and validity/safety invariant does not hold. You can violate the safety invariant without library UB (robust functions), and you can get library UB without violating the safety invariant (unsafe functions). Violating the validity invariant, however, is always language UB (the reciprocal is true when restricting to the type invariant section of the language contract). Similarly, violating the "soundness" invariant is always library UB.
Indeed, if one can split the language specification …
I strongly disagree with this. You are causing a completely unnecessary terminology confusion here. Using "invariant" for "has to be true when this type is passed across API boundaries (but exceptions apply within a function)" is fairly standard (in safety/library invariant); even Wikipedia calls this an invariant. We are already stretching the notion of "invariant" by using it for things that have to be true "on every typed load" (in validity/language invariant), but in some sense this is similar to the former use of invariant: it's something that has to be true for all operations on this type. But calling the precondition of a specific function an invariant is unheard of AFAIK, and I will strongly push back against that. We use words that have specific technical meaning for a reason.
I'm not calling the precondition an invariant. In the "contracts as types" style, there are no preconditions, there are just types. Pre- and post-conditions only exist in the "contracts as pre-posts" style. However, I claim there is a parallel between them, which can be seen with these 2 sentences: …
Given you mention the "on every typed load" particularity of the validity invariant, it's also interesting to see that this is also true for the "soundness" invariant. You can annotate your program such that every typed load satisfies the "soundness" invariant (which implies the validity invariant). This gives us the parallel you want: …
The difference with the validity and safety invariants is that the "soundness" invariant respects subtyping (hence can be used to prove the program or library sound in a type-system fashion). One never casts a value from T to S if the "soundness" invariant of T does not imply the one of S, while one can do so with the validity and safety invariants. You can cast from …
The difference here, AIUI, is that when @RalfJung or I say «type», we are referring to the syntactic type that is written (or inferred) in the Rust source code, whereas @ia0 is using «type» to mean roughly the specific-to-an-API-signature semantic refinement of the «validity invariant» which captures all of the conditions for the soundness proof of that API signature.

Disclaimer: I worded this a bit authoritatively, but it isn't intended to be final, nor is it the expressed opinion of anyone in the mentioned groups other than myself.

@ia0, the work you've put into developing your unsafe mental model document is appreciated, and I do believe it's a useful resource; however, it is not shared vocabulary with T-opsem or the UCG. Neither surface Rust, MIR, nor MiniRust has any concept of refinement (sub)typing below the syntactical (language/validity) typing, so the attempt to use your modeling of API contracts as refinement types to influence the common language, which doesn't utilize such refinement typing, fails to translate properly. I think we've identified the ideal resolution here:

Standard documentation can use "language invariant" and "library invariant" to refer to the (syntactic) type requirement that is required for a typed copy, or for other safe API to be actually safe to use, respectively. And your doc, which extends Rust's unsafety model with a more formal reasoning model for managing the unsafety in a program, can go into refinement typing for attaching unsafe contracts on top of syntactic types, and how under this model the "library invariant" is just the default refinement for a syntactic type unless otherwise specified. Also keep in mind that the terms discussed here are very much intended to be used as the "surface level" terminology that gets seen by everyone who writes Rust and uses the stdlib documentation. It's important for it to not be wrong, so the path to the full details of safely using … And FWIW, I think "library invariant (of the syntactic type)" is a perfectly fine semantic for your model to replace with "safety invariant (of the specific flow-dependent type refinement at this point of the execution)", as your model is fully about replacing the syntactic type with the semantic type used for soundness reasoning.
It's been more than six years since my blog post that introduced these terms, and I think it is clear at this point that the concept of these two kinds of invariants is here to stay. We should start using this concept in stable docs, so that we can write clear docs.
The main thing that gives me pause here is that I am not happy with the terminology. Personally I regret calling them "validity invariant" and "safety invariant"; I now think that "language invariant" and "library invariant" are better terms. They go along nicely with the terms "language UB" and "library UB" that we have also been using already. I don't feel comfortable pushing for using these terms in stable docs until we have this resolved. We had #95 open for a while to find better terms, but that got closed as part of our backlog cleanup.
@digama0 was also recently confused by the fact that we call these "invariant" when they are not strictly speaking invariants that hold on every step of execution -- they are more invariants that hold about values at given points in the program. In verification we also talk e.g. about "loop invariants" that only hold at the beginning (or some other fixed point) during each loop, not all the time during the loop, and at least in Iris our "invariants" can be temporarily broken while they are being "opened" (and closed again within the same atomic step). The work on the anti-frame rule also uses the term "invariant" for things that are true between the operations on a data structure, but not during an operation. So IMO it is justified to use the term "invariant".
@rust-lang/opsem I'd like to commit to using the terms "language invariant" and "library invariant", and then start using those in the reference and stable docs. What do you think?