
Stabilize having the concept of "validity invariant" and "safety invariant"? Under which name? #539

Open
RalfJung opened this issue Oct 18, 2024 · 71 comments

Comments

@RalfJung
Member

It's been more than six years since my blog post that introduced these terms, and I think it is clear at this point that the concept of these two kinds of invariants is here to stay. We should start using this concept in stable docs, so that we can write clear docs.

The main thing that gives me pause here is that I am not happy with the terminology. Personally I regret calling them "validity invariant" and "safety invariant"; I now think that "language invariant" and "library invariant" are better terms. They go along nicely with the terms "language UB" and "library UB" that we have also been using already. I don't feel comfortable pushing for using these terms in stable docs until we have this resolved. We had #95 open for a while to find better terms, but that got closed as part of our backlog cleanup.

@digama0 was also recently confused by the fact that we call these "invariants" when they are not, strictly speaking, invariants that hold at every step of execution -- they are more like invariants that hold of values at particular points in the program. In verification we also talk, e.g., about "loop invariants" that only hold at the beginning (or some other fixed point) of each loop iteration, not all the time during the loop, and at least in Iris our "invariants" can be temporarily broken while they are being "opened" (and closed again within the same atomic step). The work on the anti-frame rule also uses the term "invariant" for things that are true between the operations on a data structure, but not during an operation. So IMO it is justified to use the term "invariant".

@rust-lang/opsem I'd like to commit to using the terms "language invariant" and "library invariant", and then start using those in the reference and stable docs. What do you think?


@arielb1

arielb1 commented Dec 18, 2024

@rust-lang/opsem I'd like to commit to using the terms "language invariant" and "library invariant", and then start using those in the reference and stable docs. What do you think?

I don't like these. Many library invariants are not safety invariants (e.g., if you have a non-deterministic Hash function, you are very likely not to find your items in a HashMap, but nothing unsafe will happen), and some language invariants are safety invariants (e.g., if you have aliasing &mut references, then as long as you don't use them nothing bad will happen).

So there's at least:

  1. abstract machine invariant - if you violate this demons might fly out of your nose, immediately do not pass go, even if you would immediately afterward call _exit(0) without calling any intermediate library function.
  2. library non-safety invariant - if you violate this, the library might do something safe but undesirable (e.g. a non-deterministic hash on a HashMap)
  3. library "validity" invariant - if you violate this and call a library function on that object, you might have demons flying out of your nose. But if you don't call a library function on that object, nothing bad will happen. e.g. I believe non-UTF8 &str falls here.
  4. safety invariants - if you violate this, you can write more or less carefully-crafted safe code that will violate an abstract machine invariant [but if you don't write that particular safe code, the program will do something very well-defined and possibly useful].
    4.1. including fully library safety invariants like an Rc with an invalid reference count - which should not cause UB as long as it doesn't unexpectedly reach 0.

I think that "safety invariant" is a good name, but "validity invariant" is not a particularly good name for "if you violate this demons will pop out of your nose". Though maybe "language invariant" is a good name (so language invariant/safety invariant).

@Lokathor
Contributor

I think any two names that do not share the first letter are better than two names that do share the same first letter.

So I vote for validity/safety over lang/library. But also any two other words of differing first letter would be fine too: lang and safety, for example.

@RalfJung
Member Author

RalfJung commented Dec 18, 2024

Many library invariants are not safety invariants (e.g., if you have a non-deterministic Hash function, you are very likely not to find your items in a HashMap, but nothing unsafe will happen), and some language invariants are safety invariants (e.g., if you have aliasing &mut references, then as long as you don't use them nothing bad will happen).

Well, you've made your own choices here by calling these "library invariants" and "language invariants"; choices I would not agree with. ;)

So I vote for validity/safety over lang/library. But also any two other words of differing first letter would be fine too: lang and safety, for example.

So do you propose we also call it language UB and safety UB? That doesn't make a lot of sense, and IMO it'd be good to align those terms with the invariants that cause the respective UB when violated.

I don't recall us ever abbreviating validity/safety invariant with the first letters, so I don't think it's very important that they have different first letters.

@Lokathor
Contributor

But there isn't actually safety UB is there? All safety invariants are there because when violated they can allow otherwise fine code to end up breaking a language rule somewhere later. As far as I'm aware, there's just the one kind of UB.

@RalfJung
Member Author

RalfJung commented Dec 18, 2024

Library UB is real and not the same as language UB. For instance, it is library UB to call any &str method with a non-UTF-8 str, but it is language UB only for some operations that happen to rely on the UTF-8 invariant in a way that triggers language UB (i.e., a Miri error) when the invariant is violated. Future library updates may change existing methods so that more (or fewer) of them have language UB when invoked with a non-UTF-8 str.
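This distinction can be made concrete with a small, deliberately contract-violating sketch. The helper name is invented, and the example relies on the current rule (as discussed in this thread) that `str`'s language-level validity does not require UTF-8; it must not be imitated in real code:

```rust
// Illustrative only: this commits *library* UB by violating the documented
// UTF-8 precondition of `from_utf8_unchecked`. Since `str`'s language-level
// validity does not currently require UTF-8, this particular use does not
// trigger language UB -- but any `&str` method that relies on UTF-8 could
// turn it into language UB, depending on the library version.
fn byte_len_of(bytes: &[u8]) -> usize {
    // SAFETY: none -- the precondition is intentionally violated for the demo.
    let s: &str = unsafe { std::str::from_utf8_unchecked(bytes) };
    // `len` only reads the slice length and never inspects UTF-8 structure.
    s.len()
}

fn main() {
    assert_eq!(byte_len_of(&[0xFF, 0xFE]), 2); // not valid UTF-8
}
```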

@zachs18

zachs18 commented Dec 18, 2024

As far as I'm aware, there's just the one kind of UB.

They go along nicely with the terms "language UB" and "library UB" that we have also been using already.

(From the OP)

The UB you are referring to is also called "language UB", and "library UB" is used in contrast to mean basically "a violation of library invariants/preconditions/etc that could lead to (language) UB depending on usage/library implementation details/etc", which is useful to have a succinct term for.

@digama0

digama0 commented Dec 18, 2024

So do you propose we also call it language UB and safety UB? That doesn't make a lot of sense, and IMO it'd be good to align those terms with the invariants that cause the respective UB when violated.

I would call them "language UB" (or just UB as I think there is only one thing that should get that name) and "safety violation" respectively.

@Lokathor
Contributor

Yeah, see, I'm in agreement with @digama0. If you have a &str and unsafely set the bytes to not be utf8, then that could lead to UB, but it's not actually UB the moment the bytes are wrong.

I don't think we should call that state of having invalid bytes "library UB". I don't think we should call any other similar situations "library UB" either. I do like the term "safety violation".

Otherwise you have to start explaining to people that sometimes there's something called UB that they're allowed to do anyway, as long as they fix the situation before anything "really bad" happens. And that's a very easy thing for people to misunderstand, or only halfway remember months after the fact.

@RalfJung
Member Author

Violating such library preconditions is often called UB, and I think it makes sense. UB basically means "you violated the contract of this API and now all guarantees are void". Language UB is about the contract between the programmer and the compiler (where the "API" is the language itself); library UB is about the contract between the library author and the client author.

@Lokathor
Contributor

Right, except I'm telling you that what people often end up hearing is, "I'm allowed to do some types of UB".

I think that's bad, and if we're going to deliberately pick terms, we should deliberately pick terms that don't let people think that "sometimes UB is okay". Because that's a real thing people have said to me, and I've had to stop and explain to them what's "actually UB", and that they really can't do it, not even sneakily.

So call them lang invariants and lib invariants if you want, but I strongly suggest that you not use those names to draw a connection to "lang UB" and "lib UB": those are not good terms, and I have seen them do harm to people's understanding of Rust on multiple occasions over the years.

@RalfJung
Member Author

You're not allowed to do library UB with other people's libraries either. I have seen this analogy work pretty well, too, so I wouldn't be so quick to discard it.

@Lokathor
Contributor

Lokathor commented Dec 18, 2024

I know it does work for some people, but if it works for some people and not for others, then I don't think it entirely works.

Of course you should not do any library UB, but the fact that you can actually do it, without complaints from Miri (e.g.: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=919353ecbc1b830d3f8daa6f3180888e), gives some people wrong impressions about what the terms even mean. I don't think we should design our terms just for experts; we should also design them for the uninformed and forgetful. There are a lot more uninformed and forgetful programmers than expert ones.

@ia0

ia0 commented Dec 18, 2024

I agree with @Lokathor that people do library UB and the reason is what I call "robust functions" in the unsafe mental model and what you call "super safe functions". As a library author you can document that you accept more values than what the safety invariant says. Typical examples are non-UTF-8 str and pinned &mut T. You're not breaking the safety contract with the library author, just the safety invariant of the syntactical type. But that's just the default interpretation, not the one used for soundness.
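A minimal sketch of such a "robust" function (the name and its documented contract are invented for illustration): although it takes `&str`, its documentation promises to inspect only the raw bytes, so it accepts values that violate `str`'s usual UTF-8 library invariant without its own contract being broken:

```rust
// Hypothetical "robust" function in the sense described above: its
// (made-up) documentation promises to look only at the raw bytes of `s`,
// so callers may pass a `&str` whose UTF-8 (library) invariant is broken
// without violating *this* function's contract.
pub fn count_ascii_bytes(s: &str) -> usize {
    s.as_bytes().iter().filter(|b| b.is_ascii()).count()
}

fn main() {
    // "é" encodes as two non-ASCII bytes, so only h, l, l, o are counted.
    assert_eq!(count_ascii_bytes("héllo"), 4);
}
```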

@RalfJung
Member Author

@Lokathor Nothing works for all people, it's a matter of tradeoffs. 🤷

I have registered your opinion, let's see what other people say.

@ia0 Library UB is a violation of the contract between the library author and the client. If the library author documents that it's okay to call some &str function with non-UTF-8 data, then doing such a call does not violate the library contract, and therefore this is not library UB.

@Diggsey

Diggsey commented Dec 18, 2024

I think we all agree that if we use the term UB it should be for situations that always indicate incorrect code, i.e. it's never OK for a program to exhibit something we have decided to call UB.

It seems the argument is more over whether violating library invariants should always indicate a problem with the program, or whether there are some situations where violating library invariants is acceptable - if there are such cases then we shouldn't call it UB.

I tend to agree with @RalfJung that it should be considered UB as long as we:

  • Are clear that library UB is still UB and always indicates incorrect code.
  • Are clear that Miri only detects a subset of UB (presumably with the goal of detecting all language UB).

There are still some problems though:

  • It adds a lot more kinds of UB that can't be detected via tools like Miri.

  • Library UB is determined by the library's public interface with the world - that is, when we are working inside the library, the concept of library UB for itself doesn't really exist. For example, a library may say that calling foo(0) is UB, but may call foo(0) inside its own implementation, using the knowledge that (in the current version of the library, under these conditions) it will not cause language UB.

    It's unclear to me how to encode the conditions under which a function call is subject to library invariants.

    If I have two related crates (e.g. foo-core and foo), can I have foo bypass library invariants of foo-core because it is aware of foo-core's implementation details?

    We'd potentially need some notion of "open" or "closed" when depending on libraries to determine whether calls into such a library are subject to its invariants.

    As for why this would crop up - as a library author I probably want to have a simplified contract with the outside world - i.e. instead of saying it's UB to call foo(0) on a Tuesday, I might want to just document that it's always UB to call foo(0). It makes it simpler for users to understand and it gives me more flexibility to change the implementation.
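That simplified-contract pattern might look like this sketch (crate and function names are all hypothetical):

```rust
/// Hypothetical `foo-core` API.
///
/// # Safety
/// External callers must not pass `x == 0`: the public contract declares
/// that library UB, even though the *current* implementation handles 0
/// without language UB. That slack is what lets future versions change
/// the implementation freely.
pub unsafe fn foo(x: u32) -> u32 {
    // Today: a plain wrapping computation, no language UB even for 0.
    x.wrapping_add(1)
}

/// Inside the crate we may rely on the implementation detail that
/// `foo(0)` is harmless today; external users may not.
pub fn internal_helper() -> u32 {
    unsafe { foo(0) }
}

fn main() {
    assert_eq!(internal_helper(), 1);
    assert_eq!(unsafe { foo(41) }, 42);
}
```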

@ia0

ia0 commented Dec 18, 2024

Library UB is a violation of the contract between the library author and the client. If the library author documents that it's okay to call some &str function with non-UTF-8 data, then doing such a call does not violate the library contract, and therefore this is not library UB.

Ah ok my bad. I thought the analogy between "validity invariant" and "language UB" on one side and "safety invariant" and "library UB" on the other, was stronger1. I fear this will bring additional confusion to the confusion @Lokathor is mentioning2.

Footnotes

  1. "language UB" is when the "validity invariant" of a type is broken, but "library UB" is not when the "safety invariant" of a type is broken, but when the "user-defined invariant" of a type is broken (which is the "safety invariant" by default if there's no additional documentation).

  2. Namely that "library UB" is not as bad as "language UB" (because breaking the "user-defined invariant" may not immediately break soundness), and most people will confuse both concepts and use the properties of one (library UB not a big problem) to justify the other (language UB, which is actually a big problem).

@Diggsey

Diggsey commented Dec 18, 2024

Namely that "library UB" is not as bad as "language UB" (because breaking the "user-defined invariant" may not immediately break soundness)

This is not a real distinction.

  • Causing language UB may not immediately cause problems. But it might do on compiler upgrade or optimization change, etc.
  • Causing library UB may not immediately cause problems. But it might do on library upgrade.

The parallel between language UB and library UB is actually quite strong.

@ia0

ia0 commented Dec 18, 2024

  • If I have two related crates (e.g. foo-core and foo), can I have foo bypass library invariants of foo-core because it is aware of foo-core's implementation details?

I fear this is a separate discussion (so probably better to open a thread on Zulip). You're always allowed to look at the implementation of a crate. It's just that if you rely on it for soundness (or correctness, if you care about that too), then you need to pin the version you audited with the = Cargo version requirement.

  • Causing language UB may not immediately cause problems. But it might do on compiler upgrade or optimization change, etc.

If you do this, you're breaking the contract of the compiler. If you have language UB, your program is undefined (literally, you can't say what it does) and there's no guarantee on the output of the compiler (it may not terminate, it may crash, it may produce a buggy program, or even a program that seems to "work" until an attacker figures out it doesn't).

@Diggsey

Diggsey commented Dec 18, 2024

You're always allowed to look at the implementation of a crate.

Am I also allowed to look at the implementation of the compiler? What about libstd?

If you do this, you're breaking the contract of the compiler. If you have language UB, your program is undefined (literally, you can't say what it does) and there's no guarantee on the output of the compiler.

And if you break a library invariant you're breaking the contract of the library. Literally you can't say what the library will do. And there's no guarantee on the output of the library.

Both the language and the library provide a contract with the user. There's an exact parallel between them.

If I pin the version of the compiler and exploit its implementation details, I can be confident in what it will output. I won't be writing "valid" Rust code anymore because I broke the contract of the language.

In the same way if I pin a version of a library and exploit its implementation details, I can be confident in what it will output. I won't be a "valid" user of that library anymore because I broke the contract of the library.

@chorman0773
Contributor

Am I also allowed to look at the implementation of the compiler? What about libstd?

You are allowed to look at both and theoretically could pin the (rustc, cg backend, llvm/cranelift/gcc) triple, and do whatever the heck you want (including language UB) if you know what the result will be (and that result is desirable), yes. But as you alluded to, you would no longer be a valid user (of either Rust as a language or rustc as a compiler).

@ia0

ia0 commented Dec 18, 2024

I won't be a "valid" user of that library anymore

But you never were. Your initial question was when the same user writes 2 crates: foo-core and foo. The crate foo can look at the implementation of foo-core and pin it. Other users of foo-core need to follow the public API of that crate (unless they have an out-of-band contract with the crate author).

You're not writing the compiler, so that's a different situation.

To push the example further, the standard library authors are somehow the same (OK, not really, but close enough, and in particular they have an out-of-band contract) as those of the compiler, which is why the standard library is allowed to use language UB that normal Rust users can't. (This is probably only tolerated because the actual language UB rules are not stable yet.)

@digama0

digama0 commented Dec 18, 2024

To push the example further, the standard library authors are somehow the same (ok not really but close enough, and in particular they have an out-of-band contract) as those of the compiler, which is why the standard library is allowed to use language UBs that normal Rust users can't. (This is probably only tolerated because the actual language UBs are not stable yet.)

Actually I think std doesn't use UB (or at least we try not to); std fixes some underspecified language nondeterminism like struct layout (which is not UB if you "guess it right"). (Cue Jubilee telling me about unspeakable horrors in std or compiler-builtins.)

@chorman0773
Contributor

It doesn't, though it 100% could from the perspective of an implementor. It's called implementor's privilege. Whether or not it should from a design perspective of course is a separate thing.

(Also, nit, struct layout isn't really underspecified - it's unspecified, intentionally)

@Diggsey

Diggsey commented Dec 18, 2024

Your initial question was when the same user writes 2 crates: foo-core and foo.

No, that was one example of many.

My point is only that there is a direct parallel between library UB and language UB.

The parallel for foo-core and foo would be the compiler and std.

The parallel between foo and foo-fork would be Rust and CrabLang.

The parallel between foo and "user who pins foo=1.2 and then relies on implementation details" would be Rust and "user who pins Rust to version X because they know they can get it to generate a very specific assembly sequence which is needed for their embedded use-case".

Rust does not support this last use-case. The author of foo also does not support this use-case.

There is no qualitative difference between these two types of UB, only that it is more common to rely on library implementation details than it is to rely on language implementation details.

@chorman0773
Contributor

chorman0773 commented Dec 18, 2024

(Also, there's a fourth case - foo and foo-rewrite which copies the public api of foo but not the internals)

@arielb1

arielb1 commented Dec 18, 2024

The parallel for foo-core and foo would be the compiler and std.

The difference here is that std is supposed to never cause undefined behavior in the nasal demons sense. std might depend on particular compiler implementation details, but the "nasal demons" rules apply to it just as they apply to everyone else.

I think one way of putting it is that undefined behavior is always considered to be a program error, and std is supposed to never cause unexpected program errors.

Someone other than std (e.g. random embedded user) can write code that exhibits undefined behavior but that a specific version of the compiler compiles into something that is useful to them. This is of course unsupported.

@ia0

ia0 commented Dec 21, 2024

Actually, I found a world1 where library UB is the same as language UB. In this world, documentation is part of the program and participates in its semantics. Compilers are thus allowed to look at it and optimize based on their precondition that the program does not have library UB. The only difference between library UB and language UB at this point is whether the semantics was written by the language or by the library, reflecting exactly the idea behind those names.

Footnotes

  1. I actually like this world, it's a good mental model. But I don't think it represents reality. AFAICT the operational semantics doesn't parse documentation to define the reduction.

@ia0

ia0 commented Dec 22, 2024

And I found a point of view in the current world that would make (rigid) library UB the same as language UB, by moving the distinction between library UB and language UB to malleable library UB and rigid library UB:

  • Malleable library UB: Breaking the official1 library soundness specification but satisfying another soundness specification (e.g. auditing and pinning the code, having an out-of-band contract, etc). We could say in that case that there is no library UB.
  • Rigid library UB: Breaking all possible soundness specifications. This means that the program actually has undefined behavior, thus language UB.

By defining library UB as only rigid library UB (thus malleable library UB are not library UB), then we have equivalence between library UB and language UB.

Footnotes

  1. The one defined by the documentation of the library on crates.io. Note that this specification could be parametric over the user (e.g. distinguishing between users implementing a trait provided by the library and those merely using that trait, thus providing wiggle room for non-breaking changes in either of the two specifications). We assume the one that applies to the user at hand.

@lilizoey

I'm not sure if this has been addressed, but the term "language UB" seems to imply that UB caused by other languages would not be (Rust) language UB. For instance, if you use FFI to call into some C++ code, which then hits UB, is that language UB from the Rust point of view? Any library that exposes this FFI-based code would have to document it such that it'd be library UB to hit that, so it would definitely be library UB at least.

@ia0

ia0 commented Dec 22, 2024

the term "language UB" seems to imply that UB caused by other languages would not be (rust) language UB.

UB through FFI is UB (aka language UB): https://doc.rust-lang.org/reference/behavior-considered-undefined.html (search for the first occurrence of FFI). It's a good point that this could be confusing, but I'm not sure if it's confusing enough in practice to justify renaming language UB (which should really just be called UB).

@comex

comex commented Dec 23, 2024

And I found a point of view in the current world that would make (rigid) library UB the same as language UB, by moving the distinction between library UB and language UB to malleable library UB and rigid library UB:

  • Malleable library UB: Breaking the official1 library soundness specification but satisfying another soundness specification (e.g. auditing and pinning the code, having an out-of-band contract, etc). We could say in that case that there is no library UB.
  • Rigid library UB: Breaking all possible soundness specifications. This means that the program actually has undefined behavior, thus language UB.

By defining library UB as only rigid library UB (thus malleable library UB are not library UB), then we have equivalence between library UB and language UB.

I still don't see how this distinguishes between language and library.

As people have said already, you can audit and pin the compiler, or have an out-of-band contract with the compiler. I'll add on that this is not theoretical, but a fairly common occurrence. See @RalfJung's list of crates deliberately using UB. Or this random example from the standard library. Or as a broader example, allocator implementations and how they interact with the aliasing model; every allocator is committing at least LLVM-level UB.

Well, in practice it's less "pin the compiler" and more "hope upgrades don't break anything". Less "out-of-band contract" and more "vibes on the unsafe-code-guidelines repo". But empirically, that's been good enough: many of these deliberate uses of UB have been around for a long time without being broken by compiler upgrades. In that sense, language UB too can be "malleable".

To be clear, having deliberate uses of language UB is an undesirable situation, and I greatly appreciate the work being done to find better alternatives. But you could say the same about deliberate uses of library UB.

@ia0

ia0 commented Dec 23, 2024

As people have said already, you can audit and pin the compiler, or have an out-of-band contract with the compiler.

Yes, I'm among them: #539 (comment).

Or this random example from the standard library.

Thanks for the link, you're fixing #539 (comment) and confirming what I'm saying.

Well, in practice it's less "pin the compiler" and more "hope upgrades don't break anything". Less "out-of-band contract" and more "vibes on the unsafe-code-guidelines repo". But empirically, that's been good enough: many of these deliberate uses of UB have been around for a long time without being broken by compiler upgrades. In that sense, language UB too can be "malleable".

I was supposed to have addressed this in #539 (comment) with the dont-rely-on-me language which formalizes the notion of vibes with the non-determinism of a particular UB side-condition being checked in the semantics or not. But your parallel of language UB also being malleable helped me see in which ways people consider language UB and library UB to be the same.

My problem was that I was assuming "UB" to mean "language UB" in "library UB" (i.e. library language UB or language UB at library level), while it's actually a different concept. So let me try to add 2 more definitions to the definitions of safety invariant, validity invariant, safety requirement, and soundness invariant I gave in #539 (comment):

  • Language UB occurs when a program fails a side-condition in the operational semantics. It can be worked around by knowing that the compiler actually implements an extension of the operational semantics where some side-conditions are not checked.
  • Library UB occurs when part of a program fails a safety requirement when calling another part of the program. It can be worked around by knowing that the other part doesn't actually rely on that safety requirement to prevent language UB.

I believe those names are highly confusing and should change. They are more confusing than validity and safety invariant, which I don't find that problematic. My suggestions would be UB for language UB like every other language does, and safety violation for library UB (or something along those lines, no strong opinion except that it should not mention UB).

To be clear, having deliberate uses of language UB is an undesirable situation, and I greatly appreciate the work being done to find better alternatives. But you could say the same about deliberate uses of library UB.

This actually hints at another distinction between UB and safety violations: formalism.

We have 2 dimensions going from completely informal (0) to fully formal (1):

  • X-axis: How formal is the language specification.
  • Y-axis: How formal are library specifications (let's just focus on soundness).

Although it doesn't matter for what follows, I'm going to assume that Y ≤ X and thus focus on the bottom right triangle instead of the whole square.

My impression is that today we have X=0.5 (there is a concrete ambition to reach a formal specification of the language, and there are already progress artifacts like MiniRust, Miri, and subset languages in research) and Y=0.1 (English prose only but trying to use a common vocabulary). In particular we have Y << X.

The utopia world I described in #539 (comment) is at coordinates X=1 and Y=1. In this world, https://rust-lang.github.io/rust-project-goals/2024h2/Contracts-and-invariants.html is a thing and library authors can write attributes1 to specify their library. We have the same level of formalism for both UBs and safety violations.

I believe that saying that UB and safety violations are the same may give the impression that Y = X, i.e. they have the same level of formalism, which is currently not true.

Footnotes

  1. Those attributes are part of the program and thus could participate in its semantics. If they do, the compiler is allowed to optimize on them and Miri is required to check them. But this is not necessary since safety violations don't need to be UB.

@ia0

ia0 commented Dec 23, 2024

To complete the naming suggestions (using the definitions I gave for the current names, using [] for optional terms, and using {,} for alternative terms):

| context | current name | suggested names |
| --- | --- | --- |
| language operation | language UB | UB (undefined behavior) |
| library operation | library UB | { safety [requirement], soundness [proof] } violation |
| library specification | safety requirement | { safety, soundness } requirement |
| type interpretation | validity invariant | { language, validity } invariant |
| type interpretation | safety invariant | { default [soundness], safe-code, safety } invariant |
| type occurrence interpretation | soundness invariant | { [{ actual, effective }] soundness, unsafe-code } invariant |

@arielb1

arielb1 commented Dec 23, 2024

library UB

I think that "library UB" is an OK name for what happens when calling a library function in a way that places no requirements on the library's behavior [which means the library is allowed to invoke language UB]. (For safe functions, this can only happen if you pass a parameter for which the type's safety invariant does not hold - though the converse does not hold: it is possible to document a function so that it has defined behavior even for some parameters for which the type's safety invariant does not hold.)

[As opposed to the state where you merely have a value that violates some library's safety invariants, but are not passing that value to that library].

The difference between library invariant and language invariant is where they have to hold. The language invariant always has to hold, the library invariant has to hold when calling particular functions. By default, all functions require the library invariant defined by their argument types, but functions can opt-out from that via documentation. This model both (a) preserves the parallel between language UB and library UB and (b) perfectly describes the Pin situation.

The relationship is that the safety invariant of a type is the default library invariant of functions taking that type as a parameter. If you have a type like &mut &mut bool that does not come from a library, its safety invariant is a library invariant of the functions taking it as a parameter, not a library invariant of the library defining that type.
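A small, entirely hypothetical sketch of this framing (the `Even` type and its methods are made up for illustration): the safety invariant of the type is the default library invariant of functions taking it, but a function can document a weaker requirement.

```rust
// Hypothetical type whose safety invariant ("the wrapped value is even")
// is stronger than its validity invariant (any initialized u32).
pub struct Even(u32);

impl Even {
    pub fn new(n: u32) -> Option<Even> {
        if n % 2 == 0 { Some(Even(n)) } else { None }
    }

    /// Default library invariant: `self` satisfies `Even`'s safety
    /// invariant, because that's what the signature says.
    pub fn half(&self) -> u32 {
        self.0 / 2
    }

    /// Documented opt-out (hypothetical docs): "defined for any bit-valid
    /// `Even`, even one whose safety invariant is broken". The library
    /// invariant of this function is only `u32`'s validity invariant.
    pub fn get_raw(&self) -> u32 {
        self.0
    }
}

fn main() {
    let e = Even::new(6).unwrap();
    assert_eq!(e.half(), 3);
    assert_eq!(e.get_raw(), 6);
    assert!(Even::new(7).is_none());
}
```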

@ia0

This comment has been minimized.

@ia0
Copy link

ia0 commented Dec 23, 2024

Actually found a way to make language UB and library UB fully parallel as advertised by #539 (comment). One needs the following definitions:

  • Unexpected behavior means that a source program was compiled to a target program, but the target program does not satisfy the (correctness) specification of the source program. This can only happen if the compiler is incorrect, the source program has undefined behavior, or the source program does not satisfy its specification. We'll only care about the second cause by assuming the compiler correct (the third cause is void because a program needs a meaning to satisfy any property including its specification).
  • X UB means possible unexpected behavior due to the violation of a safety condition of X. Note how "UB" in "X UB" does not mean undefined behavior but (possible) unexpected behavior.

By this definition, language UB is equivalent to UB (undefined behavior) as before. But we can now give a proper meaning to "may" in "language UB may cause an AM fault" from #539 (comment) which is "language UB may cause unexpected behavior" (or equivalently "undefined behavior may cause unexpected behavior").

Similarly, library UB may cause unexpected behavior too, but it may also cause undefined behavior, while language UB always causes undefined behavior, because that's its definition.

To sum up:

| causes | undefined behavior | unexpected behavior |
|---|---|---|
| language UB | always (equivalent by definition) | sometimes (depends on the compiler) |
| library UB | sometimes (depends on the library implementation) | sometimes (depends on the library implementation and the compiler) |

So language UB and library UB are fully parallel only if we interpret UB as (possible) unexpected behavior instead of undefined behavior in their names. If we interpret it as undefined behavior (as I naively did before), then those are different concepts.

@ia0
Copy link

ia0 commented Dec 23, 2024

Now that things are clearer to me, and given how I polluted the thread, I feel responsible for summarizing the current status of definitions and names.

Language UB and library UB

The definitions are given by @CAD97 in #539 (comment). They are "violating a safety condition documented by the language (resp. library)". I believe there is agreement on those definitions.

The names are still in discussion:

  • @digama0 suggested "language UB" or simply "UB", and "safety violation".
  • I also suggested "UB" and "safety violation". I believe UB should not deviate from its usage in C due to its long history. It should just mean undefined behavior as defined by the language semantics.
  • @RalfJung argued that UB should mean "violating a safety condition", which is coherent with the current names.
  • @Lokathor argued that using UB for violations of both language and library safety conditions may confuse users into believing that library UB is not as bad as language UB, because it may not lead to language UB (Miri will not always report a language UB on a library UB). Some may even conflate the two and say that language UB is not bad.
  • I argued that the difference in the level of formalism today between language UB and library UB contributes negatively to the parallel between those concepts.

Validity invariant, safety invariant, and soundness invariant

The definitions of validity and safety invariant are given by @RalfJung in https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html:

  • Validity invariant: "We want the compiler to make some type-based assumptions about data. [...] So, we need an invariant that the compiler may assume always holds, justifying the optimizations we are already doing (and more we want to do in the future). [...] we can ask if some data is valid at type T."
  • Safety invariant: "This is the invariant that every safe function can assume holds. This invariant depends on the type, so I will also speak of data being safe at type T, meaning the data satisfies the safety invariant."

Those definitions are only "at type T" regardless of where the type occurs in the program. I think a possibly missing definition is what I call anonymous types in the unsafe mental model and what I called soundness invariant in this thread. That's needed when proving soundness with a type system rather than a verification tool, although that's mostly a style concern. I give a definition in #539 (comment) along with the 2 other invariants:

  • Soundness invariant: This is the invariant used in the proof of soundness (absence of UB) of the program. It is a refinement type of the validity invariant of a type T at a given occurrence of T in the program. Different occurrences may have different soundness invariants. If the same data has different soundness invariants at different occurrences, appropriate subtyping constraints are generated.
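To make the first two invariants concrete, here is a minimal sketch using the standard `bool` and `str` examples (whether `str`'s UTF-8 requirement is also part of its validity invariant is itself an open question; the safe API treats it as the safety invariant):

```rust
fn main() {
    // Validity invariant: what the compiler may assume always holds.
    // A `bool` must be 0x00 or 0x01; materializing any other byte at
    // type `bool` is immediate language UB, so this stays commented out:
    // let b: bool = unsafe { std::mem::transmute(2u8) };

    // Safety invariant: what every safe function may assume. For `str`
    // the safe API treats "valid UTF-8" as the safety invariant, which
    // is why the checked constructor rejects these bytes:
    let bytes = [0xFFu8, 0xFE];
    assert!(std::str::from_utf8(&bytes).is_err());
}
```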

The names are still in discussion:

  • @RalfJung suggested to use "language invariant" and "library invariant". The rationale being that they go along with "language UB" and "library UB" respectively.
  • @CAD97 proposed the same names.
  • I mentioned that this parallel between "validity/safety invariant" and "language/library UB" does not hold, because it's the soundness invariant that's relevant for library UB, not the safety invariant. There is a parallel between "language/library UB" and "language/library safety condition", so the problem is really between "library safety condition" and "safety invariant": the safety invariant just looks at the type and misses the safety documentation, while the soundness invariant embeds the safety documentation.
  • @Lokathor argued that validity and safety invariants should not share the same first letter.
  • @arielb1 suggested to only rename "validity invariant" to "language invariant" but keep "safety invariant" as is.

I hope I didn't miss some of the naming discussions.

@comex
Copy link

comex commented Dec 23, 2024

Yes, I'm among them

Oops, I should have noticed that.

In particular we have Y << X.

True!

So language UB and library UB are fully parallel only if we interpret UB as (possible) unexpected behavior instead of undefined behavior in their names.

If you assume a specific implementation, then there are more layers below that. As you say, library UB sits on top of language UB, but below that there is MIR UB, then LLVM IR UB, then operating system UB and/or CPU architecture UB. At each layer, UB can result in either UB in the next layer, or defined-but-undesired behavior at that layer.

On some older CPUs it might go all the way down to transistor-level UB.

@chorman0773
Copy link
Contributor

chorman0773 commented Dec 23, 2024

Yeah, I don't think it's useful to distinguish between library UB and language UB.

You can pin libraries that aren't the standard library1 to do on-paper UB things that are fine in practice, but you can also pin your compiler to the same objective. The library doesn't even need to upgrade the library UB to language UB to get UB-esque outcomes either - it can do its own arcane invocations to summon nasal demons.

The reason it can be useful to distinguish between safety invariant and validity invariant is that a safety invariant can be reliably violated without consequence, provided you don't use the value with a broken invariant at that type (including dropping it). You can't violate a validity invariant, even transiently, without causing UB.
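A runnable sketch of that distinction, using `String::as_mut_vec` (whose safety documentation requires UTF-8 to be restored before the `String` is used as such again):

```rust
fn main() {
    let mut s = String::from("ok");
    unsafe {
        // Transiently break `String`'s safety invariant (valid UTF-8)...
        let v = s.as_mut_vec();
        v.push(0xFF); // using `s` as a `str` here would be library UB
        // ...and restore it before the value is used at that type again.
        v.pop();
    }
    // Fine: the safety invariant holds again at every use of `s`.
    assert_eq!(s, "ok");

    // The validity invariants of `String`'s fields (e.g. a non-null,
    // well-aligned buffer pointer), by contrast, may never be broken,
    // even inside the unsafe block.
}
```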

Footnotes

  1. Also, while we talk about library UB as including the standard library, it is as hard to pin the standard library as it is to pin the compiler (since the standard library comes with the compiler implementation and may vary between versions or vendors), which is harder than pinning a crates.io library. In that respect, it's definitely in the same league as language UB.

@CAD97
Copy link

CAD97 commented Dec 24, 2024

One of the benefits of the terms "library UB" / "language UB" is that lib UB is not lang UB. Lib UB may lead to lang UB, but they are disjoint concepts.

I'm not particularly fond of "safety violation" as the term for communicating lib UB, as causing lang UB is also a safety violation.

The split between "safety" and "soundness" is also relevant. The safety condition (whether invariant, precondition, or otherwise) is the documented condition of a type or operation that must be upheld by unsafe code (and cannot be violated by safe code outside the safety barrier) for its usage to be sound. An execution is sound if it is free from language UB. An API is sound if it cannot be used to execute language UB without violating a safety condition. An implementation is sound if it cannot violate a safety condition without having another safety condition being violated first.

So I think we can/should use all four terms, actually, at least when including your mental model documentation. The definitions I'd use:

Language Requirements
The conditions documented by the language that `unsafe` code must uphold. A violation is Language UB.
Library Requirements
The conditions documented by the library that `unsafe` code (and all code inside the safety boundary) must uphold. A violation is Library UB.
Safety Requirements
The union of all conditions required to uphold safety (soundness parametrized over all correct implementations of the used functionality). A violation is a Safety Violation.
Soundness Requirements
The subset of the safety requirements which are necessary to uphold soundness (absence of Language UB). A violation is a Soundness Violation.

(Although note that, to be fully formally accurate, these definitions rely on the technically informal assumption that although UB can time travel, it cannot time travel over an observable sequence point (side effect / IO) of the program. (A compiler would likely need to violate actual causality1 to violate this property.) This still preserves all other "time travel" effects of UB.)

"Validity" isn't an ideal term because of how overloaded "validity" already is, e.g. the current stable std documentation talks about pointer validity2. It's also important to note that "soundness" isn't a useful term without being clear about the scope; is it parametrized by a single execution or all possible executions; with a specific version of implementation details or all valid choices of implementation details?

Does this phrasing match your expectations/needs, @ia0? The one thing I think you brought up that this framing doesn't acknowledge is the concept of "safety conditions" which are expected to be opened temporarily. But I think that this distinction doesn't come from the term used to classify the requirement, but from whether it is the requirement of a type or of a function3; the type's invariants are not enforced (cannot be violated) while the type is at rest (untyped memory), but only upon execution of a function (or, for language invariants, a typed copy). That leaves space for e.g. robust functions which syntactically take &mut [u8] but which only perform writes, and thus accept references to uninit memory (assuming trivial drops don't assert initialized).


The one wrinkle is that the language potentially defines a "library invariant" in that the vtable part of a raw dyn pointer needs to match the expected pointee type for pointer upcasting, which is a safe operation. (Current leaning is, IIRC, that the vtable part not be required to be valid for access until a dynamic call is made.) More precisely, it would be a language defined precondition on the operation, rather than an invariant attached to the type (thus all typed copies), but because the operation is safe, it's a safety condition relevant to safe code as well as unsafe code. It makes the language/library split a bit fuzzier than it might otherwise be. All other safe primitive operations are guaranteed sound by a typed copy (plus retag if reference), IIRC.

Footnotes

  1. I've yet to see/make a proper argument, but I believe this property is necessary for the program stdin().read_line()?; unreachable_unchecked() to be free from UB if no newline is ever sent to stdin; if the UB could "time travel" before the IO, it could insert a write to its own stdin before the read.

  2. Technically this is constrained to always be validity of a read/write operation, so is directly related to type validity (of a typed copy operation) in that it's about validity of language defined operations, but the type/byte validity of a pointer being distinct from "pointer validity" is certainly not great.

  3. And it's here that I expect my classification splits the most from yours, as AIUI your unsafe mental model doc reifies the safety conditions of functions into flow-refinement anonymous subtypes. (FWIW, I think that framing the unsafe reasoning of "type plus conditions that hold" as refinement typing is more convoluted than is useful as a teaching tool, unfortunately. But perhaps I'm wrong, and it's certainly useful as a step from informal to more formal reasoning.)

@chorman0773
Copy link
Contributor

I've yet to see/make a proper argument, but I believe this property is necessary for the program stdin().read_line()?; unreachable_unchecked() to be free from UB if no newline is ever sent to stdin; if the UB could "time travel" before the IO, it could insert a write to its own stdin before the read

It actually does not, for the simple reason that if such input is never sent, nothing with undefined behaviour ever occurs.

However, if you did stdin().read(&mut [0u8; 32]); unreachable_unchecked() it's entirely valid under current rules to eliminate the read (or same if it was replaced with a write).

@ia0
Copy link

ia0 commented Dec 24, 2024

I very often say "specification" to mean "soundness-related specification". I'm clarifying that upfront because it may be confusing if one understands "specification" to mean "correctness specification".

I'm not particularly fond of "safety violation" as the term for communicating lib UB, as causing lang UB is also a safety violation.

In that case, what about:

  • "language UB" renamed to "language safety violation", "language SV", or "language violation" (the term "undefined behavior" or "UB" can still be used)
  • "library UB" renamed to "library safety violation", "library SV", or "library violation"

The problem I have (and I believe @digama0 too) is that UB is a language specification concept, not a program specification concept. Talking about "library UB" is mixing program specification and language specification. So we have to choose between:

  • Preserving the usual meaning of UB about undefined operations in the language semantics.
  • Extending the usual meaning of UB to cover all types of contract violations (language and library).

An API is sound if it cannot be used to execute language UB without violating a safety condition.

How can an API be executed? I don't think extending the notion of soundness to APIs or specifications is useful. You mention this later that soundness is abused, I agree. You can see my reply there.

An implementation is sound if it cannot violate a safety condition without having another safety condition being violated first.

I'm assuming you meant "without having one of its safety condition violated first".

And I would tend to push back against this definition, because soundness should only be about UB, and here you extend the notion of soundness to library safety conditions, not just language safety conditions. That's the same issue as I presented above. I think this is the crux of the problem. Do we want to extend UB and soundness to something else than just the language semantics? I'm quite strongly opposed as this would deviate from their usual meaning.

The way I would formulate your sentence would be: An implementation is sound if it doesn't do any of the following without having its safety condition (as defined by its specification aka API) violated:

  • Execute language UB.
  • Violate the safety condition of another library.
  • Violate its own safety condition as seen by its user (i.e. the implementation is correct with respect to its safety specification).

Soundness Requirements
The subset of the safety requirements which are necessary to uphold soundness (absence of Language UB). A violation is a Soundness Violation.

I don't think this is a useful concept because it suggests that it is fine to violate library safety conditions. Also note that this is a different concept than what I mean by soundness invariant (more precisely soundness type occurrence invariant). Soundness invariant is simply the safety condition of a library encoded into the types occurring in the library API. This is just a matter of style. What matters is to understand that safety condition is a different concept than the safety invariants of the types occurring in the library API.

(Although note that these definitions rely on the technically informal assumption that despite UB time traveling, it cannot time travel over an observable sequence point (side effect / IO) of the program to be fully formally accurate. (A compiler would likely need to violate actual causality1 to violate this property.) This still preserves all other "time travel" effects of UB.)

This assumption can be made formal. It simply follows from the fact that the program specification is not part of the program, in particular the compiler doesn't know it, so it must assume that it can be arbitrarily specific. To take your example, if the program specification is that standard input should never contain a newline character (or whatever read_line considers the end of a line) then the program is defined to simply consume its input and never return (in practice it will eventually fail an allocation and panic).

"Validity" isn't an ideal term because of how overloaded "validity" already is

I agree with this, and it's indeed a good argument to try to change that name. The thing is that it's very convenient to talk about valid and invalid values, which I do in the unsafe mental model. But as you pointed out, in practice we usually only talk about pointer reads or writes being valid, not really about a value being valid. I checked the documentation of NonZero and indeed there's no mention of zero being an invalid value (same for NonNull and bool). It would be nice if the replacement name allowed for an adjective applying to values, but that doesn't need to be a strong requirement.

It's also important to note that "soundness" isn't a useful term without being clear about the scope; is it parametrized by a single execution or all possible executions; with a specific version of implementation details or all valid choices of implementation details?

I agree that soundness is a bit abused. Technically it is a property of a predicate over specified programs such that programs satisfying this predicate don't have undefined behavior for all their executions satisfying their specification. One way to define such predicate is with a type system. But you can also use static analyzers. Such static tools are sound if the predicate they define has the soundness property. The completeness property is the opposite direction. If a program doesn't have undefined behavior for all its executions satisfying its specification, then the predicate should hold on this specified program.

One way the word is extended is to talk about sound programs to mean that they don't have undefined behavior for all their executions satisfying their specification. Another is for sound libraries to mean that they implement their specification, don't have undefined behavior, and don't violate the specification of other libraries as long as their own specification is not violated.

So to answer your 2 questions:

  • Soundness is for all executions admissible by the specification.
  • Soundness is for a specific implementation because you need to be able to execute it.

The one thing I think you brought up that this framing doesn't acknowledge is the concept of "safety conditions" which are expected to be opened temporarily.

I'm not sure what you mean by that. If it's about the example you gave at the end of the paragraph:

That leaves space for e.g. robust functions which syntactically take &mut [u8] but which only perform writes to accept reference to uninit (assuming trivial drops don't assert initialized).

Then this is not about "safety condition" being opened/broken but "safety invariant" being opened/broken. A robust function taking &mut [u8] specifying that it accepts uninitialized input must be understood as:

  • The safety invariant of its argument is mutable references to initialized memory, as defined by the type. Callers can violate this because it's not part of the safety condition (library soundness contract).
  • The safety condition of its argument (here a guarantee, because the function is robust, rather than a requirement as for unsafe functions) is mutable references that don't need to be initialized, as defined by the type and documentation. This is what I call the soundness invariant. Callers cannot violate this (unless they look at the implementation and come up with a different safety condition).
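One safe way to realize such a robust function is to move the weakened invariant into the signature with `MaybeUninit` instead of prose documentation (a sketch; `fill_zero` is a made-up name):

```rust
use std::mem::MaybeUninit;

// Write-only function, so it is robust: callers may pass memory that was
// never initialized. Encoding that in the signature is one way to express
// the "soundness invariant" of this particular type occurrence.
fn fill_zero(buf: &mut [MaybeUninit<u8>]) -> &mut [u8] {
    for b in buf.iter_mut() {
        b.write(0);
    }
    // Safety: every element was just initialized above.
    unsafe { std::slice::from_raw_parts_mut(buf.as_mut_ptr().cast::<u8>(), buf.len()) }
}

fn main() {
    let mut raw = [MaybeUninit::<u8>::uninit(); 4];
    let zeroed = fill_zero(&mut raw);
    assert_eq!(zeroed, &[0u8; 4]);
}
```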

3. I think that framing the unsafe reasoning of "type plus conditions that hold" as refinement typing is more convoluted than is useful as a teaching tool, unfortunately. But perhaps I'm wrong, and it's certainly useful as a step from informal to more formal reasoning.

It's a matter of style, so some may find it more convoluted and others may find it more intuitive. That's fine. I also don't think it's a necessary step for more formal reasoning. This is just a matter of style whether one prefers to prove a program sound through a sound type system (where properties are in the types), a sound verification tool (where properties are on the side), or a combination of both.

@CAD97
Copy link

CAD97 commented Dec 24, 2024

I think we basically agree at this point except for, IIUC, this big mismatch in application of terms:

  • I'm saying «safety invariant» ⊆ «safety condition», but
  • You're saying «safety condition» ⊆ «safety invariant»; and
  • I'm claiming «soundness requirement» ⊆ «safety requirement», but
  • You understand «soundness requirement» ≡ «safety condition».

AIUI:

| You | Me |
|---|---|
| safety condition | soundness condition |
| safety invariant | safety condition |

As a particular example: that the str bytes are UTF-8 is a safety condition of str::len (implied by it being a safety invariant of str). However, it is not a soundness condition, as the function does not even read the value of any bytes, thus invalid UTF-8 does not result in potential for UB.
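A sketch of that example; the unchecked line stays commented out since it commits library UB (and whether `str` additionally has a UTF-8 validity invariant is left open by this thread):

```rust
fn main() {
    let bytes = [0xF0u8, 0x28, 0x8C, 0x28]; // not valid UTF-8
    assert!(std::str::from_utf8(&bytes).is_err());

    // Library UB: constructing the `&str` below violates the documented
    // safety condition of `from_utf8_unchecked`. Under the reading above
    // it is not a *soundness* condition of `str::len`, because `len` only
    // reads the slice metadata, never the bytes:
    // let s = unsafe { std::str::from_utf8_unchecked(&bytes) };
    // let _n = s.len(); // would return 4 without inspecting any byte
}
```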

I hold that my use of the terms that an invariant is a condition is the common understanding, and you're the odd one out to define «safety condition» as a looser constraint than «safety invariant» is1.

Everything else at this point is we're talking past each other due to that terminology mismatch. I think I need to let @RalfJung or someone else say their part now. But some specific points:

Talking about "library UB" is mixing program specification and language specification.

This is very much an uphill battle, given how std docs claim any violation of its documented safety requirements to be UB, despite many only being library UB.

I don't think extending the notion of soundness to APIs or specifications is useful.

Then what does it mean that “std::env::set_var is unsound” if an API cannot be sound? It can be used safely (in a single threaded context), but it can also be used in combination with other safe (libc) API to cause language UB.

I'm assuming you meant "without having one of its safety condition violated first".

If you include the implicit safety condition of "all of the safety invariants of the types I take hold" as one of its safety conditions, and not a safety condition of whatever library defined that type, then yes.

soundness should only be about UB, and here you extend the notion of soundness to library safety conditions, not just language safety conditions.

Because I think being sound in the face of rustup update; cargo update is relevant. Especially since cargo library implementations don't get to choose their dependencies' versions.

But in retrospect I should've also added the implicit second clause of “and it cannot violate a soundness condition without having another soundness condition being violated first,” which may have made the difference between the two clauses more apparent.

I don't think [soundness requirements are] a useful concept because it suggests that it is fine to violate library safety conditions.

It is acceptable for a library to open/violate its own safety invariants, precisely because being within the scope of the same soundness proof enables it to separate safety requirements from soundness requirements. This merely puts the program state into an unsafe state that must be handled carefully, and not exposed to code which expects the safety invariants to all hold.

It is not acceptable for code to violate its upstreams' safety conditions, because the library guarantee only extends to the safety conditions being sufficient for soundness. That soundness conditions may be weaker than the documented safety conditions is a private implementation detail that may change in an update.

But distinguishing between library safety violations and language UB is still useful, because automated tooling can only ever diagnose the latter, unless assertions are manually added.

The thing is that it's very convenient to talk about valid and invalid values,

And I expect we still will. It will just have a contextual meaning of satisfying the relevant requirements rather than specifically whether bytes are a valid instance of the syntactical language type shallowly (validity invariant).

I.e. “accepts any valid T” is a meaningless statement without further context (i.e. valid under what assumed potential operations).

Soundness is for a specific implementation because you need to be able to execute it.

I was talking about implementation details that are encapsulated away from the analyzed code, thus “don't violate the specification of other libraries” means soundness is parametric over all encapsulated implementation details (or more precisely, it is based on its upstreams' soundness specification only, which must not be loosened to be sound in fewer cases by an update).

Footnotes

  1. I understand how this occurred given you were extending the concept of "safety invariant" to allow opening the invariant, and need a term for the weakened condition. But I hold that the better axis is to weaken safety to soundness.

@ia0
Copy link

ia0 commented Dec 24, 2024

Everything else at this point is we're talking past each other due to that terminology mismatch.

Exactly, so I'm putting the 2 main contention points below. I'm replying to your comment at the end of this comment.

Contention points that need agreement

Agree on what UB should mean

There are 3 competing definitions:

  • Preserving the usual meaning of UB about undefined operations in the language semantics:
    • @digama0: I would call them "language UB" (or just UB as I think there is only one thing that should get that name). source
    • @Lokathor: I don't think we should call any other similar situations "library UB" either. source
    • @ia0: UB is a language specification concept, not a program specification concept. source
  • Extending the usual meaning of UB to cover all types of safety contract violations (language and library).
    • @RalfJung: Violating such library preconditions is often called UB, and I think it makes sense. source
  • Extending the usual meaning of UB to cover potential UB (that's slightly different from the one above, as it doesn't care about the contract and instead cares about the implementation).
    • @Diggsey: I might want to just document that it's always UB to call foo(0). source
    • @CAD97: All instances of “latent UB” have a risk factor of how likely it is that the UB will be exploited. source

How can we come to an agreement here?

EDIT: Adding name to those options so we can refer to them:

  • No language UB: There is only UB (no qualifier). It reads undefined behavior and means undefined language operation.
  • Language UB < UB: UB reads undefined behavior and means safety contract violation. It can be refined with qualifiers.
  • UB = language UB: UB reads undefined behavior and means undefined language operation when used without qualifier. It reads unexpected behavior and safety contract violation when qualified (details) with a focus on the possible consequence of unexpected behavior.

Agree that library UB is not about safety invariant

| Violation name | Contract violated | Type invariant violated |
|---|---|---|
| Language UB | language safety condition | validity invariant |
| Library UB | library safety condition | soundness invariant |

The safety condition of a library is the safety invariant of the types of its API modified by its safety1 documentation. The soundness invariant is the name of the modified invariants. Each type occurrence may be modified differently, so the soundness invariant does not apply to a type but to a specific occurrence of a type (in the library API).

Note that modifying the safety invariant is not necessarily a refinement (strengthening), it can also weaken the invariant up to the validity invariant. This is why in the unsafe mental model I start from the validity invariant, such that only refinements are used.

Replies to @CAD97

  • I'm saying «safety invariant» ⊆ «safety condition», but
  • You're saying «safety condition» ⊆ «safety invariant»; and

I'm actually saying something else. The soundness invariant (which I'm fine to call safety condition, although I would prefer safety and robustness condition) is not comparable with the safety invariant. This is because it refines the validity invariant, not the safety invariant. There are 4 options:

| soundness < safety | safety < soundness | Description |
|---|---|---|
| Holds | Holds | The type occurrence is neither robust nor unsafe (the invariants are equal). There is no soundness documentation on this type occurrence. The function is made neither unsafe nor robust because of that type occurrence. |
| Holds | Does not hold | The type occurrence is robust. If used as a function parameter, it makes the function unsafe (contra-variance). If used as a function result, it makes the function robust (co-variance). |
| Does not hold | Holds | The type occurrence is unsafe. If used as a function parameter, it makes the function robust (contra-variance). If used as a function result, it makes the function unsafe (co-variance). |
| Does not hold | Does not hold | The type occurrence is both robust and unsafe. If used as a function parameter or result, it makes the function both robust and unsafe. |
  • I'm claiming «soundness requirement» ⊆ «safety requirement», but
  • You understand «soundness requirement» ≡ «safety condition».

Yes, because I don't have the same definition as you for soundness requirement (I would prefer the term "soundness-related specification" or "specification for the purpose of soundness"). With your definition (looking at the implementation to come up with a specification, instead of looking at the specification), I agree on the subtyping relation (assuming we only care about the requirements part and not the guarantees part). So I think we agree here. And I'm saying that we shouldn't talk about your notion of soundness requirement, because it legitimizes library UB by saying: who cares about the safety condition the library author wrote, just read the implementation.

AIUI:

| You | Me |
| --- | --- |
| safety condition | soundness condition |
| safety invariant | safety condition |

No, it's more:

| Me | You |
| --- | --- |
| I don't name this concept | soundness condition |
| safety condition | safety condition |

As a particular example: that the str bytes are UTF-8 is a safety condition of str::len (implied by it being a safety invariant of str). However, it is not a soundness condition, as the function does not even read the value of any bytes, thus invalid UTF-8 does not result in potential for UB.

I agree with this statement, but I don't think it's a useful observation. This is the notion of malleability, which I don't think is good to mention too much; in particular we shouldn't name it publicly. People should simply not violate contracts. If they do, they should come up with a story themselves. We shouldn't provide them with the tools and make their life easier, or we'll end up with a very unhealthy ecosystem. Such behavior should be exceptional.

I hold that my use of the terms (that an invariant is a kind of condition) is the common understanding, and that you're the odd one out in defining «safety condition» as a looser constraint than «safety invariant».

That's correct. I'm distinguishing safety condition from safety invariant (there's no real subtyping relation between them, it can go both ways when we consider robustness, which is necessary for composable proofs). I'm adding this to the list of contention points that need agreement below.

Then what does it mean that “std::env::set_var is unsound” if an API cannot be sound? It can be used safely (in a single threaded context), but it can also be used in combination with other safe (libc) API to cause language UB.

It's because this API also has a correctness specification. It's not enough to implement the type and safety documentation. You also care about the non-safety documentation. Otherwise the implementation that does nothing is trivially sound.

I guess we could say an API cannot be sound if there is no sound implementation satisfying both the correctness specification and the soundness specification of the API.

If you include the implicit safety condition of "all of the safety invariants of the types I take hold" as one of its safety conditions, and not a safety condition of whatever library defined that type, then yes.

Yes, for me the safety condition covers both the type invariants and the safety documentation. In particular I distinguish the safety documentation from the safety condition. That's probably why I prefer to call the safety condition the soundness-related specification or the soundness invariant.

Because I think being sound in the face of rustup update; cargo update is relevant. Especially since cargo library implementations don't get to choose their dependencies' versions.

I guess that's where there is a misunderstanding. For me, violating the safety condition of a library (called library UB) is forbidden, the same way language UB is forbidden. And it's not forbidden because library UB may result in language UB which is forbidden, it's because violating a safety contract is forbidden full stop. It doesn't matter if it's immediate language UB or not. That's exactly why I don't like your notion of soundness requirement which opens the door to such behavior. If you want a healthy ecosystem, people should not violate contracts (in particular safety contracts). Library UB is how you prove soundness in a compositional way. It's important to understand that soundness is not completeness. It's perfectly fine to be over-restrictive and reject programs without undefined behavior because some contract is too strict.

It is acceptable for a library to open/violate its own safety invariants, precisely because being within the scope of the same soundness proof enables it to separate safety requirements from soundness requirements.

I agree with this statement and the other ones in the same block. But I would formulate it differently. You're not really doing library UB in that case, because you have access to a different safety contract, an internal one. I wouldn't call this library UB; otherwise it might suggest to other people that it's fine to have library UB.

Footnotes

  1. Depending on whether we want to distinguish the library safety condition from the library specification for the purpose of soundness, we should also include the robustness documentation (which is part of the library specification for the purpose of soundness).

@ia0

ia0 commented Dec 26, 2024

The UB confusion can actually be summed up in a picture.

Picture in the syntactic world

UB = Language UB <------------------\
     |                (refines to)   --- UB
     |  Library UB <----------------/
     |  |
     |  \---------------\
     |     (may cause)   ---> UB
     \------------------/

Picture in the semantic world

UB = Language SV <------------------\
     |                (refines to)   --- SV
     |  Library SV <----------------/
     |  |
     |  \---------------\
     |     (may cause)   ---> IB
     \------------------/
  • UB (undefined behavior): operation that the operational semantics does not define
  • SV (safety violation): usage that does not conform to a safety specification1
  • IB (incorrect behavior2): actual execution3 that does not conform to the program correctness specification4

Parallel lines

We have 2 pairs of parallel lines: those refining SV and those possibly causing IB. That's why we can say that library SV and language SV are fully parallel.

However, UB distinguishes between them. In particular:

  • Language SV proves that the program is unsound because it has UB.
  • Library SV breaks the proof that the program is sound. The program may be proven sound with another proof, or it may be proven unsound if it has UB.

Confusion

The confusion is the result of the interpretation function from syntax to semantics being not well-defined. The best I can come up with is:

  • In the context of a safety specification, UB means SV.
  • In the context of an actual execution, UB means IB.
  • In the context of the operational semantics, UB means UB.
  • Otherwise, all bets are off.

Footnotes

  1. More precisely, a specification for the purpose of proving the absence of UB.

  2. Formerly "unexpected behavior", renamed to avoid the first-letter confusion.

  3. Execution for a given implementation stack (compiler implementing language, OS implementing environment, CPU implementing ISA, etc. down to universe implementing current laws of physics), compared to reduction which is only at the operational semantics level.

  4. This does not necessarily imply UB down the implementation stack. UB can be lowered to a defined program that does not behave like the initial source program correctness specification.

@RalfJung
Member Author

RalfJung commented Jan 7, 2025

Preserving the usual meaning of UB about undefined operations in the language semantics:

There is ample precedent for calling things "UB" that are library UB; the entire C++ standard library is specified that way, and as mentioned above the Rust standard library docs also use "UB" to refer to more than just language UB. So I don't agree with your framing here implying that I would not be preserving the usual meaning of UB. I think I am in fact preserving its existing meaning. Or rather, I am disentangling what is usually just called "UB" into two subclasses: "language UB" and "library UB".

Replying to one point by @CAD97:

The one wrinkle is that the language potentially defines a "library invariant" in that the vtable part of a raw dyn pointer needs to match the expected pointee type for pointer upcasting, which is a safe operation. (Current leaning is, IIRC, that the vtable part not be required to be valid for access until a dynamic call is made.) More precisely, it would be a language defined precondition on the operation, rather than an invariant attached to the type (thus all typed copies), but because the operation is safe, it's a safety condition relevant to safe code as well as unsafe code. It makes the language/library split a bit fuzzier than it might otherwise be. All other safe primitive operations are guaranteed sound by a typed copy (plus retag if reference), IIRC.

I don't think this is a special case in any way. Yes, the language invariant is not sufficient to justify soundness of a primitive operation here. But that's completely fine, and happens elsewhere as well:

  • The language invariant I am proposing for &bool does not require the data in memory to be a valid bool, and yet this is required to ensure soundness of *
  • The language invariant for fn() says nothing about the code that will be executed when the function is being invoked, and yet we need this to be a safe function to ensure soundness of function calls.

The language invariant is not the invariant relevant for a type safety proof. That would be the safety invariant / library invariant. (There's a reason I used "safety" here in my original blog post. ;) So I don't think this observation is surprising or concerning.

@ia0

ia0 commented Jan 7, 2025

Thanks! This answers the first point and gives a meaning to the notion of "program UB". For example:

  • Executing a Rust program implementing a PDF reader assuming the input file is well-formed (for performance reasons and because it is expected to run in controlled environments only), but providing a malicious PDF.
  • Executing any safe Rust program on a system with malicious processes. By default, safe Rust programs assume nobody messes with their runtime state (and the kernel, CPU, etc are not buggy).

This leaves us with the second point and the topic of this issue. There is no parallel between language/library UB and validity/safety invariant. More precisely, this parallel only holds outside the scope of unsafe. So if we use "language invariant", we shouldn't use "library invariant" to avoid confusion.

Assuming everybody prefers "language invariant" for validity invariant, I can see at least 4 options for safety invariant:

  • "safety invariant" (status quo): this was suggested here and I also believe that's a reasonable option.
  • "library invariant": this was suggested in the issue description and I'm worried this will lead to confusion due to the parallel being only partial.
  • "safe library invariant" (resp. "library definition invariant"): I'm suggesting this as a middle ground. We could always say that the "safe" (resp. "definition") word may be omitted for brevity. With this terminology, it is clear that the parallel only holds for type occurrences in safe1 public APIs (resp. for the type as it is defined, as opposed to where it's used).
  • "definition invariant": Same as above but without the "library" word. That's the invariant of the type as defined by its library. Other libraries may use that type with safety documentation to locally modify this default invariant for given type occurrences.

Footnotes

  1. not unsafe and not robust

@ia0

ia0 commented Jan 7, 2025

the entire C++ standard library is specified that way

By the way, isn't the C++ standard library part of the C++ language? Aren't implementations allowed to look at symbol names and if they recognize something from the standard library optimize based on the specification of this symbol by the language? If yes, I'm not convinced by this part of your argument. However, I'm convinced by the other part, which was also mentioned earlier: "This ship has unfortunately sailed long ago" (source).

@RalfJung
Member Author

RalfJung commented Jan 8, 2025

There is no parallel between language/library UB and validity/safety invariant.

I don't agree; I still think there is a completely fine parallel here: generally, violating the library invariant without violating the language invariant risks library UB but does not immediately risk language UB.
I don't think the extra fine-grained distinctions you are trying to introduce here are useful.

By the way, isn't the C++ standard library part of the C++ language? Aren't implementations allowed to look at symbol names and if they recognize something from the standard library optimize based on the specification of this symbol by the language?

I guess since they don't make their AM explicit one can't really say... but I don't think anyone knows how to make all those constraints operational so I don't think there is a reasonable way to make all of this AM-level UB.

@ia0

ia0 commented Jan 8, 2025

I don't think the extra fine-grained distinctions you are trying to introduce here are useful.

The reason I believe the distinction between safety invariant and "soundness invariant"1 matters is that type invariants only matter when writing unsafe code (otherwise the type system takes care of them), and unsafe (or robust) public APIs are where those invariants may differ, so it's important to distinguish them. But I agree that:

  • This is somehow a matter of style, so not that important. I'm fine with "library invariant"; I understand it as "default library invariant" or "library definition invariant". The word "library" is good, just not enough on its own.
  • There might actually be no confusion if people read the definition of "library invariant" instead of (or before) looking at the parallel. In particular, as long as they understand it's a type invariant and not a type occurrence invariant, which is enough to disambiguate any possible issue.

Footnotes

  1. Invariant for a given type occurrence in a public API. This is the combination of the safety invariant of the type and the safety documentation of the type occurrence. Violating the soundness invariant is library UB, the same way violating the validity invariant is language UB.

@RalfJung
Member Author

RalfJung commented Jan 9, 2025

Invariant for a given type occurrence in a public API.

That's not an invariant, that's just the precondition of a particular function. I don't think we need to come up with a new term for this very well-established concept.

@steveklabnik
Member

steveklabnik commented Jan 9, 2025

By the way, isn't the C++ standard library part of the C++ language?

They're both part of the standard, but it really depends on what you mean by "part". For C++23, the standard library is defined in section 17.

It's roughly analogous to "is libcore part of the Rust language?". Clause 17 standardizes the "language support library," which does the rough equivalent of defining what we refer to as "lang items" here in Rust.

Aren't implementations allowed to look at symbol names and if they recognize something from the standard library optimize based on the specification of this symbol by the language?

Yes. A famous example here is how memory allocation isn't considered observable behavior, so this code:

#include <new>  

int foo(){
    int *five = new int;
    return 42;
}

will compile to

foo():
  mov eax, 42
  ret

(This is also the case if you use malloc instead of new)

I don't have any strong opinion about if this has any implications on how Rust defines UB, just figured I'd add some context here.

@ia0

ia0 commented Jan 10, 2025

That's not an invariant, that's just the precondition of a particular function.

Those can be seen as the same thing; it's a matter of style. One is "contracts as types" and the other is "contracts as pre-posts". For example, RefinedRust has annotations in the "contracts as pre-posts" style but is actually implemented with a "contracts as types" style. A concrete example would be:

/// Randomizes the content of a slice.
///
/// # Safety
///
/// `ptr` must be valid to write for `len` bytes
pub unsafe fn randomize(ptr: *mut u8, len: usize);

In this example, I'm considering ptr constrained by the safety documentation and len unconstrained (which is the default without safety documentation).

In the "contracts as pre-posts" style, we have a lemma to prove by the implementer and to use by the callers:

randomize: ∀ (p ∈ ℤ) (n ∈ ℤ)
    { valid<*mut u8>(p) ∧ valid_write(p, n) ∧ valid<usize>(n) ∧ safe<usize>(n) }
    randomize(p, n)
    { valid_write(p, n) }

In the "contracts as types" style, we have a type in a richer type system:

randomize: Π {p: ℤ} {n: ℤ}
    ( ptr: valid<*mut u8> | repr<*mut u8>(ptr, p) ∧ valid_write(p, n) )
    ( len: valid<usize> | repr<usize>(len, n) ∧ safe<usize>(n) )
    ->
    ( _: safe<()> | valid_write(p, n) )

In particular, the invariant of the occurrence of *mut u8 in the signature of randomize() is not the safety invariant, but something stronger, because it is constrained by being valid for write, while the invariant of the occurrence of usize is the safety invariant, because it's unconstrained and thus uses the default type invariant which is the safety invariant.

It's important to note that type occurrences constrained by the safety documentation refine the validity invariant, not the safety invariant (in particular, the safety invariant is the default refinement of the validity invariant). This matters to express the type of Pin::get_unchecked_mut() (in this richer type system):

get_unchecked_mut: Π {p: ℤ} ∀ (T: Type)  // quantified over semantic types
    ( self: valid<Pin<&mut T>> | repr<Pin<&mut T>>(self, p) ∧ safe<Pin<&mut T>>(p) )
    ->
    ( res: valid<&mut T> | repr<&mut T>(res, p) ∧ pinned<T>(p) )

We can unfold the definitions of valid, repr, and safe on Pin<&mut T> to confirm that we have the identity function (and this function is just a cast):

get_unchecked_mut: Π {p: ℤ} ∀ (T: Type)
    ( self: valid<&mut T> | repr<&mut T>(self, p) ∧ pinned<T>(p) )
    ->
    ( res: valid<&mut T> | repr<&mut T>(res, p) ∧ pinned<T>(p) )

If pinned<T> is different than safe<T>, then the type invariant of the result is weaker than the safety invariant (but still stronger than the validity invariant). This means you cannot call safe functions with the result value, you have to call robust functions.

I don't think we need to come up with a new term for this very well-established concept.

Indeed we don't. It's just a matter of style, which is why I said "This is somehow a matter of style, so not that important". Some people like to prove properties of their programs using pre-posts, others using types.

We don't need a new term, but we need to understand the concept to see when the parallel between language/library UB and validity/safety invariant does not hold. You can violate the safety invariant without library UB (robust functions), and you can get library UB without violating the safety invariant (unsafe functions). Violating the validity invariant, by contrast, is always language UB (the converse holds when restricting to the type-invariant section of the language contract). Similarly, violating the "soundness" invariant is always library UB.

They're both part of the standard, but it really means what you mean by "part".

Indeed, if one can split the language specification Full into a smaller language specification Lang and a standard library specification Lib, such that it's possible to implement Lang, to implement Lib on top of Lang, and to implement Full on top of Lang and Lib, then we're also splitting the notion of Full UB into Lang UB and Lib UB. I agree this makes the distinction rather blurry and essentially implementation-defined (the Full implementation may decide to implement it all at once or split; in the latter case Lang doesn't know about Lib and can't optimize specifically for it, only generically as for any other function, by looking at the function implementation and "reversing a specification" to use at call sites).

@RalfJung
Member Author

RalfJung commented Jan 11, 2025

Those can be seen as the same thing, it's a matter of style

I strongly disagree with this. You are causing a completely unnecessary terminology confusion here. Using "invariant" for "has to be true when this type is passed across API boundaries (but exceptions apply within a function)" is fairly standard (in safety/library invariant); even Wikipedia calls this an invariant. We are already stretching the notion of "invariant" by using it for things that have to be true "on every typed load" (in validity/language invariant), but in some sense this is similar to the former use of invariant -- it's something that has to be true for all operations on this type.

But calling the precondition of a specific function an invariant is unheard of AFAIK and I will strongly push back against that. We use words that have specific technical meaning for a reason.

@ia0

ia0 commented Jan 11, 2025

But calling the precondition of a specific function an invariant is unheard of AFAIK

I'm not calling the precondition an invariant. In the "contracts as types" style, there are no preconditions, there are just types. Pre- and post-conditions only exist in the "contracts as pre-posts" style. However, I claim there is a parallel between them, which can be seen with those 2 sentences:

  • "contracts as pre-posts": Violating the preconditions of a public API is library UB.
  • "contracts as types": Violating the documented types ("soundness" invariant) of a public API is library UB.

Given that you mention the "on every typed load" particularity of the validity invariant, it's interesting to see that this is also true for the "soundness" invariant. You can annotate your program such that every typed load satisfies the "soundness" invariant (which implies the validity invariant). This gives us the parallel you want:

  • Violating the documented types ("soundness" invariant) of a public API is library UB.
  • Violating the "erased" types (validity invariant) of a language operation is language UB.

The difference with the validity and safety invariants, is that the "soundness" invariant respects subtyping (hence can be used to prove the program or library sound in a type system fashion). One never casts a value from T to S if the "soundness" invariant of T does not imply the one of S. While one can do so with the validity and safety invariant. You can cast from u8 to bool if the value is 0 or 1. With the "soundness" invariant, such a type cast is the identity function, it's the same "soundness" invariant on both sides, only the backing type is different. You use subtyping when you need to ignore information, for example when calling a safe fn(bool) with true, you subtype the "soundness" invariant of the constant (which is the singleton true) to the safety invariant (which contains both true and false).

@CAD97

CAD97 commented Jan 12, 2025

The difference here, AIUI, is that when @RalfJung or I say «type», we are referring to the syntactic type that is written (or inferred) in the Rust source code. Whereas @ia0 is using «type» to mean roughly the specific-to-an-API-signature semantic refinement of the «validity invariant» which captures all of the conditions for the soundness proof of that API signature.

Warning

Disclaimer: I worded this a bit authoritatively, but it isn't intended to be final, nor is it the expressed opinion of anyone in the mentioned groups other than myself.

@ia0: the work you've put into developing your unsafe mental model document is appreciated, and I do believe it's a useful resource; however, it is not shared vocabulary with T-opsem or the UCG. Neither surface Rust, MIR, nor MiniRust has any concept of refinement (sub)typing below the syntactic (language/validity) typing, so the attempt to use your modeling of API contracts as refinement types to influence a common vocabulary that doesn't utilize such refinement typing fails to translate properly.

I think we've identified the ideal resolution here:

I'm fine with "library invariant", I understand it as "default library invariant" or "library definition invariant".

Standard documentation can use “language invariant” and “library invariant” to refer to the (syntactic) type requirement that is required for a typed copy1 or for other safe API2 to be actually safe to use, respectively. And your doc which extends Rust's unsafety model with a more formal reasoning model for managing the unsafety in a program can go into refinement typing for attaching unsafe contracts on top of syntactic types, and how under this model the “library invariant” is just the default refinement for a syntactic type unless otherwise specified.


Also keep in mind that the terms discussed here are very much intended to be used as the “surface level” terminology that gets seen by everyone who writes Rust and uses the stdlib documentation. It's important for it not to be wrong, so that the path to the full details of safely using unsafe APIs stays reasonably simple, but it's also fully acceptable for there to be edge cases that require further details and aren't perfectly represented by the base-level split between “true because the language requires it” and “true because the library code requires it.”

And FWIW, I think “library invariant (of the syntactic type)” is a perfectly fine semantic for your model to replace with “safety invariant (of the specific flow-dependent type refinement at this point of the execution)” as your model is fully about replacing the syntactic type with the semantic type used for soundness reasoning.

Footnotes

  1. Unless otherwise specified. You may claim that language validity cannot be opened, and it can't directly be opened by the user, but e.g. MaybeUninit and MaybeDangling both are instances of the language providing syntactic types which open specific invariants required for typed copies by the language. The split is not “can't be opened” versus “can be opened” but rather “… by writing careful enough code.”

    The «validity invariant» as used by the UCG was always just the simple byte representation requirement for typed copies in the AM. You've reused the term in your unsafe model doc as the fully inviolable syntactic type invariant, but this is still a novel usage and not how the UCG has utilized the term.

    Plus, Rust always reserves the right to make any current UB into defined behavior. It is not endorsed reasoning to use the validity invariant as justification for doing anything but a typed copy; you are only allowed to do further operations because their requirements are part of your API’s safety requirements.

  2. Including other potentially implicit operations, e.g. dereferencing or casting. Relaxing requirements to give more code defined behavior is always allowed (as long as said code actually has defined behavior).
