-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🚲🏠 Renaming validity and safety? #95
Comments
One reason why I suggested "initialziation invariant" is because it connects with
|
Also note that the validity invariant only applies to data that the compiler considers initialized -- this has been a point of confusion several times. Having something with "initialization" in its name should help clarify that. |
In the "Machine invariant" seems like a reasonable choice. But then |
On one or two past threads I've suggested that we use "sound" to mean "has defined behavior" and "unsound" to mean "is UB", and I still think that's a good idea. So by analogy, I'd suggest renaming the "validity invariant" to the "soundness invariant". At least to me, that makes it immediately obvious what it's supposed to mean: if you violate this, then your code is broken, full stop. Regardless of which particular word we choose, I do think that it's strongly desirable that "X-ness invariant" corresponds to "if your code is X, it does not have UB". I would not be surprised if "sound" and "unsound" have different connotations to other people and we have to come up with something else, but if we end up agreeing on (for example) "validity invariant" for this, I believe we should also agree to start using "valid" with the very specific meaning of "does not have UB". IMO, that's the biggest disadvantage of "initialization" or "machine" invariant: it clearly doesn't make any sense to say "if your code is initialized/machined, it has no UB", so we can't hope to kill two birds with one piece of jargon that way. Also, to me "machine invariant" sounds very strongly like it's trying to distinguish hardware properties from software properties, which is presumably not a distinction the Rust abstract machine would make. It didn't even occur to me the "machine" in "machine invariant" might refer to the Rust abstract machine until I re-read Centril's comment above. |
Code that does not violate this isn't necessarily sound, e.g., if it violates the safety invariant. |
"Initialization" is super confusing. The other day some reddit users where super confused about why this is always correct: Their argument was that the documentation of This issue was raised multiple times during the RFC, FCP, etc. of MaybeUninit and I still have no idea why we ended up in a place where we have to explain people in Rust that sometimes "uninitialized memory is initialized", but I rather don't introduce the same issue in the UCGs. If someone wants to introduce the same issue here, they should IMO make a case again for it. I'm personally fine with using the term "validity", and I believe that any Rust programmer dealing with unsafe code and uninitialized memory is smart enough to learn a new word and its meaning, particular a word that is used many times throughout the spec. This means that when reading the spec they will probably have to first unlearn the "initialization" terminology used in the API docs, then learn a new word and its meaning, and afterwards mentally replace "initialization" in the API docs everywhere with that word for anything to make sense ("uninitialized memory is initialized" vs "uninitialized memory is valid at type T"). This situation is unfortunate, but better than introducing here another overloaded term like "machine" that might end up confusing users in different ways that "initialization" does today.
What does the Why can't it just be A |
I disagree strongly. Why invent new terms for something that already has a term, namely, "has (no) UB"? If "has no UB" sounds too negative, an alternative positive wording I sometimes use is "has well-defined behavior". Worse, "sound" is already used to mean something else: A safe function / API is unsound if it is possible for safe code using it to cause UB. Otherwise, it is sound. EDIT: I propose we add this term to the glossary: #190
"code" isn't "initialized", that makes no sense indeed. But I think this is actually in fact another argument against your terminology: code can have UB in other ways than violating the validity invariant! Say we have code that uses |
I admitted exactly that a few posts up. :)
The name of what is now called Multiple people reviewed the stablization PR and also the others about documenting Yes, you raised a concern related to this. You made an "excruciatingly verbose" proposal but said yourself "I don't know whether the painfully excruciating approach is worth pursuing". And moreover, it's not even the same concern -- that one was about validity not being enough because there is also the safety invariant, now the issue is that "uninitialized memory" can be considered initialized. I do not think anyone raised that particular concern before stabilization. Prove me wrong.
The issue is that "this is a valid
This was discussed pre-stabilization, the argument was that
"machine" is indeed not suited as an adjective. We'd have to find another short-hand for "satisfies the machine invariant".
We will have to disentangle "(un)initialized memory" from "(un)initialized variable (of type |
People in that thread were confused about the "initialized" vs "validity invariant" semantics of "into_init" because it has "initialized" in its name. I thought this issue was important enough to raise it in the first sentence of the comment you just cited: "I think it is worth it to call out again that the only thing that into_init asserts is that the value satisfies the validity invariant of T, which is not to be confused with T being "initialized" in any general sense of the word." I proposed using I felt the issue fell in deaf ears, and the throughput of the discussion was too high for me to keep up with it, so I left it at that. So this precise issue was raised there and known, and from what I understood it was an explicit design decision to use "initialized" instead. Even though it could be confusing, as that discussion showed. |
You are quoting without context. The post goes on demonstrating the issue with an example of something that satisfies the validity invariant, but not the safety invariant. So, my reading of what you said -- and I still interpret your post the same way when I re-read it now -- is that "initialized" above means "safety invariant". Your concern (or rather, my interpretation of your concern) was that someone could "initialize" a If you meant that particular post to raise the concern that "uninitialized memory can be initialized because some validity invariants allow that", then we heavily miscommunicated. As far as I can see, nothing in your post indicates that you are worried about the interaction of uninitialized memory and
I was responding to the issue that Note that this is not the problem that is biting us now.
|
The other problem with enshrining "valid" anywhere in the API is that there is no consensus yet (that I know of) that this is the term we want to stick to -- source: this issue is still open. I'd be totally up for that. Of course the API is fixed now (we could consider renaming If we decide that the validity invariant arises from typed copies alone, another good term might be "representation invariant": in that case, a bit string is "valid" for a type if and only if the bit string represents some value at the type. However, this would preclude some options in #77: the representation relation is "pure" and cannot demand things like "the pointer must not be dangling" or "the pointer must point to memory that is valid for |
If this is the case, you are completely misunderstanding both the post, and also the confusion that the users in that thread had back then, as well as the confusion that people still have now. By "initialized"/"uninitialized" most users just mean whether that memory was touched or not (was something written to it?). They don't know that there are things like "validity" or "safety" "invariants", or any other kind of "invariant" for that matter. They are literally just confused that the memory was not touched or written to, therefore it is uninitialized, and claiming that it is initialized is incorrect, becasue that would mean that something was written to it, even if it is all just zeros (independently of whether anything has to be written at all). |
To expand, the confusion of users is orthogonal to whether something must actually be written at all, or whether what's written satisfies the validity or safety invariants, e.g., many users do believe that let mut x: MaybeUninit<T> = MaybeUninit::uninit(); // x is uninitialized
x = MaybeUninit::zeroed(); // x is initialized holds for all |
I don't recall anyone being confused about this (uninitialized memory), do you have pointers? And I do understand that the discussion right now is about uninitialized memory. But where does your post, your example back then indicate that you are worried about uninitialized memory? The example specifically was about creating an instance of a type that is valid but not safe. We do agree that that is not the issue right now, right? We have two confusions about
I think these are very distinct cases. In the first case memory is initialized, it is even valid, but then the issue is that there's another invariant, the safety invariant, that is violated. In the second case, memory is not initialized and yet it is valid. If you meant to raise the "nested |
If people think that I proposed back then to use "initialized" as proxy for "good enough to be taken out of the Where this failed -- what I did not foresee and what nobody raised before stabilization -- is that sometimes, our "initialized" is actually less than "something was written there" (the "nested |
Gentlemen, the historical details of how we arrived here aren't very important or particularly worth arguing about. We are here, we just need to decide how to move forward with the situation we've got. Here's some points I'll throw into the pile:
From this my takeway is that "Initialization Invariant" is probably not a good term to try and transition to. Telling people that initialized memory might not meet an initialization invariant is a little funky, and then explaining on top of that about a safety invariant that also might be required means you have to back up and admit that the initialization invariant doesn't include all of the safety rules that a fully initialized value of the type must follow. It quickly becomes a mess. I don't think we need to transition to any new term at all. Just improve the explanation of Validity Invariant and Safety Invariant by a bit in the glossary. EDIT: Ralf you posted while I was typing and covered some of the same points as me. XD oh well. |
@Lokathor so you are suggesting "validity invariant" and "safety invariant" are good enough terms, and we should start to use them "for real" because the concepts are already out there and it is important to get consistent and precise terminology? I won't disagree. But this will need consensus, maybe even involving the lang team. These are key concepts of the language. (Similar to "soundness", which I recently had to see someone use in a different way so I opened #190.) |
Right. I think the best path is to continue with Validity/Safety as terms, and then build consensus around those terms and increase education of those terms. |
For these users, let mut x: bool = uninitialized();
x = true; was ok because they don't think about validity or safety invariants. They just see uninitialized memory and initialized memory, and as long as you don't use it as such, they expected uninitialized memory to be fine. That's the exact same confusion that we are seeing here, where MaybeUninit usage of the word "initialized" means something very specific, but Rust users understand it differently depending on their background. Some people might understand it as initialized in such a way that it does not violate safety, while others understand it as uninitialized/initialized memory (that's the specific issue that the reddit users were having in the reddit thread).
Where was this argument made? I definitely raised that issue, but this wasn't the answer I got, e.g.,
First sentence, emphasis mine: "I think it is worth it to call out again that the only thing that into_init asserts is that the value satisfies the validity invariant of T, which is not to be confused with T being "initialized" in any general sense of the word." Safety or validity invariants are extremely Rust specific, and not even general Rust terms yet. Using "initialized" to mean any of those was predestined to confuse every user that attempts to transfers what "initialized" means in their dictionary or in the PLs they are familiar with. Particular C and C++ programers, where
AFAIK, these issues were raised back then, therefore they are not new, and the decision to use the "initialized" terminology balanced the tradeoff of being potentially confusing to some users with other tradeoffs like using a familiar word that users do not have to look up. If the RFC and FCP process in which the lang-team approved the "initialized" terminology did not consider the issues being raised about how confusing that terminology is, then these could be considered new issues, and we can change old decisions if new issues are raised. Otherwise, we cannot. And the open question is whether the spec should deviate from the terminology used in the MaybeUninit API. I think it is unfortunate that the words used in the MaybeUninit API confuse some users, but instead of using confusing terminology in the spec I think it is ok to just add a note to the glossary where the validity invariant is explained and mention that the MaybeUninit API calls this "initialized" or something like that. |
I think that's a very far-fetched statement. There are lots of subtleties around uninitialized memory, and you are taking all of them and claiming they are the same. Just because two people are confused about the same thing doesn't mean they are confused the same way, and doesn't mean the same approach is best to help them. That particular issue that you are now bringing up (which has basically nothing to do with what we were discussing before) was about someone not understanding that validity is required even on a typed copy, not just when the value is "used" in some obvious sense of the word. That doesn't even have anything to do with In fact I think that the "initialized" terminology helps here because then we can say that a This plan went fine, until people pointed out that sometimes uninitialized memory can be an initialized variable. And that's what triggered this discussion.
You said this already, and I responded to it before in great detail. Instead of repeating what you said, please reply to my post where I argue why that is a totally different problem that you raised there.
It considered the existing issues that were raised for why the terminology is confusing ("the But after stabilization, new issues were discovered ("the nested
Hm, I was sure I read this somewhere, but it's very hard to find anything in these discussions, and maybe I was wrong. Also remember at that point the type naming bikeshed in rust-lang/rust#56138 was 3 months done -- and in that bikeshed, I see you propose |
FWIW, I don't like any of the alternate names. The name |
All right, so I think we all agree that "initialization invariant" is a bad name for UCG purposes. We are having an argument about how foreseeable that was, but leaving that aside we have to see how we proceed. It is used in
So, for step 1, which contenders do we have?
|
I have no idea why you believe this is new evidence, and I disagree that it is, since it was pointed out, e.g., in the bikeshedding discussion, that using "uninitialized" as a word obscures that this is about "validity" (e.g. here), and sub-sequently, once it was decided that "uninitialized" should be used, it was requested that at least the documentation should talk about validity, because uninitialized-ness is irrelevant (e.g. here).
👍
I don't think the churn is worth it here. The brief API docs should use the term valid, and the body should explain it. |
I agree with @Lokathor's sentiment that contrafactual arguments are not productive so let's please end that argument and start from the facts that cannot be changed. ;)
I think that the language team is more interested in the concept itself rather than the label so I don't think FCP is necessary. Curious to hear what @nikomatsakis thinks. I think renaming the
If not "initialization" I think this is a reasonable choice (though I'm biased) and
There was the "representation invariant" which makes sense if the invariant fits the wording. Maybe what we need now to gather all the pros & cons for the various choices in a document? |
My understanding of Validity Invariant has been that it meant references and typed accesses and such must have "valid bits for a value of that type to have". |
The problem with "initialized" is that everybody thinks they know what it means so they don't look it up and end up being wrong in subtle ways that bites them in the form of undefined behavior. This happens to Rust lang team, library team, ... team members, experienced users, and new users alike. When "valid" was proposed, most people said is too vague, or asked: "Valid for what?". I agree that it is too vague, and some people might think they know what that means, and get it wrong. However, the people that said "Valid for what?" were all pointed to our definition in the glossary, which is a 10 seconds read, and then they knew precisely what it meant. The proposal "machine-valid" makes it less vague, but it also incurs the danger of people thinking "Ah, valid for the machine, that is, the hardware, and We need a word that is easy to use and remember once it is learned, but completely opaque for somebody that hasn't looked up its precise meaning, so that they will invest the 10 seconds it takes to look it up. |
From the two words we have used for this: many people actually asked "Valid for what?" but nobody asked anything about "initialized" (instead they just were surprised when it did not mean what they though it did). That makes "valid" and "validity invariant" much better terms from my point-of-view than "initialized" and "initialization invariant". I think it would be worth it to measure proposals to replace "valid" and "validity invariant" along this axis as well. |
I'm starting to think this is a chimera -- "representation invariant" might be the only term that would not have any problems but on the other hand then my understanding is that we must also answer #77 in a certain way for the term to make sense.
If you want we can disambiguate to "abstract-machine-valid" or "virtual-machine-valid" (shortened to AM-valid or VM-valid respectively). Many won't know what the initialism "AM" will stand for but at least they will not think they know what it means so they will look it up. |
I think "abstract-machine-valid" or "virtual-machine-valid" are better and I don't have a better proposal, but they are also quite a mouthful and I fear that we will just inadvertently abbreviate them as "valid", bringing us back to square one. |
There might be no substitute for code review here to root out inadvertent abbreviations to "valid". |
I wasn't even talking about the spec itself or the API docs, but just informal discussions, e.g., on Zulip, Discord, the forums, reddit, etc. Maybe everybody will stick to the script and write "abstract-machine-valid" or "AM-valid" everywhere, but I would find it reasonable if people abbreviate it to "valid" here and there, and this might result in "valid" becoming the de-facto term independently of what the spec or the docs say. |
That is absolutely what will happen, there is no doubt. |
@gnzlbg Maybe; I don't find "AM-valid" too burdensome. I suspect people will abbreviate to "valid" after having introduced "abstract-machine-valid" previously. E.g.:
which is fine IMO. |
A good point. This is a great name IMO if that's really how we end up justifying the entire validity thing. Which I think we should. |
Currently, we de-facto use initialization invariant to refer to "validity invariant", but that is very confusing as some types do not required initialized memory to be "initialized"... so I think we can scratch that idea (still mentioned in the OP here). Here's a naming scheme that occurred to me yesterday: language invariant and library invariant. After sleeping over it I still like it so I figured I would share it here. |
@RalfJung I do like that. |
I'm a fan of staying with "safety invariant" because it relates to the idea, "this is what you need to uphold for safe code to interact with your data" |
@Lokathor that's not entirely accurate though -- it's more, "this is what you need to uphold across a safe abstraction boundary". Safe code that you control may well work with data that violates the safety invariant. |
Well, code you control can always potentially go above and beyond the norm. I suppose to be more precise you'd say "this is what you need to uphold for unknown safe code to interact with your data". I suppose I just like it to more clearly line up with the safe / unsafe terms rust already has.
|
Yes, the "unknown" part is crucial. |
Closing for backlog bonanza. We probably can't change this and it doesn't seem useful to track in an issue |
It has been noted on several occasions, e.g. rust-lang/rust#53491 (comment) and rust-lang/rust#53491 (comment) that "valid" is prone to confusion with "safe" due to the way the former sounds. In particular, it raises the question "Valid with respect to what"?
Thus it is probably a good idea to rename one or the other into something else.
@RalfJung suggested "initialization invariant" (because any initialized
T
must satisfy such invariants) as a replacement of "validity invariant" while I suggested "machine invariant" (because it signifies the invariants the abstract machine requires of aT
). If we rename validity to either of those then "safety" might not need renaming; but if it does, "type system invariant" may be one candidate.The text was updated successfully, but these errors were encountered: