Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unions #1444

Merged
merged 14 commits into from
Apr 8, 2016
Merged

unions #1444

merged 14 commits into from
Apr 8, 2016

Conversation

joshtriplett
Copy link
Member

@joshtriplett joshtriplett commented Jan 5, 2016

RFC: native C-compatible unions via contextually recognized keyword union

EDIT: After extensive discussion, and grammar experiments by @nikomatsakis to verify feasibility, this RFC and pull request now proposes recognizing union as a "contextual keyword", allowing union to introduce a union declaration while not breaking any existing code that uses union as an identifier.

As discussed in the alternatives section, proposals for unions in Rust have extensively explored possible variations on declaration syntax, including longer keywords (untagged_union), built-in syntax macros (union!), compound keywords (unsafe union), pragmas (#[repr(union)] struct), and combinations of existing keywords (unsafe enum).

Rendered

Discussion on rust-internals

(edited by @nrc to add old title)

(edited by @mbrubeck to link to final rendered version)


A union may have trait implementations, using the same syntax as a struct.

The compiler should warn if a union field has a type that implements the `Drop`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warning or an error? I assume that the destructor of the field would not run when the union is dropped, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to make it an error, yes. However, Rust does not consider leaks or failing to run a destructor unsafe behavior, per the discussion that occurred around scoped threads. See the documentation of std::mem::forget.

So, I assumed that people would object to making this an error. If not, then I can quite happily change this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be to forbid Drop types for now. We can always change it to allow them later if there turn out to be compelling use cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@nrc nrc added the T-lang Relevant to the language team, which will review and decide on the RFC. label Jan 5, 2016
@nrc
Copy link
Member

nrc commented Jan 5, 2016

It might be worth summarising some of the discussion from the internals thread - there is a lot of it and it's not easy to follow the threads of conversation.

I strongly agree that we should support untagged unions in Rust. However, I think that #[repr(union)] enum is the better way to do it - it is more flexible since the variants can have multiple fields, it saves us adding another data structure, and I think unions fit the enum intuition more closely than structs.

Unions as enums works particularly nicely if are able to use variants as types (which may well be a long way off or may never happen, sadly). In that case only the downcast from the enum type to the variant type has to be unsafe (which it would be for any enum, I imagine) and then other use of the variant can be in safe code. In this case, the only difference for repr(union)/unsafe enums is that you can't match them.

@joshtriplett
Copy link
Member Author

See the mention of unsafe enum in the alternatives section; @retep998 wrote an RFC for that.

I can certainly see the argument for that, given that Rust enums represent tagged unions. However, modeling untagged unions on enums produces some syntactic challenges, though. How do you access a field of a union? Enum normally only supports pattern-matching syntax; since the pattern-matching requires unsafe code, pulling out a field F would require something like this: let f = unsafe { let MyUnion::F(f) = u; f };. That gets uglier if you have structs inside unions inside structs, which FFI interfaces actually use quite often (a struct of simultaneously valid fields, inside a union of many such structs, inside an outer struct containing a tag and other common fields). At that point, pattern matching for field access starts to approach the level of syntactic complexity introduced by using macros to define and access union fields; in particular, it does not naturally chain as well, and it requires mixing reading right-to-left with left-to-right ordering.

I suspect such syntax would also drive people to include more code in the unsafe block than necessary.

By contrast, field access syntax would simplify that to let f = unsafe { u.f }; And even with nested structures and unions, you'd have something like let f = unsafe { s.u.f.x };.

As discussed in the rust-internals thread and mentioned in the alternatives section of this RFC, you could potentially support struct field access syntax with unsafe enum, allowing access to the fields using dotted notation. However, that would make unsafe enum's syntax differ from the obvious expectation someone would have by looking at its definition, and dotted field notation doesn't make as much sense for safe, tagged enums. A new construct doesn't come with those syntactic expectations (and anyone coming from C will expect a union to use struct-like syntax).

An unsafe enum with multiple fields would also adds to the complexity of supporting field access syntax: if one constructor of the unsafe enum looks like foo(u32, f32), how would you name the two sub-fields? u.foo.0 and u.foo.1? Most FFI interfaces name the sub-fields, which would require using a separate struct anyway; in that case, would you have to write u.foo.0.subfieldname?

Writing to fields seems similarly more complicated with enums.

As a minor additional nit, Rust warns by default for enum constructors that start with a lowercase character; many FFI interfaces would end up needing to disable those warnings.

I think the case of defining an inline structure would work better with an RFC for anonymous struct and union types; I'd be quite happy to write such an RFC as well. Many FFI interfaces will want those anyway, for the common case of a struct containing an anonymous union. However, I don't think that should form part of this RFC; I would suggest a followup after resolving this one. In the meantime, it seems simple enough to define a struct (or tuple struct) and make that a field of the union.

All that said, I could live with unsafe enum or similar along with struct field access syntax; it doesn't seem as intuitive to me, but I'd take it over not having native support for unions at all. (Only having pattern matching, by contrast, seems like a major wart for ergonomic union usage.) As mentioned, I don't care that deeply about the syntax for declaring unions; I care a lot more about the syntax and semantics for using them.

@joshtriplett
Copy link
Member Author

It might be worth summarising some of the discussion from the internals thread - there is a lot of it and it's not easy to follow the threads of conversation.

As far as I know, I've captured all the major threads of discussion (including alternatives raised and the reasons for them) in the alternatives section. If I've missed one, I can certainly add it.

The largest discussion was between unsafe enum and struct-like syntax (originally presented as #[repr(union)] struct, though I've now presented it here with a new keyword instead).


If a union contains multiple fields of different sizes, assigning to a field
smaller than the entire union must not change the memory of the union outside
that field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't the memory of the rest simply become undef?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, what happens here:

untagged_union X {
    a: u8,
    b: u16,
}

let mut x = X { b: 1 };
x.a = 1;
let y = x;

Does the compiler have to copy the unused part?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying a union into some other variable must always copy the entire memory of the union, unless the compiler can prove that nothing reads from other fields of the destination, in which case it could potentially elide moving some data around.

For instance, if you pass y to an FFI function, Rust can't know what parts of the union you intend to read, so it needs to copy the whole thing. On the other hand, if you pass y to a Rust function, and rustc can see that the called function only reads y.a, never y.b, then rustc could potentially elide the copy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying a union into some other variable must always copy the entire memory of the union

Why? Simply make accessing any variant but the one that was written to last undefined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would break many valid usages. For instance, consider a union of a common_header struct and several structs that start with that header; writing to common_header should not invalidate the rest of the data. Ditto for many other common patterns used with unions.

Note that factoring the common header out of the union does not solve the problem. For instance, you might have different types of common headers used for subsets of other fields. And in general, moving fields into or out of a union could require platform-specific understanding of size and alignment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have some examples of such code to see how unions must behave.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial example:

struct S {
    header: COMMON_HEADER,
    otherfields: SOME_TYPE,
}

untagged_union U {
    header: COMMON_HEADER,
    s: S,
    // ...
}

Writing to u.header (or fields of u.header) should not invalidate u.s and in particular u.s.otherfields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean open source C code from well known projects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in my other reply, MSVC does support writing to one variant and reading from another, which means that writing to one variant does not invalidate the non-overlapping bytes of other variants. So regardless of what the C standard dictates, we'd have to support this case on Windows at the very least, and I'm sure other major C compilers behave similarly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mahkoh ACPICA includes a union with that exact pattern; see "union acpi_object" and ACPI_OBJECT_TYPE in https://github.com/acpica/acpica/blob/master/source/include/actypes.h .

@nagisa
Copy link
Member

nagisa commented Jan 5, 2016

I would rather have tuple structs with a #[repr(union)] instead. The only thing that needs to be figured out is initialisation then.

@retep998
Copy link
Member

retep998 commented Jan 5, 2016

I am desperate to have untagged unions in some form to make my life easier, and I support this RFC being accepted and implemented as soon as possible.

@mahkoh
Copy link
Contributor

mahkoh commented Jan 5, 2016

Since forget is safe, how about allowing Drop types but guaranteeing that the destructors are not run? Then the user can still use drop_in_place to manually drop the correct variant.

@joshtriplett
Copy link
Member Author

@mahkoh: I'd like to at least see a warning about it, absent some explicit indication of "meant to do that". I don't deeply care whether it uses a warning or an error; @sfackler suggested making it an error for now, so I did.

@joshtriplett
Copy link
Member Author

@nagisa For FFI purposes, we need named fields. A tuple version of untagged_union seems easy enough to support as well (using the same declaration syntax and field access syntax as a tuple struct), if you'd find that useful. However, I'd like to avoid expanding this RFC for a first pass unless there's a strong demand for "tuple unions"; perhaps a followup RFC after this one? The bigger this RFC gets and the more it adds, the more it takes to get accepted.

@nagisa
Copy link
Member

nagisa commented Jan 5, 2016

@joshtriplett that was more of a “instead of” comment. IMHO you should not expose these untagged unions to safe rust code anyway, so either way is fine, I guess, but I’m not sold by the named fields argument.

Also, for additional point (assumingly, a negative one), untagged unions is something @graydon said he was happy to see rust 1.0 to ship without.


All in all, my general opinion is that we should have some easier way to produce opaque untagged unions (essentially an opaque struct occupying as much space as necessary), perhaps implemented as a macro; but not the full blown way to do (and abuse) untagged unions.

@retep998
Copy link
Member

retep998 commented Jan 6, 2016

@nagisa Macro solutions are what I already use for unions, and they're a pain to write and use and ensure that they are correct.

@joshtriplett
Copy link
Member Author

@nagisa: I absolutely agree that we shouldn't expose untagged unions to safe Rust code; only unsafe code can do anything interesting with them.

I've seen the macros @retep998 refers to; we desperately need a portable, safe solution.

@ohAitch
Copy link

ohAitch commented Jan 6, 2016

The @graydon post definitely suggests unsafe union. Null pointers and array "overruns", after all, are also not difficult to achieve in unsafe code, being similarly necessary for C FFI.

@ohAitch
Copy link

ohAitch commented Jan 6, 2016

(The library I'm to interface with has essentially a tagged union { u31 direct; 0x80000000 | mpz_t* indirect }, where mpz_t is the GMP bigint struct. This RFC seems like the correct approach.)

@nikomatsakis
Copy link
Contributor

@nrc

It might be worth summarising some of the discussion from the internals thread - there is a lot of it and it's not easy to follow the threads of conversation.

I skimmed the conversation but don't feel I can really summarize it. Still, here are some posts at least that I found interesting:

I've not yet read the final RFC to see where it ended up though!

@retep998
Copy link
Member

retep998 commented Jan 6, 2016

Microsoft uses unions in two ways.

  1. A union inside a struct which has a tag field. In this case the user would only need to access the union from the variant that it was initialized with. There are no doubts about this having well specified behavior.
  2. A union that provides several representations of the same data. Such as having one variant be a UINT64 and the other variant a struct of four USHORTs. In this case, the user is expected to be able to write to one of the variants and read from the other variant. Even though the behavior of using a union like this isn't specified by some set of the C/C++ standards, it is specified by MSVC. Thus Rust would need to support it too.

@mahkoh
Copy link
Contributor

mahkoh commented Jan 6, 2016

Even though the behavior of using a union like this isn't specified by some set of the C/C++ standards

This is covered by footnote 95.

If the member used to read the contents of a union object is not the same as the member last used to
store a value in the object, the appropriate part of the object representation of the value is reinterpreted
as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type
punning’’). This might be a trap representation.

@joshtriplett
Copy link
Member Author

@mahkoh Thanks for looking up the exact reference in the C standard! That's exactly the behavior I'd expect in Rust as well.

@graydon
Copy link

graydon commented Jan 6, 2016

As I was mentioned here, some opinions (caveat: still not a core-team member, just opinions):

  • I'm fine with this as an unsafe FFI feature. I don't think anyone's proposing it as safe, so no problem.
  • I'd pretty strongly prefer doing it as an attribute like #[repr(union)] struct rather than a new keyword.
    • It is just a question of representation for a structure-with-fields. The representation is "the fields overlap". It's a weird fact of the representation, but memory changing underfoot is a fact of life in unsafe-land, for many reasons. Cf. multithreading, mmio.
    • Since they're inherently unsafe, the only function that ever sees them will be a wrapper that enforces safe access, and is very aware of their union-ness. Only the -sys code. They won't be drifting into safe library-user code. So I disagree that these will cause any user confusion
    • Yet more C-interop FFI tricks may yet wind up in repr(...); it's the right place for this. Cf. RFC 314, bitfields
    • I think the bar for new keywords is / should be extremely high, post-1.0
    • The question "what is the language versioning plan?" is interesting but unrelated since nobody asking for this feature would likely want to wait until 2.0, which might be years off. Meanwhile imo it's bad precedent to "allow new keywords, so long as they're suitably long/improbable/unwieldy".

@joshtriplett
Copy link
Member Author

@graydon I have no objection whatsoever to switching back to #[repr(union)] struct rather than a new keyword.

@graydon
Copy link

graydon commented Jan 7, 2016

Didn't assume you did 😄 Just. Hrm. Sigh. Re-reading my words. I'm sorry, I need to figure out some way of participating in this community that doesn't come down like a ton of bricks / overstate my case all the time. I tried rewriting that 4 times and even still it's too pushy. Bleh. I realize my words carry more weight than they should. I haven't contributed a line of code in over 2 years now!

@joshtriplett
Copy link
Member Author

@graydon: I didn't assume you assumed I did. :) I'm just trying to reiterate that I really don't care what color the declaration-syntax bikeshed is painted, just the syntax and semantics for usage.

Thus far, that declaration syntax seems like the bit that has produced the largest arguments.

I appreciated your comment, and I found it a compelling argument. The only thing making me hesitate to change the RFC on the basis of that argument is that I've also received one vehement complaint (via IRC) against any syntax that uses struct.

@joshtriplett
Copy link
Member Author

I do feel a bit uncomfortable having a type that can be written, but not instantiated

I don't think empty unions should act that way; I would suggest that union U {} has one value, U {}, just like the empty struct or empty tuple.

I think the way forward is to implement without empty unions and for a follow-up RFC to discuss them.

Agreed.

@canndrew
Copy link
Contributor

If we do allow them, I think it makes sense to treat them as having one possible value (U {}), rather than zero possible values.

Except U {} isn't a possible value. The RFC clearly states

A union instantiation uses the same syntax as a struct instantiation, except that it must specify exactly one field:

And U {} doesn't specify exactly one field. I think debating whether unions are more "enum-like" or "struct-like" is a red herring. Unions are union-like, if we want to know what the semantics are for a zero-element union just look at the semantics we've defined for an n-element union and set n to zero.

@canndrew
Copy link
Contributor

canndrew commented Aug 29, 2016

Specifically..

Instantiating

An n-element union can be instantiated in n different ways by specifying one of its n fields.

n = 2 n = 1 n = 0
union U2 { f0: T0, f1: T1, } union U1 { f0: T0, } union U0 {}
let u: U2 = U2 { f0: foo }
let u: U2 = U2 { f1: foo }
let u: U1 = U1 { f0: foo }

A zero-element union can be instantiated in zero different ways. It's statically impossible to create one.

Reading/Writing

An n-element union can be read/written in n different ways by accessing one of its n fields.

n = 2 n = 1 n = 0
union U2 { f0: T0, f1: T1, } union U1 { f0: T0, } union U0 {}
u.f0 = foo
u.f1 = foo
u.f0 = foo

A zero-element union can never be read/written.

Pattern matching

The RFC doesn't specify whether matching zero elements is allowed. All the examples show matching on a single element which means matching on a zero-element union is impossible. The unanswered questions section asks whether we should allow matching on a number of elements other than one. If so then U {} => ... would be allowed however this could only appear in dead code because U is impossible to create.

Representation

The RFC leaves representation open but lets assume a union can be thought of a chunk of memory with a size and alignment equal to the max size and max alignment of its elements.

n = 2 n = 1 n = 0
union U2 { f0: T0, f1: T1, } union U1 { f0: T0, } union U0 {}
max({T0, T1}) max({T0}) max({})

The max of the empty set is the identity of the max operation on two elements, ie. negative infinity. This is, conceptually, the size and alignment of uninhabited types.

Possible states

The number of possible states of a union is, at most, the sum of the number of possible states of it's elements. Of course some of these states may overlap, but this at least gives us an upper bound.

n = 2 n = 1 n = 0
union U2 { f0: T0, f1: T1, } union U1 { f0: T0, } union U0 {}
states(U2) = 0
+ states(T0)
+ states(T1)
states(U1) = 0
+ states(T0)
states(U0) = 0

A zero-element union can only be in one of at most zero possible states. Therefore it is uninhabited.

@glaebhoerl
Copy link
Contributor

I wish there were a reaction icon to indicate "nice comment" without necessarily implying "I agree".

@petrochenkov
Copy link
Contributor

petrochenkov commented Aug 29, 2016

@canndrew

#1444 (comment) is still strongly based on the "enum for which we don't know the discriminant" interpretation of union, it can be easily rewritten on the basis of "a struct with overlapping zero-offset fields" interpretation to "prove" that the opposite behavior is correct.

For example, the first statement: "An n-element union can be instantiated in n different ways by specifying one of its n fields." is incorrect in the "struct" interpretation in which the union can be instantiated by providing "sufficient number" of fields, which is 0 for empty unions.

@solson
Copy link
Member

solson commented Aug 29, 2016

So the two possible rules look like:

  1. unions are always initialized by providing exactly one field value
  2. unions are initialized by providing exactly one field value (or zero for the special case of an empty union)

The former rule seems far more elegant. I can see the argument that the struct interpretation would work, but it seems clearly uglier.

(Note that for the latter rule you cannot say something like "at most one field value", because for non-empty unions you must provide a field. Well, you can say "sufficient number" like @petrochenkov, but then you have to define what that means, which is what my parenthetical in the rule is doing.)

@nrc
Copy link
Member

nrc commented Aug 29, 2016

@joshtriplett @canndrew Perhaps one of you could open an issue to discuss empty unions? I fear that discussion on an already merged PR will be ignored or lost (I only noticed my ping because I was clearing out an email folder).

@joshtriplett
Copy link
Member Author

@nrc I don't feel strongly about it one way or another, and I don't have a use case for empty unions. However, I'd be happy to review an RFC or issue about this, or to discuss it further with anyone who does have a use case.

@canndrew
Copy link
Contributor

canndrew commented Sep 5, 2016

I was looking at the is_uninhabited method in librustc/ty/sty.rs and found this line. That's kinda wrong. It'll work for now but once that FIXME(24885) gets fixed (which is something I went there to do) it'll cause things to break because a union with a single uninhabited member type will be seen as uninhabited.

The real problem here though is that unions are implemented using AdtDefData. AdtDefData is for defining algebraic data types and unions are in no way, shape or form an algebraic data type.

@eddyb
Copy link
Member

eddyb commented Sep 5, 2016

@canndrew That will never be true for unions though, because .is_empty() is .variants.is_empty(), and unions always have one variant.
Sure, using AdtDefData for unions may seem wrong, but anything else would be anti-DRY in the compiler.
I'd rather rename AdtDef to DataDef or something instead of moving unions off of it.

@canndrew
Copy link
Contributor

canndrew commented Sep 5, 2016

That will never be true for unions though, because .is_empty() is .variants.is_empty(), and unions always have one variant.

Until the definition of .is_empty() changes to variants.iter().all(|v| v.is_empty()) so that it can detect that this struct is empty:

struct EmptyStruct {
    x: !,
    y: u32,
}

Then this union would get mis-detected as empty:

union NonEmptyUnion {
    x: !,
    y: u32,
}

And yes, we could check the IS_UNION flag in is_empty, but it seems bad for the compiler to be reusing the same datatype for both structs and unions in the first place. Structs and unions are only superficially similar in terms of syntax, the underlying types have completely different semantics.

@eddyb
Copy link
Member

eddyb commented Sep 5, 2016

@canndrew DRY prevails, so checking the AdtKind is the correct solution.
Also, it should probably be is_uninhabited instead of is_empty, maybe make it a Ty flag?
Come to think of it, you also have to handle cycles, not just union, so that'll be the bulk of the complexity.

@nox
Copy link
Contributor

nox commented Sep 11, 2016

Btw the RFC says that the feature is union, but rustc says untagged_unions.

@joshtriplett
Copy link
Member Author

@nox

Btw the RFC says that the feature is union, but rustc says untagged_unions.

True. It'd be nice to either fix rustc to use the feature name union, or fix the RFC. I'd prefer the former, since it's still hopefully early enough.

@petrochenkov
Copy link
Contributor

petrochenkov commented Sep 16, 2016

@joshtriplett
I chose the feature name in rustc to follow naming conventions + it's temporary and not really important + I wouldn't want to break people's code for nothing. I'd suggest to fix the RFC and forget about it (or simply forget about it).

@petrochenkov
Copy link
Contributor

Oh, and I changed the lint name to match conventions too, this is more important.
#[allow(union_field_drop)] => #[allow(unions_with_drop_fields)]

@joshtriplett
Copy link
Member Author

That seems fine. I do think we should update the RFC with those two changes. Do you want to write that patch or should I?

@petrochenkov
Copy link
Contributor

@joshtriplett
It's Friday evening here and I don't want to write any patches 😄

@petrochenkov petrochenkov mentioned this pull request Feb 23, 2017
KamilaBorowska added a commit to KamilaBorowska/rfcs that referenced this pull request Oct 18, 2017
petrochenkov added a commit that referenced this pull request Oct 18, 2017
Correct pull request URL in RFC #1444
petertodd added a commit to petertodd/rfcs that referenced this pull request Nov 1, 2017
@Centril Centril added A-syntax Syntax related proposals & ideas A-typesystem Type system related proposals & ideas A-machine Proposals relating to Rust's abstract machine. A-unions Proposals relating to union items labels Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-machine Proposals relating to Rust's abstract machine. A-syntax Syntax related proposals & ideas A-typesystem Type system related proposals & ideas A-unions Proposals relating to union items final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-lang Relevant to the language team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.