-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unions #1444
unions #1444
Changes from 1 commit
4126461
b2d8ca0
c826e67
2e5ef18
9b4b8af
17feb14
c123950
98a5809
c60da1f
8026532
7f40a6b
b2e0309
3f45683
7727fb3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,325 @@ | ||
- Feature Name: `untagged_union` | ||
- Start Date: 2015-12-29 | ||
- RFC PR: (leave this empty) | ||
- Rust Issue: (leave this empty) | ||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
Provide native support for C-compatible unions, defined via a new keyword | ||
`untagged_union`. | ||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
Many FFI interfaces include unions. Rust does not currently have any native | ||
representation for unions, so users of these FFI interfaces must define | ||
multiple structs and transmute between them via `std::mem::transmute`. The | ||
resulting FFI code must carefully understand platform-specific size and | ||
alignment requirements for structure fields. Such code has little in common | ||
with how a C client would invoke the same interfaces. | ||
|
||
Introducing native syntax for unions makes many FFI interfaces much simpler and | ||
less error-prone to write, simplifying the creation of bindings to native | ||
libraries, and enriching the Rust/Cargo ecosystem. | ||
|
||
A native union mechanism would also simplify Rust implementations of | ||
space-efficient or cache-efficient structures relying on value representation, | ||
such as machine-word-sized unions using the least-significant bits of aligned | ||
pointers to distinguish cases. | ||
|
||
The syntax proposed here avoids reserving `union` as the new keyword, as | ||
existing Rust code already uses `union` for other purposes, including [multiple | ||
functions in the standard | ||
library](https://doc.rust-lang.org/std/?search=union). | ||
|
||
To preserve memory safety, accesses to union fields may only occur in `unsafe` | ||
code. Commonly, code using unions will provide safe wrappers around unsafe | ||
union field accesses. | ||
|
||
# Detailed design | ||
[design]: #detailed-design | ||
|
||
## Declaring a union type | ||
|
||
A union declaration uses the same field declaration syntax as a `struct` | ||
declaration, except with the keyword `untagged_union` in place of `struct`: | ||
|
||
```rust | ||
untagged_union MyUnion { | ||
f1: u32, | ||
f2: f32, | ||
} | ||
``` | ||
|
||
`untagged_union` implies `#[repr(C)]` as the default representation, making | ||
`#[repr(C)] untagged_union` permissible but redundant. | ||
|
||
## Instantiating a union | ||
|
||
A union instantiation uses the same syntax as a struct instantiation, except | ||
that it must specify exactly one field: | ||
|
||
```rust | ||
let u = MyUnion { f1: 1 }; | ||
``` | ||
|
||
Specifying multiple fields in a union instantiation results in a compiler | ||
error. | ||
|
||
Safe code may instantiate a union, as no unsafe behavior can occur until | ||
accessing a field of the union. Code that wishes to maintain invariants about | ||
the union fields should make the union fields private and provide public | ||
functions that maintain the invariants. | ||
|
||
## Reading fields | ||
|
||
Unsafe code may read from union fields, using the same dotted syntax as a | ||
struct: | ||
|
||
```rust | ||
fn f(u: MyUnion) -> f32 { | ||
unsafe { u.f2 } | ||
} | ||
``` | ||
|
||
## Writing fields | ||
|
||
Unsafe code may write to fields in a mutable union, using the same syntax as a | ||
struct: | ||
|
||
```rust | ||
fn f(u: &mut MyUnion) { | ||
unsafe { | ||
u.f1 = 2; | ||
} | ||
} | ||
``` | ||
|
||
If a union contains multiple fields of different sizes, assigning to a field | ||
smaller than the entire union must not change the memory of the union outside | ||
that field. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why doesn't the memory of the rest simply become undef? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In particular, what happens here: untagged_union X {
a: u8,
b: u16,
}
let mut x = X { b: 1 };
x.a = 1;
let y = x; Does the compiler have to copy the unused part? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Copying a union into some other variable must always copy the entire memory of the union, unless the compiler can prove that nothing reads from other fields of the destination, in which case it could potentially elide moving some data around. For instance, if you pass There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Why? Simply make accessing any variant but the one that was written to last undefined. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That would break many valid usages. For instance, consider a union of a Note that factoring the common header out of the union does not solve the problem. For instance, you might have different types of common headers used for subsets of other fields. And in general, moving fields into or out of a union could require platform-specific understanding of size and alignment. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be good to have some examples of such code to see how unions must behave. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Trivial example: struct S {
header: COMMON_HEADER,
otherfields: SOME_TYPE,
}
untagged_union U {
header: COMMON_HEADER,
s: S,
// ...
} Writing to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean open source C code from well known projects. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I mentioned in my other reply, MSVC does support writing to one variant and reading from another, which means that writing to one variant does not invalidate the non-overlapping bytes of other variants. So regardless of what the C standard dictates, we'd have to support this case on Windows at the very least, and I'm sure other major C compilers behave similarly. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mahkoh ACPICA includes a union with that exact pattern; see "union acpi_object" and ACPI_OBJECT_TYPE in https://github.com/acpica/acpica/blob/master/source/include/actypes.h . |
||
|
||
## Pattern matching | ||
|
||
Unsafe code may pattern match on union fields, using the same syntax as a | ||
struct, without the requirement to mention every field of the union in a match | ||
or use `..`: | ||
|
||
```rust | ||
fn f(u: MyUnion) { | ||
unsafe { | ||
match u { | ||
MyUnion { f1: 10 } => { println!("ten"); } | ||
MyUnion { f2 } => { println!("{}", f2); } | ||
} | ||
} | ||
} | ||
``` | ||
|
||
Matching a specific value from a union field makes a refutable pattern; naming | ||
a union field without matching a specific value makes an irrefutable pattern. | ||
Both require unsafe code. | ||
|
||
Pattern matching may match a union as a field of a larger structure. In | ||
particular, when using an `untagged_union` to implement a C tagged union via | ||
FFI, this allows matching on the tag and the corresponding field | ||
simultaneously: | ||
|
||
```rust | ||
#[repr(u32)] | ||
enum Tag { I, F } | ||
|
||
untagged_union U { | ||
i: i32, | ||
f: f32, | ||
} | ||
|
||
#[repr(C)] | ||
struct Value { | ||
tag: Tag, | ||
u: U, | ||
} | ||
|
||
fn is_zero(v: Value) -> bool { | ||
unsafe { | ||
match v { | ||
Value { tag: I, u: U { i: 0 } } => true, | ||
Value { tag: F, u: U { f: 0.0 } } => true, | ||
_ => false, | ||
} | ||
} | ||
} | ||
``` | ||
|
||
Note that a pattern match on a union field that has a smaller size than the | ||
entire union must not make any assumptions about the value of the union's | ||
memory outside that field. | ||
|
||
## Borrowing union fields | ||
|
||
Unsafe code may borrow a reference to a field of a union; doing so borrows the | ||
entire union, such that any borrow conflicting with a borrow of the union | ||
(including a borrow of another union field or a borrow of a structure | ||
containing the union) will produce an error. | ||
|
||
```rust | ||
untagged_union U { | ||
f1: u32, | ||
f2: f32, | ||
} | ||
|
||
#[test] | ||
fn test() { | ||
let mut u = U { f1: 1 }; | ||
unsafe { | ||
let b1 = &mut u.f1; | ||
// let b2 = &mut u.f2; // This would produce an error | ||
*b1 = 5; | ||
} | ||
unsafe { | ||
assert_eq!(u.f1, 5); | ||
} | ||
} | ||
``` | ||
|
||
Simultaneous borrows of multiple fields of a struct contained within a union do | ||
not conflict: | ||
|
||
```rust | ||
struct S { | ||
x: u32, | ||
y: u32, | ||
} | ||
|
||
untagged_union U { | ||
s: S, | ||
both: u64, | ||
} | ||
|
||
#[test] | ||
fn test() { | ||
let mut u = U { s: S { x: 1, y: 2 } }; | ||
unsafe { | ||
let bx = &mut u.s.x; | ||
// let bboth = &mut u.both; // This would fail | ||
let by = &mut u.s.y; | ||
*bx = 5; | ||
*by = 10; | ||
} | ||
unsafe { | ||
assert_eq!(u.s.x, 5); | ||
assert_eq!(u.s.y, 10); | ||
} | ||
} | ||
``` | ||
|
||
## Union and field visibility | ||
|
||
The `pub` keyword works on the union and on its fields, as with a struct. The | ||
union and its fields default to private. Using a private field in a union | ||
instantiation, field access, or pattern match produces an error. | ||
|
||
## Uninitialized unions | ||
|
||
The compiler should consider a union uninitialized if declared without an | ||
initializer. However, providing a field during instantiation, or assigning to | ||
a field, should cause the compiler to treat the entire union as initialized. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That doesn't seem to be very efficient. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you mean? This shouldn't matter one way or another for efficiency; I wrote this to clarify under what circumstances the compiler should give an error about accessing uninitialized data. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If that's what you mean by initialized the that's fine. But since many fields might not be initialized, why do you even require that one field is initialized before the union can be accessed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because it seems straightforward for the compiler to detect, and unambiguously an error. It makes sense to write one field and then pass the union to some function expecting to read that field; it never makes sense to read a field from a newly declared union that you've never written to or initialized at all. Given your comment, I should update the RFC to clarify this paragraph to specifically reference compiler errors about accessing uninitialized variables. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It makes sense to pass a reference to a union to another function which then fills it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mahkoh The same is true of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. () doesn't allow one to access uninitialized bytes. unions do. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see that as a compelling argument. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right now, you can declare a mutable structure without any initializer, write to all of the fields, and then use the structure; you only get an error if you don't write to a field. Given the definition of a union, it seems completely equivalent to say that you can declare a mutable union without any initializer, write to one of its fields, and then use the union. That said, if this proves a sticking point, I don't think dropping it would make uses of unions significantly more onerous. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be true: http://is.gd/PGiVd6 I agree that dropping it doesn't make unions much worse. |
||
|
||
## Unions and traits | ||
|
||
A union may have trait implementations, using the same syntax as a struct. | ||
|
||
The compiler should warn if a union field has a type that implements the `Drop` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this be a warning or an error? I assume that the destructor of the field would not run when the union is dropped, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer to make it an error, yes. However, Rust does not consider leaks or failing to run a destructor unsafe behavior, per the discussion that occurred around scoped threads. See the documentation of So, I assumed that people would object to making this an error. If not, then I can quite happily change this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My preference would be to forbid Drop types for now. We can always change it to allow them later if there turn out to be compelling use cases. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||
trait. | ||
|
||
## Unions and undefined behavior | ||
|
||
Rust code must not use unions to invoke [undefined | ||
behavior](https://doc.rust-lang.org/nightly/reference.html#behavior-considered-undefined). | ||
In particular, Rust code must not use unions to break the pointer aliasing | ||
rules with raw pointers, or access a field containing a primitive type with an | ||
invalid value. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Undef propagation seems to be the bigger problem because it can actually happen with the primitive types usually used in FFI unions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you mean the "Reads of undef (uninitialized) memory" item, yes, agreed; your unsafe code should avoid that. (Rust already has unsafe functions that would make it possible to read uninitialized memory.) |
||
|
||
## Union size and alignment | ||
|
||
A union must have the same size and alignment as an equivalent C union | ||
declaration for the target platform. Typically, a union would have the maximum | ||
size of any of its fields, and the maximum alignment of any of its fields. | ||
Note that those maximums may come from different fields; for instance: | ||
|
||
```rust | ||
untagged_union U { | ||
f1: u16, | ||
f2: [u8; 4], | ||
} | ||
|
||
#[test] | ||
fn test() { | ||
assert_eq!(std::mem::size_of<U>(), 4); | ||
assert_eq!(std::mem::align_of<U>(), 2); | ||
} | ||
``` | ||
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
Adding a new type of data structure would increase the complexity of the | ||
language and the compiler implementation, albeit marginally. However, this | ||
change seems likely to provide a net reduction in the quantity and complexity | ||
of unsafe code. | ||
|
||
# Alternatives | ||
[alternatives]: #alternatives | ||
|
||
- Don't do anything, and leave users of FFI interfaces with unions to continue | ||
writing complex platform-specific transmute code. | ||
- Create macros to define unions and access their fields. However, such macros | ||
make field accesses and pattern matching look more cumbersome and less | ||
structure-like. The implementation and use of such macros provides strong | ||
motivation to seek a better solution, and indeed existing writers and users | ||
of such macros have specifically requested native syntax in Rust. | ||
- Define unions without a new keyword `untagged_union`, such as via | ||
`#[repr(union)] struct`. This would avoid any possibility of breaking | ||
existing code that uses the keyword, but would make declarations more | ||
verbose, and introduce potential confusion with `struct` (or whatever | ||
existing construct the `#[repr(union)]` attribute modifies). | ||
- Use a compound keyword like `unsafe union`, while not reserving `union` on | ||
its own as a keyword, to avoid breaking use of `union` as an identifier. | ||
Potentially more appealing syntax, if the Rust parser can support it. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not just parse it depending on the context? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Interpreting "union" as both a keyword and as an identifier seems rather challenging to support, and potentially fragile for future parser changes. |
||
- Use a new operator to access union fields, rather than the same `.` operator | ||
used for struct fields. This would make union fields more obvious at the | ||
time of access, rather than making them look syntactically identical to | ||
struct fields despite the semantic difference in storage representation. | ||
- The [unsafe enum](https://github.com/rust-lang/rfcs/pull/724) proposal: | ||
introduce untagged enums, identified with `unsafe enum`. Pattern-matching | ||
syntax would make field accesses significantly more verbose than structure | ||
field syntax. | ||
- The [unsafe enum](https://github.com/rust-lang/rfcs/pull/724) proposal with | ||
the addition of struct-like field access syntax. The resulting field access | ||
syntax would look much like this proposal; however, pairing an enum-style | ||
definition with struct-style usage seems confusing for developers. An | ||
enum-based declaration leads users to expect enum-like syntax; a new | ||
construct distinct from both enum and struct does not lead to such | ||
expectations, and developers used to C unions will expect struct-like field | ||
access for unions. | ||
|
||
# Unresolved questions | ||
[unresolved]: #unresolved-questions | ||
|
||
Can the borrow checker support the rule that "simultaneous borrows of multiple | ||
fields of a struct contained within a union do not conflict"? If not, omitting | ||
that rule would only marginally increase the verbosity of such code, by | ||
requiring an explicit borrow of the entire struct first. | ||
|
||
Can a pattern match match multiple fields of a union at once? For rationale, | ||
consider a union using the low bits of an aligned pointer as a tag; a pattern | ||
match may match the tag using one field and a value identified by that tag | ||
using another field. However, if this complicates the implementation, omitting | ||
it would not significantly complicate code using unions. | ||
|
||
C APIs using unions often also make use of anonymous unions and anonymous | ||
structs. For instance, a union may contain anonymous structs to define | ||
non-overlapping fields, and a struct may contain an anonymous union to define | ||
overlapping fields. This RFC does not define anonymous unions or structs, but | ||
a subsequent RFC may wish to do so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1
Let's be careful,
#[repr(C)]
should be explicit, at least as a future-proofing measure.Inserting
#[repr(C)]
is not a huge burden, especially if FFI-bindings are autogenerated.If non-
#[repr(C)]
unions are not supported yet, they should not compile.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am in favor of requiring
#[repr(C)]
if you want the layout to be well specified.