unions #1444

Merged on Apr 8, 2016 (14 commits)
325 changes: 325 additions & 0 deletions text/0000-untagged_union.md
@@ -0,0 +1,325 @@
- Feature Name: `untagged_union`
- Start Date: 2015-12-29
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary
[summary]: #summary

Provide native support for C-compatible unions, defined via a new keyword
`untagged_union`.

# Motivation
[motivation]: #motivation

Many FFI interfaces include unions. Rust does not currently have any native
representation for unions, so users of these FFI interfaces must define
multiple structs and transmute between them via `std::mem::transmute`. The
resulting FFI code must carefully understand platform-specific size and
alignment requirements for structure fields. Such code has little in common
with how a C client would invoke the same interfaces.

Introducing native syntax for unions makes many FFI interfaces much simpler and
less error-prone to write, simplifying the creation of bindings to native
libraries, and enriching the Rust/Cargo ecosystem.

A native union mechanism would also simplify Rust implementations of
space-efficient or cache-efficient structures relying on value representation,
such as machine-word-sized unions using the least-significant bits of aligned
pointers to distinguish cases.

The syntax proposed here avoids reserving `union` as the new keyword, as
existing Rust code already uses `union` for other purposes, including [multiple
functions in the standard
library](https://doc.rust-lang.org/std/?search=union).

To preserve memory safety, accesses to union fields may only occur in `unsafe`
code. Commonly, code using unions will provide safe wrappers around unsafe
union field accesses.

# Detailed design
[design]: #detailed-design

## Declaring a union type

A union declaration uses the same field declaration syntax as a `struct`
declaration, except with the keyword `untagged_union` in place of `struct`:

```rust
untagged_union MyUnion {
    f1: u32,
    f2: f32,
}
```

`untagged_union` implies `#[repr(C)]` as the default representation, making

Contributor

-1
Let's be careful, #[repr(C)] should be explicit, at least as a future-proofing measure.
Inserting #[repr(C)] is not a huge burden, especially if FFI-bindings are autogenerated.
If non-#[repr(C)] unions are not supported yet, they should not compile.

Member

I am in favor of requiring #[repr(C)] if you want the layout to be well specified.

`#[repr(C)] untagged_union` permissible but redundant.

## Instantiating a union

A union instantiation uses the same syntax as a struct instantiation, except
that it must specify exactly one field:

```rust
let u = MyUnion { f1: 1 };
```

Specifying multiple fields in a union instantiation results in a compiler
error.

Safe code may instantiate a union, as no unsafe behavior can occur until
accessing a field of the union. Code that wishes to maintain invariants about
the union fields should make the union fields private and provide public
functions that maintain the invariants.
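
As a rough sketch of that advice, the module below keeps the union's fields private and exposes only functions that keep a tag and the union in sync. It is written with the `union` keyword that current Rust compilers accept rather than the `untagged_union` spelling proposed here, so that it compiles today; the names `number`, `IntOrFloat`, `Number`, `from_int`, `from_float`, and `as_f64` are hypothetical and chosen only for illustration.

```rust
mod number {
    /// Hypothetical union. Its fields are private, so code outside this
    /// module can neither instantiate it nor read its fields directly.
    #[repr(C)]
    pub union IntOrFloat {
        i: i32,
        f: f32,
    }

    /// Hypothetical wrapper pairing the union with a flag that records
    /// which field was written.
    pub struct Number {
        is_float: bool,
        data: IntOrFloat,
    }

    impl Number {
        pub fn from_int(i: i32) -> Number {
            // Instantiation names exactly one field and needs no `unsafe`.
            Number { is_float: false, data: IntOrFloat { i } }
        }

        pub fn from_float(f: f32) -> Number {
            Number { is_float: true, data: IntOrFloat { f } }
        }

        pub fn as_f64(&self) -> f64 {
            // The invariant: `is_float` always names the field that the
            // constructors wrote, and nothing outside this module can
            // change `data`, so each read below sees an initialized field.
            unsafe {
                if self.is_float {
                    f64::from(self.data.f)
                } else {
                    f64::from(self.data.i)
                }
            }
        }
    }
}
```

Callers use only `number::Number::from_int`, `from_float`, and `as_f64`, none of which require `unsafe` at the call site.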

## Reading fields

Unsafe code may read from union fields, using the same dotted syntax as a
struct:

```rust
fn f(u: MyUnion) -> f32 {
    unsafe { u.f2 }
}
```
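
A read does not have to name the field that was last written. The sketch below, which assumes the `union` keyword accepted by current Rust compilers instead of `untagged_union`, writes `f1` and reads `f2`, reinterpreting the same four bytes. The helper name `bits_as_float` is hypothetical.

```rust
#[repr(C)]
union MyUnion {
    f1: u32,
    f2: f32,
}

/// Hypothetical helper: store an integer bit pattern, read it back as a float.
fn bits_as_float(bits: u32) -> f32 {
    let u = MyUnion { f1: bits };
    // Both fields are four bytes wide and every `u32` bit pattern is a
    // valid `f32`, so this read cannot produce an invalid value.
    unsafe { u.f2 }
}
```

For these two types the result should agree with the standard library's `f32::from_bits`; for types with invalid bit patterns the same read would need a validity check first.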

## Writing fields

Unsafe code may write to fields in a mutable union, using the same syntax as a
struct:

```rust
fn f(u: &mut MyUnion) {
    unsafe {
        u.f1 = 2;
    }
}
```

If a union contains multiple fields of different sizes, assigning to a field
smaller than the entire union must not change the memory of the union outside
that field.
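
A minimal sketch of this rule, assuming the `union` keyword accepted by current Rust compilers in place of `untagged_union`. The union `X` and the function `overwrite_first_byte` are hypothetical; a byte array is used for the larger field so that the outcome does not depend on endianness.

```rust
#[repr(C)]
union X {
    a: u8,
    bytes: [u8; 2],
}

/// Hypothetical helper: initialize both bytes, overwrite only the first
/// through the smaller field, then read both bytes back.
fn overwrite_first_byte() -> [u8; 2] {
    let mut x = X { bytes: [1, 2] };
    unsafe {
        // The write covers one byte; the second byte must be left alone.
        x.a = 9;
        x.bytes
    }
}
```

Under the rule above, the second byte still holds the value it was initialized with.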

Contributor

Why doesn't the memory of the rest simply become undef?

Contributor

In particular, what happens here:

untagged_union X {
    a: u8,
    b: u16,
}

let mut x = X { b: 1 };
x.a = 1;
let y = x;

Does the compiler have to copy the unused part?

Member Author

Copying a union into some other variable must always copy the entire memory of the union, unless the compiler can prove that nothing reads from other fields of the destination, in which case it could potentially elide moving some data around.

For instance, if you pass y to an FFI function, Rust can't know what parts of the union you intend to read, so it needs to copy the whole thing. On the other hand, if you pass y to a Rust function, and rustc can see that the called function only reads y.a, never y.b, then rustc could potentially elide the copy.

Contributor

> Copying a union into some other variable must always copy the entire memory of the union

Why? Simply make accessing any variant but the one that was written to last undefined.

Member Author

That would break many valid usages. For instance, consider a union of a common_header struct and several structs that start with that header; writing to common_header should not invalidate the rest of the data. Ditto for many other common patterns used with unions.

Note that factoring the common header out of the union does not solve the problem. For instance, you might have different types of common headers used for subsets of other fields. And in general, moving fields into or out of a union could require platform-specific understanding of size and alignment.

Contributor

It would be good to have some examples of such code to see how unions must behave.

Member Author

Trivial example:

struct S {
    header: COMMON_HEADER,
    otherfields: SOME_TYPE,
}

untagged_union U {
    header: COMMON_HEADER,
    s: S,
    // ...
}

Writing to u.header (or fields of u.header) should not invalidate u.s and in particular u.s.otherfields.

Contributor

I mean open source C code from well known projects.

Member

As I mentioned in my other reply, MSVC does support writing to one variant and reading from another, which means that writing to one variant does not invalidate the non-overlapping bytes of other variants. So regardless of what the C standard dictates, we'd have to support this case on Windows at the very least, and I'm sure other major C compilers behave similarly.

Member Author

@mahkoh ACPICA includes a union with that exact pattern; see "union acpi_object" and ACPI_OBJECT_TYPE in https://github.com/acpica/acpica/blob/master/source/include/actypes.h .


## Pattern matching

Unsafe code may pattern match on union fields, using the same syntax as a
struct, without the requirement to mention every field of the union in a match
or use `..`:

```rust
fn f(u: MyUnion) {
    unsafe {
        match u {
            MyUnion { f1: 10 } => { println!("ten"); }
            MyUnion { f2 } => { println!("{}", f2); }
        }
    }
}
```

Matching a specific value from a union field makes a refutable pattern; naming
a union field without matching a specific value makes an irrefutable pattern.
Both require unsafe code.
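
The distinction can be sketched as follows, assuming the `union` keyword accepted by current Rust compilers in place of `untagged_union`. The function names `is_ten` and `raw_bits` are hypothetical. Because a pattern that only names a field is irrefutable, it can appear in a plain `let`.

```rust
#[repr(C)]
union MyUnion {
    f1: u32,
    f2: f32,
}

/// Refutable: the pattern names a specific value, so other arms are needed.
fn is_ten(u: MyUnion) -> bool {
    unsafe {
        match u {
            MyUnion { f1: 10 } => true,
            _ => false,
        }
    }
}

/// Irrefutable: the pattern only names the field, so it always matches
/// and can be used directly in `let`.
fn raw_bits(u: MyUnion) -> u32 {
    unsafe {
        let MyUnion { f1 } = u;
        f1
    }
}
```

Both functions still need an `unsafe` block, since either pattern reads a union field.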

Pattern matching may match a union as a field of a larger structure. In
particular, when using an `untagged_union` to implement a C tagged union via
FFI, this allows matching on the tag and the corresponding field
simultaneously:

```rust
#[repr(u32)]
enum Tag { I, F }

untagged_union U {
    i: i32,
    f: f32,
}

#[repr(C)]
struct Value {
    tag: Tag,
    u: U,
}

fn is_zero(v: Value) -> bool {
    unsafe {
        match v {
            // Variants are path-qualified; a bare `I` or `F` in a pattern
            // would introduce a new binding that matches any tag.
            Value { tag: Tag::I, u: U { i: 0 } } => true,
            Value { tag: Tag::F, u: U { f: 0.0 } } => true,
            _ => false,
        }
    }
}
```

Note that a pattern match on a union field that has a smaller size than the
entire union must not make any assumptions about the value of the union's
memory outside that field.

## Borrowing union fields

Unsafe code may borrow a reference to a field of a union; doing so borrows the
entire union, such that any borrow conflicting with a borrow of the union
(including a borrow of another union field or a borrow of a structure
containing the union) will produce an error.

```rust
untagged_union U {
    f1: u32,
    f2: f32,
}

#[test]
fn test() {
    let mut u = U { f1: 1 };
    unsafe {
        let b1 = &mut u.f1;
        // let b2 = &mut u.f2; // This would produce an error
        *b1 = 5;
    }
    unsafe {
        assert_eq!(u.f1, 5);
    }
}
```

Simultaneous borrows of multiple fields of a struct contained within a union do
not conflict:

```rust
struct S {
    x: u32,
    y: u32,
}

untagged_union U {
    s: S,
    both: u64,
}

#[test]
fn test() {
    let mut u = U { s: S { x: 1, y: 2 } };
    unsafe {
        let bx = &mut u.s.x;
        // let bboth = &mut u.both; // This would fail
        let by = &mut u.s.y;
        *bx = 5;
        *by = 10;
    }
    unsafe {
        assert_eq!(u.s.x, 5);
        assert_eq!(u.s.y, 10);
    }
}
```

## Union and field visibility

The `pub` keyword works on the union and on its fields, as with a struct. The
union and its fields default to private. Using a private field in a union
instantiation, field access, or pattern match produces an error.

## Uninitialized unions

The compiler should consider a union uninitialized if declared without an
initializer. However, providing a field during instantiation, or assigning to
a field, should cause the compiler to treat the entire union as initialized.

Contributor

That doesn't seem to be very efficient.

Member Author

What do you mean? This shouldn't matter one way or another for efficiency; I wrote this to clarify under what circumstances the compiler should give an error about accessing uninitialized data.

Contributor

If that's what you mean by initialized then that's fine. But since many fields might not be initialized, why do you even require that one field is initialized before the union can be accessed?

Member Author

Because it seems straightforward for the compiler to detect, and unambiguously an error. It makes sense to write one field and then pass the union to some function expecting to read that field; it never makes sense to read a field from a newly declared union that you've never written to or initialized at all.

Given your comment, I should update the RFC to clarify this paragraph to specifically reference compiler errors about accessing uninitialized variables.

Contributor

> it never makes sense to read a field from a newly declared union that you've never written to or initialized at all.

It makes sense to pass a reference to a union to another function which then fills it.

Member

@mahkoh The same is true of `()`, but we don't allow `let x: (); f(&mut x);`. We shouldn't start allowing stuff like this now without a strong reason.

Contributor

() doesn't allow one to access uninitialized bytes. unions do.

Member

I don't see that as a compelling argument.

Member Author

Right now, you can declare a mutable structure without any initializer, write to all of the fields, and then use the structure; you only get an error if you don't write to a field. Given the definition of a union, it seems completely equivalent to say that you can declare a mutable union without any initializer, write to one of its fields, and then use the union.

That said, if this proves a sticking point, I don't think dropping it would make uses of unions significantly more onerous.

Member

> Right now, you can declare a mutable structure without any initializer, write to all of the fields, and then use the structure; you only get an error if you don't write to a field.

This doesn't seem to be true: http://is.gd/PGiVd6

I agree that dropping it doesn't make unions much worse.


## Unions and traits

A union may have trait implementations, using the same syntax as a struct.
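
As an illustrative sketch, the `impl` blocks below look exactly as they would for a struct. The example assumes the `union` keyword accepted by current Rust compilers in place of `untagged_union`; the choice of `Default` and `PartialEq`, and the decision to compare raw bits, are assumptions made for the example rather than anything this RFC prescribes.

```rust
#[repr(C)]
union MyUnion {
    f1: u32,
    f2: f32,
}

impl Default for MyUnion {
    fn default() -> Self {
        // Instantiation is safe; the all-zero bit pattern is valid for both fields.
        MyUnion { f1: 0 }
    }
}

impl PartialEq for MyUnion {
    fn eq(&self, other: &Self) -> bool {
        // Both fields are four bytes wide, so comparing `f1` compares
        // every byte of the union. Reading the field requires `unsafe`.
        unsafe { self.f1 == other.f1 }
    }
}
```

Bitwise comparison means that a value written through `f2` as `0.0` compares equal to one written through `f1` as `0`.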

The compiler should warn if a union field has a type that implements the `Drop`

Member

Should this be a warning or an error? I assume that the destructor of the field would not run when the union is dropped, right?

Member Author

I would prefer to make it an error, yes. However, Rust does not consider leaks or failing to run a destructor unsafe behavior, per the discussion that occurred around scoped threads. See the documentation of std::mem::forget.

So, I assumed that people would object to making this an error. If not, then I can quite happily change this.

Member

My preference would be to forbid Drop types for now. We can always change it to allow them later if there turn out to be compelling use cases.

Member Author

Done.

trait.

## Unions and undefined behavior

Rust code must not use unions to invoke [undefined
behavior](https://doc.rust-lang.org/nightly/reference.html#behavior-considered-undefined).
In particular, Rust code must not use unions to break the pointer aliasing
rules with raw pointers, or access a field containing a primitive type with an
invalid value.
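
One way to respect the invalid-value rule is to read through a field whose type accepts every bit pattern and validate before interpreting the data as a more restrictive type. The sketch below assumes the `union` keyword accepted by current Rust compilers in place of `untagged_union`; the union `Flag` and the function `read_flag` are hypothetical.

```rust
#[repr(C)]
union Flag {
    b: bool,
    raw: u8,
}

/// Hypothetical helper: interpret the byte as a `bool` only when it holds
/// one of the two bit patterns that are valid for `bool`.
fn read_flag(u: &Flag) -> Option<bool> {
    // Every bit pattern is a valid `u8`, so reading `raw` is always fine.
    match unsafe { u.raw } {
        0 => Some(false),
        1 => Some(true),
        // Reading `u.b` here would produce an invalid `bool`, which is
        // undefined behavior, so the field is never touched.
        _ => None,
    }
}
```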

Contributor

Undef propagation seems to be the bigger problem because it can actually happen with the primitive types usually used in FFI unions.

Member Author

If you mean the "Reads of undef (uninitialized) memory" item, yes, agreed; your unsafe code should avoid that. (Rust already has unsafe functions that would make it possible to read uninitialized memory.)


## Union size and alignment

A union must have the same size and alignment as an equivalent C union
declaration for the target platform. Typically, a union would have the maximum
size of any of its fields, and the maximum alignment of any of its fields.
Note that those maximums may come from different fields; for instance:

```rust
untagged_union U {
    f1: u16,
    f2: [u8; 4],
}

#[test]
fn test() {
    // Generic arguments on a function call need the `::<T>` turbofish form.
    assert_eq!(std::mem::size_of::<U>(), 4);
    assert_eq!(std::mem::align_of::<U>(), 2);
}
```

# Drawbacks
[drawbacks]: #drawbacks

Adding a new type of data structure would increase the complexity of the
language and the compiler implementation, albeit marginally. However, this
change seems likely to provide a net reduction in the quantity and complexity
of unsafe code.

# Alternatives
[alternatives]: #alternatives

- Don't do anything, and leave users of FFI interfaces with unions to continue
writing complex platform-specific transmute code.
- Create macros to define unions and access their fields. However, such macros
make field accesses and pattern matching look more cumbersome and less
structure-like. The implementation and use of such macros provides strong
motivation to seek a better solution, and indeed existing writers and users
of such macros have specifically requested native syntax in Rust.
- Define unions without a new keyword `untagged_union`, such as via
`#[repr(union)] struct`. This would avoid any possibility of breaking
existing code that uses the keyword, but would make declarations more
verbose, and introduce potential confusion with `struct` (or whatever
existing construct the `#[repr(union)]` attribute modifies).
- Use a compound keyword like `unsafe union`, while not reserving `union` on
its own as a keyword, to avoid breaking use of `union` as an identifier.
Potentially more appealing syntax, if the Rust parser can support it.

Contributor

Why not just parse it depending on the context?

Member Author

Interpreting "union" as both a keyword and as an identifier seems rather challenging to support, and potentially fragile for future parser changes.

- Use a new operator to access union fields, rather than the same `.` operator
used for struct fields. This would make union fields more obvious at the
time of access, rather than making them look syntactically identical to
struct fields despite the semantic difference in storage representation.
- The [unsafe enum](https://github.com/rust-lang/rfcs/pull/724) proposal:
introduce untagged enums, identified with `unsafe enum`. Pattern-matching
syntax would make field accesses significantly more verbose than structure
field syntax.
- The [unsafe enum](https://github.com/rust-lang/rfcs/pull/724) proposal with
the addition of struct-like field access syntax. The resulting field access
syntax would look much like this proposal; however, pairing an enum-style
definition with struct-style usage seems confusing for developers. An
enum-based declaration leads users to expect enum-like syntax; a new
construct distinct from both enum and struct does not lead to such
expectations, and developers used to C unions will expect struct-like field
access for unions.

# Unresolved questions
[unresolved]: #unresolved-questions

Can the borrow checker support the rule that "simultaneous borrows of multiple
fields of a struct contained within a union do not conflict"? If not, omitting
that rule would only marginally increase the verbosity of such code, by
requiring an explicit borrow of the entire struct first.

Can a pattern match match multiple fields of a union at once? For rationale,
consider a union using the low bits of an aligned pointer as a tag; a pattern
match may match the tag using one field and a value identified by that tag
using another field. However, if this complicates the implementation, omitting
it would not significantly complicate code using unions.

C APIs using unions often also make use of anonymous unions and anonymous
structs. For instance, a union may contain anonymous structs to define
non-overlapping fields, and a struct may contain an anonymous union to define
overlapping fields. This RFC does not define anonymous unions or structs, but
a subsequent RFC may wish to do so.