-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alloca for Rust #1808
Alloca for Rust #1808
Conversation
[drawbacks]: #drawbacks | ||
|
||
- Even more stack usage means the dreaded stack limit will probably be reached even sooner. Overflowing the stack space | ||
leads to segfaults at best and undefined behavior at worst. On unices, the stack can usually be extended at runtime, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we not be able to correctly probe the stack for space when alloca-ing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to do stack probes. If they aren't used, though, this must be marked unsafe, as it's trivial to corrupt memory without them.
Speaking of which, are stack probes still not actually in place for regular stack allocations? Just tried compiling a test program (that allocates a large array on the stack) with rustc nightly on my system and didn't see any probes in the asm output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to do stack probes. If they aren't used, though, this must be marked unsafe, as it's trivial to corrupt memory without them.
Can't one already overflow the stack with recursion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, which is why stack probes are inserted so that the runtime can detect that and abort with a fatal runtime error: stack overflow
message rather than just run off into whatever memory is below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except they actually aren't, since it hasn't been implemented yet (or maybe it only works on Windows?). OSes normally provide a 4kb guard page below the stack, so most stack overflows will crash anyway (and trigger that error, which comes from a signal handler), but a function with a big enough stack frame can skip past that page, and I think it may actually be possible in practice to exploit some Rust programs that way... I should try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @whitequark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am working on integrating stack probes, which are required for robustness on our embedded system (ARTIQ). I expect to implement them in a way generic enough to be used on all bare-metal platforms, including no-MMU, as well as allow for easy shimming on currently unsupported OSes.
# Unresolved questions | ||
[unresolved]: #unresolved-questions | ||
|
||
- Could we return the slice directly (reducing visible complexity)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have missed it, but I'm not sure I understand why we wouldn't be able to just return the slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess:
Because the size of the returned value could be variable and would need to be copied (or preserved on the stack). Not sure rust has anything that does that right now. Suppose it could be done, could just become a detail of the calling convention. Can't have caller allocate unless it knows how much memory to allocate, and making it aware of the memory needed by the callee could complicate things (and would likely be impossible to do in some cases without running the callee twice).
If these stack allocations were parameterized with type-level numbers, it would be fairly straight forward (ignoring all the general complexity of type level numbers), but this RFC doesn't propose that.
compiler would become somewhat more complex (though a simple incomplete escape analysis implementation is already in | ||
[clippy](https://github.com/Manishearth/rust-clippy). | ||
|
||
# Unresolved questions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C's alloca
can't be used inside of a function argument list - would we need the same restriction or would we handle that properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that that limitation exists primarily to allow naive single pass compilers to exist (along with some "interesting" ways of implementing alloca()
). I don't think that concern would apply to rust.
Using a function for this would be a mistake IMO. It's not a true function in C either, for various reasons.
This model is closer to VLAs than |
I do know some people are very wary of alloca/vlas in rust, but I can't remember who.... |
It's worth being careful to avoid accidentally considering any constant |
@burdges We already have an analysis which treats rust-lang/rust#34344 correctly - the problem there is the evaluator (which is AST-based and pre-typeck, this will change next year hopefully), not the qualifier. A simpler and far more problematic example would be something like |
@steveklabnik That might've been @cmr. |
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
- Even more stack usage means the dreaded stack limit will probably be reached even sooner. Overflowing the stack space |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the flip side, this might reduce stack usage for users of ArrayVec
and those manually allocating overly-large arrays on the stack (I occasionally do this when reading small files).
Could we get an example of how one might use this |
``` | ||
|
||
`StackSlice`'s embedded lifetime ensures that the stack allocation may never leave its scope. Thus the borrow checker | ||
can uphold the contract that LLVM's `alloca` requires. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite see why the type of reserve
would prevent StackSlice
to escape the scope where reserve
is called. More precisely, It seems like the user could define:
fn reserve_and_zero<'a>(elements : usize) -> &'a[u32] {
let s = *reserve(usize);
for x in s.iter_mut() { *x = 0 }
return s
}
Which would be invalid if it is not inlined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the signatures proposed here is completely unsound. Since 'a
is a lifetime parameter not constrained by anything, any call to reserve
can simply pick 'a = 'static
.
|
||
- `mem::with_alloc<T, F: Fn([T]) -> U>(elems: usize, code: F) -> U` This has the benefit of reducing API surface, and | ||
introducing rightwards drift, which makes it more unlikely to be used too much. However, it also needs to be | ||
monomorphized for each function (instead of only for each target type), which will increase compile times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the only solution that will really work, since otherwise you can always use the reserve
intrinsic to create a slice with a 'static
lifetime
monomorphized for each function (instead of only for each target type), which will increase compile times. | ||
|
||
- dynamically sized arrays are a potential solution, however, those would need to have a numerical type that is only | ||
fully known at runtime, requiring complex type system gymnastics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generalization of this features is to allow unsized locals (which simply allocate stack space for the size given by the memory which they are moved out of)
To prevent accidental usage of this, a language item StackBox<T: !Sized>
could be created, which alloca's all its memory depending on the size_of_val
. The problem is that this would need to run code at moving, which is unprecedented in Rust
@burdges Oh I remember now, it's just that it's been a while since I thought about this. |
As I have already said in an issue somewhere, the assembly generated by SmallVec is very disappointing and saying that it "works well enough" is not really true. For example this Rust code: extern {
fn foo(data: *const u32);
}
pub fn test() {
let mut a: SmallVec<[u32; 4]> = SmallVec::new();
a.push(5);
a.push(12);
a.push(94);
a.push(1);
unsafe { foo(a.as_ptr()) }
} ...generates the assembly below: pushq %rbp
subq $80, %rsp
leaq 80(%rsp), %rbp
movq $-2, -8(%rbp)
xorps %xmm0, %xmm0
movups %xmm0, -36(%rbp)
movaps %xmm0, -48(%rbp)
leaq -48(%rbp), %rcx
movl $5, %edx
callq _ZN36_$LT$smallvec..SmallVec$LT$A$GT$$GT$4push17hcb9707aa0afc3e57E
leaq -48(%rbp), %rcx
movl $12, %edx
callq _ZN36_$LT$smallvec..SmallVec$LT$A$GT$$GT$4push17hcb9707aa0afc3e57E
leaq -48(%rbp), %rcx
movl $94, %edx
callq _ZN36_$LT$smallvec..SmallVec$LT$A$GT$$GT$4push17hcb9707aa0afc3e57E
leaq -48(%rbp), %rcx
movl $1, %edx
callq _ZN36_$LT$smallvec..SmallVec$LT$A$GT$$GT$4push17hcb9707aa0afc3e57E
cmpl $1, -40(%rbp)
leaq -36(%rbp), %rcx
cmoveq -32(%rbp), %rcx
callq foo
cmpl $1, -40(%rbp)
jne .LBB4_5
movq -24(%rbp), %rdx
testq %rdx, %rdx
je .LBB4_8
movq -32(%rbp), %rcx
shlq $2, %rdx
movl $4, %r8d
callq __rust_deallocate
jmp .LBB4_8
leaq -32(%rbp), %rax
movl $1, -40(%rbp)
xorps %xmm0, %xmm0
movups %xmm0, (%rax)
addq $80, %rsp
popq %rbp
retq And this is the pushq %r15
pushq %r14
pushq %rsi
pushq %rdi
pushq %rbp
pushq %rbx
subq $40, %rsp
movl %edx, %r14d
movq %rcx, %rsi
movl 8(%rsi), %ebx
cmpl $1, %ebx
movl $4, %ebp
cmoveq 24(%rsi), %rbp
movq (%rsi), %rax
cmpq %rbp, %rax
jne .LBB1_1
leaq (%rbp,%rbp), %rax
cmpq $1, %rax
movl $1, %r15d
cmovaq %rax, %r15
cmpq %r15, %rbp
ja .LBB1_11
movl $4, %ecx
movq %r15, %rax
mulq %rcx
jo .LBB1_12
movl $1, %edi
testq %rax, %rax
je .LBB1_6
movl $4, %edx
movq %rax, %rcx
callq __rust_allocate
movq %rax, %rdi
testq %rdi, %rdi
je .LBB1_13
leaq 12(%rsi), %rdx
cmpl $1, %ebx
cmoveq 16(%rsi), %rdx
shlq $2, %rbp
movq %rdi, %rcx
movq %rbp, %r8
callq memcpy
cmpl $1, 8(%rsi)
jne .LBB1_9
movq 24(%rsi), %rdx
testq %rdx, %rdx
je .LBB1_9
movq 16(%rsi), %rcx
shlq $2, %rdx
movl $4, %r8d
callq __rust_deallocate
movl $1, 8(%rsi)
movq %rdi, 16(%rsi)
movq %r15, 24(%rsi)
movq (%rsi), %rax
jmp .LBB1_10
leaq 12(%rsi), %rdi
cmpl $1, %ebx
cmoveq 16(%rsi), %rdi
movl %r14d, (%rdi,%rax,4)
incq (%rsi)
addq $40, %rsp
popq %rbx
popq %rbp
popq %rdi
popq %rsi
popq %r14
popq %r15
retq (I stripped out the labels in order to make it more readable). As you can see this is really suboptimal, both in terms of performances (number of instructions executed) and binary size. As this point I'm even wondering if using a If it remained to be proven, this is for me the main motivation for having |
I am unconvinced this is a benefit. When properly implemented, alloca is safe, extremely cheap, has nice locality and aliasing guarantees and even avoids an |
Interesting @eddyb so Actually I suppose it'd eventually compile if |
Thanks everyone for the great comments! I'm going to follow @oli-obk 's advice and change to the closure based version. Need some time to do it, though... 😄 |
@llogiq I'm trying to tell you that anything using functions is unnecessarily suboptimal from an implementation perspective. The feature LLVM has is effectively VLAs. clang implements
@burdges I'm not sure what you mean here, you can never return a stack local and rvalue promotion is backwards compatible in that expanding the set of constants only allows more code to compile. |
@llogiq Apart from @eddyb's concerns, which I also support, using a closure will make this feature significantly less useful. For example then you could not call alloca in a loop while accumulating the results, which I would like to rely on for writing zero-heap-allocation parsers. Also, in general I am opposed to deliberately decreasing usability in arbitrary ways because of your (or anyone else's) subjective opinion of whether it is "considered bad". You are not making a tradeoff with some other feature, as it usually goes in language design, you are simply making my life harder for no reason at all. |
Now, a constructive suggestion: it looks like there is no existing way to reflect the lifetime of alloca exactly, and I agree with other commenters that introducing VLAs seems like an excessively invasive change for a fairly niche use case. How about making it a built-in macro? let n = 10; for i in 1..n { let ary: &[u8] = alloca![0; n]; } This is nicely similar to the existing case of |
@whitequark While that could work, doing it with an intrinsic is more problematic than having it first-class. Hmm, we can make the macro encapsulate the unsafety by using a helper function that takes a reference to some other stack local and equates the lifetimes, only problem being that I'm not sure how to create something like that which lives long enough to be useful. May need to abuse some of the rvalue lifetime rules to get it. EDIT: I suppose there's always the possibility of making the intrinsic generic over a lifetime that the compiler restricts artificially. |
Yes, that's what I was thinking about. The intrinsic would merely translate to alloca of the right type directly, and the lifetime will be restricted early (you could do it right after you introduce lifetimes into the AST, really). I think this is a very nonintrusive solution. |
@eddyb Ok, so it would need to be a compiler-internal macro. That's a neat solution to both the lifetime problem and the problem of extending the compiler with as little fuss as possible. One thing I should probably add – because this is low-level, performance-über-alles, we'd likely not want to initialize the memory, so we could expand to mem::uninitialized(), rigging the type as necessary, but the whole thing would be unsafe and should be marked as such. I'm not too deep into macros; do we have unsafe macros? |
No, this isn't necessarily the case. There are different motivations for using alloca: performance, avoiding using the heap (because there might not be one), maybe others. I want to use this alloca facility but I still zero the memory first and in particular I don't want to have to use I'd suggest instead that the feature should allow the user to somehow specify the value to fill the memory with. Then the user could pass |
I agree with @briansmith on the safety aspect. I think that a safe interface should be provided as main interface with a additional way to have it uninitialized. Through passing in So maybe |
Ideally the initializing version would take an iterator so it could be initialized with something other than copies of the same value. |
@whitequark Can you elaborate on cases that this would solve but VLAs wouldn't? |
@joshtriplett Allocating in a loop. |
This is overstating things a bit. There's one particular thing that you want to do, which that proposal (nor the current form of this one, IIUC), doesn't support. It's not clear that that particular thing really needs to be supported, and that thing doesn't comprise all or even most “embedded” use cases. However, I think there is a more serious problem with #1909, and maybe also this RFC: The proposed syntax for VLAs hint that they can be used intuitively like arrays |
This is already supported AFAIK. The last field of a struct can be a DST, and is itself thus a DST. |
I understand that. But, adding support for stack-allocating values of such structs is, IMO, counterproductive, because of the severe limitations mentioned in #1909: They can't be passed by value (copied/moved) nor returned from functions. Given those limitations, is stack allocation of these types widely useful? Those limitations are dealbreakers for my code. Maybe other people have other use cases for which they'd work, in which case it would be useful to see them. |
@whitequark Hmmm. You could certainly allocate a VLA in a loop. But as @briansmith notes, you can't copy/move them, which means you can't accumulate them in a loop. Fair point. That seems worth considering in the discussions between this issue and #1909. |
Accumulating alloca (aka |
With #1909, you can definitely move rvalues (e.g., you should be able to call |
That sounds better than I understood from the discussion. Could we return them from functions? In particular, given `#[derive(Clone, Copy)]
struct Elem {
value: [usize]
}
impl Elem {
pub fn new(...) -> Self { ... }
}
pub fn elem_product(x: &Elem, y: &Elem, m: &Modulus) -> Elem { ... }
pub fn elem_pow(base: Elem, e: i32, m: &Modulus) -> Elem {
let mut table = [Elem { value: [0; dynamic base.value.len()] }; 5];
...
} If VLAs look like arrays then people will expect all of that kind of stuff to work, especially since it works fine for arrays. If that stuff doesn't work then a syntax and semantics that looks and feels more like " |
There's also a possible way of returning them from functions, but the ABI implications are... quite interesting in a possibly non-portable way (either the called function must allocate stack on the caller's stack, or the caller needs to access memory after the called function returned). |
🔔 This is now entering its final comment period, as per the review above. 🔔 |
1 similar comment
🔔 This is now entering its final comment period, as per the review above. 🔔 |
It's nice to see Rust take more care of stack allocation. A VLA is usable as brick to build more complex stack-allocated data structures. |
The final comment period is now complete. |
It is not yet clear to me whether #1909 or this RFC is the way forward, as both seem incomplete to me. |
Ain't clear if you can avoid some incompleteness here since it'd be nice to future-proof against some rather far out stuff. As an extreme example, We clearly want stuff like raw-ish alloca well before anything like this. And I think the syntax |
@budges yeah that's roughly what we eventually want in principle, but there's so many subtleties in the design I'd be extremely wary about declaring something forward comparable now. |
Per the FCP, I am going to close this RFC, in favor of unsized rvalues. Thanks @llogiq for the work you put into it. =) |
Is it worth experimenting with making
|
@burdges: If a dependency then impls Trait on an unsized type, then that's a semver break. This would basically make unsized types implicitly "If none do X, then Y" - negative reasoning - is a thorny thicket to dive into. |
Oops, I misspoke, but you read into my mistake more sense than existed. ;) There is no way the compiler could statically determine all types It requires way more hackery to make this make sense:
In this code, we know all I can now answer my own question: We lack a convenient enough way to impl traits for anonymous types to make this an ergonomic solution for any problems under consideration. If the calling convention ever allows returning VLAs then a much better feature falls out naturally without the hackery. If that calling convention incurs any runtime costs, then the compiler could compile appropriate cases to sized returns under the hood. |
@burdges: It's more that I then applied the same reasoning to |
A trait exported from any crate could never benefit. I read you point two ways: First, adding impls can now break things within the same crate. Second, even if
|
(rendered)