
Compiler spends all available memory when large array is initialized with lazy_static #93215

Open
hniksic opened this issue Jan 22, 2022 · 14 comments
Labels
C-bug Category: This is a bug. I-compilemem Issue: Problems and improvements with respect to memory usage during compilation.

Comments


hniksic commented Jan 22, 2022

I tried this code:

```rust
use std::sync::Mutex;
use lazy_static::lazy_static;

const ROWS: usize = 100000;
const COLS: usize = 10000;

lazy_static! {
    static ref TWODARRAY: Mutex<[[u128; COLS]; ROWS]> = Mutex::new([[0; COLS]; ROWS]);
}

fn main() {
    let mut twodarray = TWODARRAY.lock().unwrap();
}
```

I expected it to compile successfully.

Instead, the compiler consumed all available RAM and swap, and I had to kill it. From `top`:

```
26208 hniksic   20   0  0.102t 7.939g  86392 S  96.8 58.0   0:04.78 rustc
```

Meta

`rustc --version --verbose`:

```
rustc 1.58.1 (db9d1b20b 2022-01-20)
binary: rustc
commit-hash: db9d1b20bba1968c1ec1fc49616d4742c1725b4b
commit-date: 2022-01-20
host: x86_64-unknown-linux-gnu
release: 1.58.1
LLVM version: 13.0.0
```
@hniksic hniksic added the C-bug Category: This is a bug. label Jan 22, 2022
@saethlin (Member)

In fairness to rustc, that array is 14.9 GB. If you only have 8 GB of memory, you'll never be able to run this program even if you do manage to compile it.

And if you do have enough memory to get through whatever's going on here, it looks to me like the incremental compilation system isn't prepared to handle this:

```
thread 'rustc' panicked at 'assertion failed: pos <= u32::MAX as usize', compiler/rustc_query_impl/src/on_disk_cache.rs:119:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: internal compiler error: unexpected panic

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md

note: rustc 1.60.0-dev running on x86_64-unknown-linux-gnu

note: compiler flags: -C embed-bitcode=no -C debuginfo=2 -C incremental --crate-type bin

note: some of the compiler flags provided by cargo are hidden

query stack during panic:
end of query stack
```

(this is a local build because I want debug symbols)

All that said, I feel like this shouldn't be so memory-intensive. The memory consumption appears to come from the MIR interpreter's `InitMask`... I won't claim that for certain, because I'm truly lost in this area of the code, but I'm looking.


eddyb commented Jan 23, 2022

> It looks like the memory consumption appears to be from the MIR interpreter's InitMask...

So I believe what you have here is a very large static that's mostly uninitialized (except for, presumably, the `Once` in it?).
I forget the current implementation of `InitMask`, but if it's one bit per byte, the 14.9 GiB static will require a 1.86 GiB `InitMask`.

We could maybe reduce the cost of some of that tracking, but it's still not great overall.
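As a sanity check on those numbers, here is a back-of-envelope sketch (plain Rust, not rustc internals; the one-bit-per-byte mask is the assumption from the comment above):

```rust
// Back-of-envelope check of the sizes discussed above (a sketch, not rustc code).
const ROWS: usize = 100_000;
const COLS: usize = 10_000;

/// Convert a byte count to GiB.
fn gib(bytes: usize) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}

fn main() {
    let array_bytes = ROWS * COLS * std::mem::size_of::<u128>(); // 16 bytes per u128
    let mask_bytes = array_bytes / 8; // assumption: one init-tracking bit per byte
    println!("array: {:.2} GiB", gib(array_bytes));    // ~14.90 GiB
    println!("init mask: {:.2} GiB", gib(mask_bytes)); // ~1.86 GiB
}
```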

Your executable would likely also need to be 14.9GiB in size, because the static isn't fully uninitialized, so AFAIK it can't go into .bss and instead has to be fully present in .data (unless LLVM gets clever?).


What you may want to do instead is Box your very large 2D array.

That results in successful compilation and this runtime message (on the playground):

```
memory allocation of 16000000000 bytes failed
```
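For reference, a minimal sketch of the boxed route (with much smaller, hypothetical dimensions so it actually runs; note that a naive `Box::new([[0; COLS]; ROWS])` would first build the array on the stack and can overflow it):

```rust
// Hedged sketch: heap-allocate a large 2D array without constructing it on the
// stack first. Dimensions are deliberately small so the example runs anywhere.
const ROWS: usize = 1_000;
const COLS: usize = 1_000;

fn make_grid() -> Box<[[u128; COLS]]> {
    // vec! allocates directly on the heap; into_boxed_slice drops spare capacity.
    vec![[0u128; COLS]; ROWS].into_boxed_slice()
}

fn main() {
    let grid = make_grid();
    assert_eq!(grid.len(), ROWS);
    assert_eq!(grid[ROWS - 1][COLS - 1], 0);
}
```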


saethlin commented Jan 23, 2022

Ah, the allocations are all happening along a code path that calls `copy_op`. So I suspect there's somehow a copy happening in the MIR of this array? And... a lot of copies??? (I clearly don't understand much of MIR.) The `Size` being passed to `InitMask::grow` when they're created makes sense.

Peak memory of the build is 44.8 GB, btw.


oli-obk commented Jan 23, 2022

Yeah, there are a few extra copies of values happening that we could probably avoid, but we always need at least two at some point during CTFE.

It may help to implement a "deinit source on move" scheme; currently, CTFE treats `Operand::Move` and `Operand::Copy` the same.


eddyb commented Jan 23, 2022

> So I suspect there's somehow a copy happening in the MIR of this array?

It's likely something like `(Once, MaybeUninit<Mutex<[[T; N]; M]>>)` - so the big array is in the same allocation, but it's not the thing being allocated and copied around, merely an uninitialized field of it.

The `InitMask` wouldn't have to be that big if the uninitialized data were a suffix of the allocation, but managing the `InitMask` in the presence of clever encodings isn't easy. Supporting only an uninitialized suffix might be possible?


saethlin commented Jan 23, 2022

OP's example is insufficiently minimized. A build of this program peaks at 33.6 GB:

```rust
const ROWS: usize = 100000;
const COLS: usize = 10000;

static TWODARRAY: [[u128; COLS]; ROWS] = [[0; COLS]; ROWS];

fn main() {}
```

Perhaps that makes sense: if we're doing even a single copy of this huge array, that's twice the size of the array. Right?


hniksic commented Jan 23, 2022

@saethlin

> In fairness to rustc, that array is 14.9 GB. If you only have 8 GB of memory, you'll never be able to run this program even if you do manage to compile it.

I do have more than 8 GiB of memory, but rustc was using 100+ GiB, which far exceeds the array size. And even if it didn't, I would hope to be able to allocate large arrays without the compiler having to reproduce the allocation. I often compile programs locally and then deploy them on servers with much more RAM, more CPU cores, etc.

@eddyb

> What you may want to do instead is Box your very large 2D array.

I neglected to note in the issue that I am aware of the possibility of heap-allocating the array. In this case, however, I was explicitly experimenting with allocating the large array statically, akin to a C declaration of `static int twodarray[ROWS][COLS]`. It surprised me that attempting to do so broke the compiler, so I wanted to report it so it can hopefully get fixed, or at least looked at.

@Mark-Simulacrum (Member)

It seems like replacing the `InitMask` (which is basically a `BitSet` -- not sure why we have a distinct impl of it for MIR allocations?) with something like the recently-added `IntervalSet` may make sense -- presumably it's pretty rare for a truly random pattern of initialization to occur.
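To illustrate the idea (a hypothetical toy, not the real `IntervalSet` or `InitMask` API): storing sorted, disjoint initialized ranges makes dense runs cost O(1) space instead of one bit per byte, with point queries by binary search.

```rust
// Hypothetical interval-based init tracking, for illustration only.
#[derive(Default)]
struct IntervalInit {
    ranges: Vec<(usize, usize)>, // half-open [start, end), sorted, disjoint
}

impl IntervalInit {
    fn set_init(&mut self, start: usize, end: usize) {
        // Naive merge: insert, sort, then coalesce overlapping/adjacent ranges.
        self.ranges.push((start, end));
        self.ranges.sort_unstable();
        let mut merged: Vec<(usize, usize)> = Vec::new();
        for (s, e) in self.ranges.drain(..) {
            match merged.last_mut() {
                Some((_, pe)) if s <= *pe => *pe = (*pe).max(e),
                _ => merged.push((s, e)),
            }
        }
        self.ranges = merged;
    }

    fn is_init(&self, i: usize) -> bool {
        // Binary search over range starts: O(log n) per query.
        match self.ranges.binary_search_by(|&(s, _)| s.cmp(&i)) {
            Ok(_) => true,
            Err(0) => false,
            Err(idx) => i < self.ranges[idx - 1].1,
        }
    }
}

fn main() {
    let mut m = IntervalInit::default();
    m.set_init(0, 100);
    m.set_init(200, 300);
    assert!(m.is_init(50));
    assert!(!m.is_init(150));
    assert!(m.is_init(250));
    assert!(!m.is_init(300)); // ranges are half-open
}
```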


eddyb commented Jan 23, 2022

@Mark-Simulacrum I believe something like IntervalSet used to exist and was considered too cumbersome.
Maybe the presence of a reusable abstraction might be enough to let miri use that?


> In this case, however, I was explicitly experimenting with allocating the large array statically, akin to a C declaration of static int twodarray[ROWS][COLS].

To what end? Accessing it won't be more efficient than a pointer to a heap area - passing a `&mut [[u128; COLS]; ROWS]` around as a function argument (i.e. no globals), pointing to the heap, is probably the best you can do.

Putting it in a static just makes the file on disk unnecessarily large, and changes whose responsibility it is (kernel vs your program) to allocate it. And mutation requires a Mutex that you might be able to avoid otherwise.

Out of curiosity, I tried something similar (tho probably not sound on non-trivially-copyable types, heh) in C++ (using constexpr to match Rust's behavior) and it looks like GCC is worse than Clang at this.


> OP's example is insufficiently minimized. A build of this program peaks at 33.6 GB:
>
> ```rust
> const ROWS: usize = 100000;
> const COLS: usize = 10000;
>
> static TWODARRAY: [[u128; COLS]; ROWS] = [[0; COLS]; ROWS];
>
> fn main() {}
> ```
>
> Perhaps that makes sense, if we're doing a copy at all of this huge array, because that's twice the size of the array. Right?

That's worse than the OP example (in terms of initialization state); I'd say this is closer:

```rust
pub static BIG: Option<[u128; 1 << 30]> = None;
```

For `1 << 25` instead of `1 << 30`, even godbolt can compile it (and it even works if it has to evaluate a `const fn`).

Surprisingly, this is the best I can get to compile on godbolt, even fully uninitialized:

```rust
pub static BIG_UNINIT: MaybeUninit<[u128; 1 << 27]> = MaybeUninit::uninit();
```

(though you'll want to replace `u128` with some atomic type, or use some other way to enable interior mutability, to get it to go into `.bss` instead of `.rodata`)

@saethlin (Member)

@Mark-Simulacrum I think I was mistaken to point to `InitMask` as a problem. It's surely where most of the compile time is going, but I think the memory consumption comes from MIR just doing a copy (in the minimal case) or two (in the lazy_static-with-a-Mutex case) of the intended array.

@eddyb

> I believe something like IntervalSet used to exist and was considered too cumbersome.

It looks to me like there are currently `InitMask` and `InitMaskCompressed`, and in this program the interpreter creates a full `InitMask`, then attempts to do run-length encoding on it after the fact. I feel like this is just `IntervalSet` with extra steps?

> Surprisingly, this is the best I can get to compile on godbolt even fully-uninitialized:

On godbolt I see "Compiler returned: 137", which from Googling means something sent the compiler a SIGKILL. The last build that works peaks at 2.1 GB, and of course the next power of 2 peaks at 4.1 GB. I think this is supposed to be unsurprising, because const eval has to construct the const. I suspect there's a (quite generous) 4 GB memory limit on godbolt. So perhaps this is "as expected, but unfortunate"?

@hniksic

> I do have more than 8 GiB of memory, but rustc was spending 100+ GiB, which exceeds array size by far.

Wowee, that's significantly worse. Can you share the lazy_static that was responsible for that amount of memory usage?


hniksic commented Jan 24, 2022

> Can you share the lazy_static that was responsible for that amount of memory usage?

That should actually be the original code in the report. The line from `top` shows the compiler having allocated 100 GiB of virtual memory (and presumably not getting killed, thanks to overcommit). Maybe the difference is that I was compiling it in release mode?


oli-obk commented Jan 25, 2022

> Supporting only an uninitialized suffix might be possible?

Yeah, we could lazily grow it and treat anything beyond its current size as uninitialized.
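A toy sketch of that lazy-growth idea (hypothetical, not rustc's actual data structure): only materialize the mask up to the highest byte ever touched, and treat everything past it as uninitialized.

```rust
// Hypothetical lazily-grown init mask, for illustration only.
struct LazyMask {
    bits: Vec<bool>, // materialized prefix only (a real impl would pack bits)
}

impl LazyMask {
    fn new() -> Self {
        LazyMask { bits: Vec::new() }
    }

    fn set_init(&mut self, i: usize) {
        if i >= self.bits.len() {
            self.bits.resize(i + 1, false); // grow only up to the touched byte
        }
        self.bits[i] = true;
    }

    fn is_init(&self, i: usize) -> bool {
        // Beyond the materialized prefix, everything counts as uninitialized.
        self.bits.get(i).copied().unwrap_or(false)
    }
}

fn main() {
    let mut m = LazyMask::new();
    m.set_init(3);
    assert!(m.is_init(3));
    assert!(!m.is_init(1_000_000)); // no allocation proportional to this index
    assert_eq!(m.bits.len(), 4);
}
```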

@RalfJung (Member)

> It looks to me like there is currently InitMask and InitMaskCompressed, and in this program, the interpreter creates a full InitMask then attempts to do run-length encoding on it after-the-fact. I feel like this is just IntervalSet with extra steps?

Basically, we have InitMask to store the mask of an allocation, and InitMaskCompressed is used when copying the mask from one allocation to another (which might overlap, and we might be asked to repeat this N times, so we have to keep a copy of the original mask).

If it is possible, without a perf loss for typical CTFE, to use the same data structure for both of these cases, that would be nice. :D But I think the problem is that testing whether bit i is set in an `InitMaskCompressed` actually costs linear time, since we have to scan the list of intervals.


tmiasko commented Jan 30, 2022

> But I think the problem is that testing if bit i is set in an InitMaskCompressed actually costs linear time since we have to scan the list of intervals.

If the interval lengths were replaced with prefix sums of the lengths, testing whether the i-th bit is set could be done in logarithmic time (by binary search).
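A sketch of that lookup (hypothetical layout: runs alternate uninitialized/initialized starting with an uninitialized run, and we store prefix sums of the run lengths):

```rust
// Hedged sketch of the prefix-sum lookup. Assumption: runs alternate
// uninitialized, initialized, uninitialized, ... and prefix[k] is the total
// length of the first k+1 runs.
fn is_init(prefix: &[usize], i: usize) -> bool {
    // Number of runs ending at or before byte i == index of the run containing i.
    let run = prefix.partition_point(|&end| end <= i);
    run % 2 == 1 // odd-indexed runs are the initialized ones
}

fn main() {
    // Runs: 10 uninit, 20 init, 5 uninit  ->  prefix sums [10, 30, 35].
    let prefix = [10, 30, 35];
    assert!(!is_init(&prefix, 5));  // inside the first (uninit) run
    assert!(is_init(&prefix, 10));  // first byte of the init run
    assert!(is_init(&prefix, 29));  // last byte of the init run
    assert!(!is_init(&prefix, 30)); // back to uninit
}
```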

@Mark-Simulacrum Mark-Simulacrum self-assigned this Feb 27, 2022
@Mark-Simulacrum Mark-Simulacrum added the I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. label Feb 27, 2022
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 2, 2022
CTFE interning: don't walk allocations that don't need it

The interning of const allocations visits the mplace looking for references to intern. Walking big aggregates like big static arrays can be costly, so we only do it if the allocation we're interning contains references or interior mutability.

Walking ZSTs was avoided before, and this optimization is now applied to cases where there are no references/relocations either.

---

While initially looking at this in the context of rust-lang#93215, I've been testing with smaller allocations than the 16GB one in that issue, and with different init/uninit patterns (esp. via padding).

In that example, by default, `eval_to_allocation_raw` is the heaviest query, followed by `incr_comp_serialize_result_cache`. So I'll show numbers with incremental compilation disabled, to focus on the const allocations themselves (at 95% of the compilation time), at bigger array sizes, on minimal examples like `static ARRAY: [u64; LEN] = [0; LEN];`.

That construction is close to parts of the `ctfe-stress-test-5` benchmark, which has const allocations in the megabytes, while most crates usually have far smaller ones. This PR will have the most impact in those situations, where the walk during interning starts to dominate the runtime.

Unicode crates (some of which are present in our benchmarks), like `ucd`, `encoding_rs`, etc., come to mind as having bigger-than-usual allocations as well, because of their big tables of code points (in the hundreds of KB, so still an order of magnitude or two less than the stress test).

In a check build, for a single static array as shown above, with lengths from 100 to 10^9 `u64`s (in powers of ten), the constant factors are lowered:

(log scales for easier comparisons)
![plot_log](https://user-images.githubusercontent.com/247183/171422958-16f1ea19-3ed4-4643-812c-1c7c60a97e19.png)

(linear scale for absolute diff at higher Ns)
![plot_linear](https://user-images.githubusercontent.com/247183/171401886-2a869a4d-5cd5-47d3-9a5f-8ce34b7a6917.png)

For one of the alternatives of that issue
```rust
const ROWS: usize = 100_000;
const COLS: usize = 10_000;

static TWODARRAY: [[u128; COLS]; ROWS] = [[0; COLS]; ROWS];
```

we can see a similar reduction of around 3x (from 38s to 12s or so).

For the same size, the slowest case IIRC is when there are uninitialized bytes, e.g. via padding:

```rust
const ROWS: usize = 100_000;
const COLS: usize = 10_000;

static TWODARRAY: [[(u64, u8); COLS]; ROWS] = [[(0, 0); COLS]; ROWS];
```
then interning/walking no longer dominates (which means there is likely still some interesting work left to do here).

Compile times in this case rise quite a bit, and avoiding the interning walks has less impact: around 23%, from 730s on master to 568s with this PR.
bors added a commit to rust-lang-ci/rust that referenced this issue Mar 29, 2023
Make init mask lazy for fully initialized/uninitialized const allocations

There are a few optimization opportunities in the `InitMask` and related const `Allocation`s (e.g. by taking advantage of the fact that it's a bitset that represents initialization, which is often entirely initialized or uninitialized in a single call, or gradually built up, etc).

There are a few overwrites of the same state, multiple writes in a row to the same indices, the RLE scheme for `memcpy` doesn't always compress, etc.

Here, we start with:
- avoiding materializing the bitset's blocks if the allocation is fully initialized/uninitialized
- deallocating blocks when fully overwriting, including when participating in `memcpy`s
- taking care of the FIXME about allocating blocks of 0s before overwriting them to the expected value
- expanding unit test coverage of the init mask

This should be most visible on benchmarks and crates where const allocations dominate the runtime (like `ctfe-stress-5` of course), but I was especially looking at the worst cases from rust-lang#93215.

This first change allows the majority of `set_range` calls to stay with a lazy init mask when bootstrapping rustc (not that the init mask is a big part of the process in cpu time or memory usage).

r? `@oli-obk`

I have another in-progress branch where I'll switch the singular initialized/uninitialized value to a watermark, recording the point after which everything is uninitialized. That will take care of cases where full initialization is monotonic and done in multiple steps (e.g. an array of a type without padding), which should then allow the vast majority of const allocations' init masks to stay lazy during bootstrapping (though interestingly I've seen such gradual initialization in both left-to-right and right-to-left directions, and I don't think a single watermark can handle both).
@Mark-Simulacrum Mark-Simulacrum removed their assignment Apr 4, 2023