-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Large files containing many tokens of const
data compile very slowly and use a lot of memory (in MIR_borrow_checking and expand_crate)
#134404
Comments
Due to the query system this can also be const eval being invoked and generating and interning lots of allocations. So #93215 may be related I don't remember how to get the diff or single output that's used to generate the table in https://perf.rust-lang.org/detailed-query.html?commit=52f4785f80c1516ebece019ae4b69763ffb9a618&benchmark=ripgrep-13.0.0-opt&scenario=incr-unchanged&base_commit=5afd5ad29c014de69bea61d028a1ce832ed75a75 but that gives per query timings. Just in case you want to dig some more |
You can get the raw data with |
And without the macro:
So most of the time is in mir building and const qualif, not in borrowck per se. |
Hmm. There's def opportunity to improve const qualification cc @RalfJung |
Sweet, thanks! If it's something straightforward enough I'm happy to help out too. Would any of these fixes help with the RAM? The performance is an issue but it's not the blocker, the RAM usage is one that actively limits us at times. |
Hmm.. that's harder to debug. The mir const qualif query returns an always-tiny result, so that's not it. We could probably dump size changes of queries without nested queries, similar to the self time. But that's harder We may be able to trim the borrowck result if that's the issue. Rustc may not need all the output anymore |
You may be in luck as that RAM usage seems to come from const qualif as well 😅. It's understandable, really, it's trying to do analyses on MIR that has 250K locals and 730K statements. |
Ah, transient peak memory? It shouldn't generate much persistent data |
Seems like it, yeah, since we're talking about max-rss. And I wonder, if you remember the recent change in dataflow removing the acyclic MIR fast path (which seems a likely shape for a constant), if the xfer function is now cloning the state (I hope it still moves it if there's a single successor; but at the same time there may be unwinding paths all around) because there are 200K blocks. I'd need to check. Also we should try the mixed bitsets (but I haven't checked the distribution of values here, I'd believe they should be very compressible tho). |
There probably is, but it's hard to say where without knowing which part is the bottleneck. Is there any way to get a profile of where const-qualif is spending its time? We do have to iterate over that array at least once. But I doubt that alone would be so slow -- how many elements does this array have? Maybe there's something accidentally quadratic in promotion? Not sure if that is also part of mir_const_qualif. |
Approximately 25,000. Each element looks the same, a nested struct constructor expression with a string at the center. The string can vary in length. |
Manishearth/icu4x_compile_sample> hyperfine -w2 -r5 --prepare "cargo clean" -L rustc 1d35638dc38dbfbf1cc2a9823135dfcf3c650169,8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 "cargo +{rustc} check"
Benchmark 1: cargo +1d35638dc38dbfbf1cc2a9823135dfcf3c650169 check
Time (mean ± σ): 34.761 s ± 0.464 s [User: 21.684 s, System: 13.076 s]
Range (min … max): 34.010 s … 35.210 s 5 runs
Benchmark 2: cargo +8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 check
Time (mean ± σ): 20.605 s ± 0.049 s [User: 18.423 s, System: 2.183 s]
Range (min … max): 20.533 s … 20.657 s 5 runs
Summary
cargo +8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 check ran
1.69 ± 0.02 times faster than cargo +1d35638dc38dbfbf1cc2a9823135dfcf3c650169 check > summarize summarize profiles/demo2-2187676.mm_profdata | grep mir_const_qualif
| mir_const_qualif | 14.79s | 43.514 | 14.79s | 3 | 4.61µs
> summarize summarize profiles/demo2-2187708.mm_profdata | grep mir_const_qualif
| mir_const_qualif | 1.00s | 4.931 | 1.00s | 3 | 3.94µs (BTW, nightly uses a bit more than 1GB) > cargo clean -q && /usr/bin/time -v cargo +1d35638dc38dbfbf1cc2a9823135dfcf3c650169 check -q 2>&1 | grep "Maximum"
Maximum resident set size (kbytes): 14821512
> cargo clean -q && /usr/bin/time -v cargo +8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 check -q 2>&1 | grep "Maximum"
Maximum resident set size (kbytes): 1871272 We'll see how much it impacts regular CFGs when the perf run concludes. Worst case we tune the cutoff between dense and sparse bitsets... |
@Manishearth as I'm not sure whether the CI limits you're hitting are only on max-rss or another metric, let us know how #134438 works for you when it lands in a nightly. |
I don't know either, but I will observe with that nightly! We're currently not hitting limits due to some optimizations we made (reducing tokens in the const code). When we were hitting limits it was intermittent CI stuff. So for me to measure this we'd need to
which I may not actually do. But I think max-rss is probably correct. |
I assume this is referring to #131481? I think the same cloning would happen before/after that change, though I could be wrong. That PR merged in 1.84.0 (currently beta) so it would be easy to check, assuming ICU4X can be built with stable. |
ICU4X does build on stable, as does the reduced testcase. What versions do you want me to compare? |
Let me just compare stable and beta with RUSTC_BOOTSTRAP=1 to use time-passes |
Tiny bit faster, takes up a bit more memory during expansion, a bunch more memory during typechecking but a bunch less memory during MIR. Tested with the reduced testcase, not ICU4X itself.
Stable (1.83)
Beta (1.84)
|
It looks unrelated then, and using a mixed bitset is enough to fix the issue for me. I haven’t looked into it either. It was a possibility because of the 200K memory allocations for dataflow in const qualif: iirc on acyclic cfgs, some analyses can be run in a single pass on RPO (again I haven’t checked that this case is acyclic or that this analysis could do this optimization ) |
…-errors Use `MixedBitSet`s in const qualif These analyses' domains should be very homogeneous, having compressed bitmaps on huge cfgs should make a difference (and doesn’t have an impact on the smaller / regular cfgs in our benchmarks). This is a >40% walltime reduction on [this stress test](https://github.com/Manishearth/icu4x_compile_sample) extracted from a real world ICU case, and a 10x or so max-rss reduction. cc `@oli-obk` `@RalfJung` Should help with (or fix) issue rust-lang#134404.
ICU4X has a concept of "baked data", a way of "baking" locale data into the source of a program in the form of consts. This has a bunch of performance benefits: loading data from the binary is essentially free and doesn't involve any sort of deserialization.
However, we have been facing issues with cases where a single crate contains a lot of data.
I have a minimal testcase here: https://github.com/Manishearth/icu4x_compile_sample. It removes most of the cruft whilst still having an interesting-enough AST in the const data.
cargo build
in thedemo
folder takes 51s, using almost a gigabyte of RAM. Removing the macro does improve things slightly, but not overly slow.Some interesting snippets of
time-passes
:Full time-passes
Even without the intermediate macro,
expand_crate
still increases RAM significantly, though the increase is halved:I understand that to some extent, we are simply feeding Rust a file that is megabytes in size and we cannot expect it to be too fast. It's interesting that MIR borrow checking is slowed down so much by this (there's relatively little to borrow check. I suspect there is MIR construction happening here too). The fact that the RAM usage is almost in the gigabytes is also somewhat concerning; the problematic source file is 7MB, but compilation takes a gigabyte of RAM, which is quite significant. Pair this with the fact that we have many such data files per crate (some of which are large) we end up hitting CI limits.
With the actual problem we were facing (unicode-org/icu4x#5230 (comment)), our time-passes numbers were:
I'm hoping there is at least some low hanging fruit that can be improved here, or advice on how to avoid this problem. So far we've managed to stay within CI limits by reducing the number of tokens, converting stuff like
icu::experimental::dimension::provider::units::UnitsDisplayNameV1 { patterns: icu::experimental::relativetime::provider::PluralPatterns { strings: icu::plurals::provider::PluralElementsPackedCow { elements: alloc::borrow::Cow::Borrowed(unsafe { icu::plurals::provider::PluralElementsPackedULE::from_byte_slice_unchecked(b"\0\x01 acre") }) }, _phantom: core::marker::PhantomData } },
intoicu::experimental::dimension::provider::units::UnitsDisplayNameV1::new_baked(b"\0\x01 acre")
. This works to some extent but the problems remain in the same order of magnitude and can recur as we add more data.The text was updated successfully, but these errors were encountered: