Large files containing many tokens of const data compile very slowly and use a lot of memory (in MIR_borrow_checking and expand_crate) #134404

Open
Manishearth opened this issue Dec 17, 2024 · 19 comments
Labels
I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Manishearth
Member

Manishearth commented Dec 17, 2024

ICU4X has a concept of "baked data", a way of "baking" locale data into the source of a program in the form of consts. This has a bunch of performance benefits: loading data from the binary is essentially free and doesn't involve any sort of deserialization.

However, we have been facing issues with cases where a single crate contains a lot of data.

I have a minimal testcase here: https://github.com/Manishearth/icu4x_compile_sample. It removes most of the cruft whilst still having an interesting-enough AST in the const data. cargo build in the demo folder takes 51s, using almost a gigabyte of RAM. Removing the macro does improve things slightly, but it is still overly slow.

Some interesting snippets of time-passes:

...
time:   1.194; rss:   52MB ->  595MB ( +543MB)	expand_crate
time:   1.194; rss:   52MB ->  595MB ( +543MB)	macro_expand_crate
...
time:   3.720; rss:  682MB ->  837MB ( +155MB)	type_check_crate
...
time:  55.505; rss:  837MB -> 1058MB ( +221MB)	MIR_borrow_checking
...
time:   0.124; rss: 1080MB ->  624MB ( -456MB)	free_global_ctxt
Full time-passes
time:   0.001; rss:   47MB ->   49MB (   +1MB)	parse_crate
time:   0.001; rss:   50MB ->   50MB (   +0MB)	incr_comp_prepare_session_directory
time:   0.000; rss:   50MB ->   51MB (   +1MB)	setup_global_ctxt
time:   0.000; rss:   52MB ->   52MB (   +0MB)	crate_injection
time:   1.194; rss:   52MB ->  595MB ( +543MB)	expand_crate
time:   1.194; rss:   52MB ->  595MB ( +543MB)	macro_expand_crate
time:   0.013; rss:  595MB ->  595MB (   +0MB)	AST_validation
time:   0.008; rss:  595MB ->  597MB (   +1MB)	finalize_macro_resolutions
time:   0.285; rss:  597MB ->  642MB (  +45MB)	late_resolve_crate
time:   0.012; rss:  642MB ->  642MB (   +0MB)	resolve_check_unused
time:   0.020; rss:  642MB ->  642MB (   +0MB)	resolve_postprocess
time:   0.326; rss:  595MB ->  642MB (  +46MB)	resolve_crate
time:   0.011; rss:  610MB ->  610MB (   +0MB)	write_dep_info
time:   0.011; rss:  610MB ->  611MB (   +0MB)	complete_gated_feature_checking
time:   0.058; rss:  765MB ->  729MB (  -35MB)	drop_ast
time:   1.213; rss:  610MB ->  681MB (  +71MB)	looking_for_derive_registrar
time:   1.421; rss:  610MB ->  682MB (  +72MB)	misc_checking_1
time:   0.086; rss:  682MB ->  690MB (   +8MB)	coherence_checking
time:   3.720; rss:  682MB ->  837MB ( +155MB)	type_check_crate
time:   0.000; rss:  837MB ->  837MB (   +0MB)	MIR_coroutine_by_move_body
time:  55.505; rss:  837MB -> 1058MB ( +221MB)	MIR_borrow_checking
time:   1.571; rss: 1058MB -> 1068MB (  +10MB)	MIR_effect_checking
time:   0.217; rss: 1068MB -> 1067MB (   -1MB)	module_lints
time:   0.217; rss: 1068MB -> 1067MB (   -1MB)	lint_checking
time:   0.311; rss: 1067MB -> 1068MB (   +0MB)	privacy_checking_modules
time:   0.607; rss: 1068MB -> 1068MB (   +0MB)	misc_checking_3
time:   0.000; rss: 1136MB -> 1137MB (   +1MB)	monomorphization_collector_graph_walk
time:   0.778; rss: 1068MB -> 1064MB (   -4MB)	generate_crate_metadata
time:   0.005; rss: 1064MB -> 1085MB (  +22MB)	codegen_to_LLVM_IR
time:   0.007; rss: 1076MB -> 1085MB (  +10MB)	LLVM_passes
time:   0.014; rss: 1064MB -> 1085MB (  +22MB)	codegen_crate
time:   0.257; rss: 1084MB -> 1080MB (   -4MB)	encode_query_results
time:   0.270; rss: 1084MB -> 1080MB (   -4MB)	incr_comp_serialize_result_cache
time:   0.270; rss: 1084MB -> 1080MB (   -4MB)	incr_comp_persist_result_cache
time:   0.271; rss: 1084MB -> 1080MB (   -4MB)	serialize_dep_graph
time:   0.124; rss: 1080MB ->  624MB ( -456MB)	free_global_ctxt
time:   0.000; rss:  624MB ->  624MB (   +0MB)	finish_ongoing_codegen
time:   0.127; rss:  624MB ->  653MB (  +29MB)	link_rlib
time:   0.135; rss:  624MB ->  653MB (  +29MB)	link_binary
time:   0.138; rss:  624MB ->  618MB (   -6MB)	link_crate
time:   0.139; rss:  624MB ->  618MB (   -6MB)	link
time:  65.803; rss:   32MB ->  187MB ( +155MB)	total

Even without the intermediate macro, expand_crate still increases RAM significantly, though the increase is less than half as large:

time:   0.715; rss:   52MB ->  254MB ( +201MB)	expand_crate
time:   0.715; rss:   52MB ->  254MB ( +201MB)	macro_expand_crate

I understand that to some extent, we are simply feeding Rust a file that is megabytes in size and we cannot expect it to be too fast. It's interesting that MIR borrow checking is slowed down so much by this (there's relatively little to borrow check; I suspect there is MIR construction happening here too). The fact that the RAM usage is almost in the gigabytes is also somewhat concerning: the problematic source file is 7MB, but compilation takes a gigabyte of RAM, which is quite significant. Pair this with the fact that we have many such data files per crate (some of which are large), and we end up hitting CI limits.

With the actual problem we were facing (unicode-org/icu4x#5230 (comment)), our time-passes numbers were:

...
time:   1.013; rss:   51MB -> 1182MB (+1130MB)	expand_crate
time:   1.013; rss:   51MB -> 1182MB (+1131MB)	macro_expand_crate
...
time:   6.609; rss: 1308MB -> 1437MB ( +128MB)	type_check_crate
time:  36.802; rss: 1437MB -> 2248MB ( +811MB)	MIR_borrow_checking
time:   2.214; rss: 2248MB -> 2270MB (  +22MB)	MIR_effect_checking
...

I'm hoping there is at least some low hanging fruit that can be improved here, or advice on how to avoid this problem. So far we've managed to stay within CI limits by reducing the number of tokens, converting stuff like icu::experimental::dimension::provider::units::UnitsDisplayNameV1 { patterns: icu::experimental::relativetime::provider::PluralPatterns { strings: icu::plurals::provider::PluralElementsPackedCow { elements: alloc::borrow::Cow::Borrowed(unsafe { icu::plurals::provider::PluralElementsPackedULE::from_byte_slice_unchecked(b"\0\x01 acre") }) }, _phantom: core::marker::PhantomData } }, into icu::experimental::dimension::provider::units::UnitsDisplayNameV1::new_baked(b"\0\x01 acre"). This works to some extent but the problems remain in the same order of magnitude and can recur as we add more data.
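
The token-reduction trick described above can be sketched as follows. This is only an illustration: the struct names and fields below are hypothetical stand-ins, not the real ICU4X definitions, and the real `new_baked` may differ. The idea is that a `const fn` constructor hides the nested struct literal, so each data entry costs the compiler one call expression instead of a deep tree of constructors.

```rust
use std::borrow::Cow;
use std::marker::PhantomData;

// Hypothetical stand-ins for the nested ICU4X types named above.
#[derive(Debug, PartialEq)]
pub struct PackedElements<'a> {
    pub bytes: Cow<'a, [u8]>,
}

#[derive(Debug, PartialEq)]
pub struct PluralPatterns<'a, T> {
    pub strings: PackedElements<'a>,
    pub _phantom: PhantomData<T>,
}

#[derive(Debug, PartialEq)]
pub struct UnitsDisplayName<'a> {
    pub patterns: PluralPatterns<'a, u8>,
}

impl<'a> UnitsDisplayName<'a> {
    /// Builds the whole nested value from the raw bytes, in const context.
    pub const fn new_baked(bytes: &'a [u8]) -> Self {
        Self {
            patterns: PluralPatterns {
                strings: PackedElements { bytes: Cow::Borrowed(bytes) },
                _phantom: PhantomData,
            },
        }
    }
}

pub const DATA: &[u8] = b"\0\x01 acre";

// Verbose form: every data entry spells out the full constructor tree.
pub const VERBOSE: UnitsDisplayName<'static> = UnitsDisplayName {
    patterns: PluralPatterns {
        strings: PackedElements { bytes: Cow::Borrowed(DATA) },
        _phantom: PhantomData,
    },
};

// Compact form: one call per entry, same resulting value.
pub const COMPACT: UnitsDisplayName<'static> = UnitsDisplayName::new_baked(DATA);
```

Both constants hold the same value; only the number of tokens the compiler has to parse, expand and lower per entry changes.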

@Manishearth Manishearth added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 17, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 17, 2024
@oli-obk
Contributor

oli-obk commented Dec 17, 2024

It's interesting that MIR borrow checking is slowed down so much by this (there's relatively little to borrow check.

Due to the query system this can also be const eval being invoked and generating and interning lots of allocations. So #93215 may be related.

I don't remember how to get the diff or single output that's used to generate the table in https://perf.rust-lang.org/detailed-query.html?commit=52f4785f80c1516ebece019ae4b69763ffb9a618&benchmark=ripgrep-13.0.0-opt&scenario=incr-unchanged&base_commit=5afd5ad29c014de69bea61d028a1ce832ed75a75, but that gives per-query timings, just in case you want to dig some more.

@lqd
Member

lqd commented Dec 17, 2024

You can get the raw data with -Zself-profile, then use the measureme tools on the resulting .mm_profdata file: summarize summarize $file prints the query summary and, iirc, summarize diff with 2 files gives the diff oli linked.

@jieyouxu jieyouxu added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Dec 17, 2024
@lqd
Member

lqd commented Dec 17, 2024

+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| Item                                                                    | Self time | % of total time | Time     | Item count | Incremental result hashing time |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_const_qualif                                                        | 16.22s    | 45.785          | 16.22s   | 3          | 4.56µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_built                                                               | 9.98s     | 28.160          | 10.09s   | 4          | 96.73ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_borrowck                                                            | 2.49s     | 7.030           | 19.02s   | 4          | 5.70µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| typeck                                                                  | 2.27s     | 6.420           | 2.36s    | 4          | 23.85ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| expand_crate                                                            | 942.46ms  | 2.660           | 952.89ms | 1          | 0.00ns                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+ 
...
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
Total cpu time: 35.430183553s
...

And without the macro:

+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| Item                                                                    | Self time | % of total time | Time     | Item count | Incremental result hashing time |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_const_qualif                                                        | 14.88s    | 45.361          | 14.88s   | 3          | 5.76µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_built                                                               | 9.86s     | 30.074          | 9.97s    | 4          | 61.53ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_borrowck                                                            | 2.50s     | 7.612           | 17.67s   | 4          | 7.64µs                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| typeck                                                                  | 2.27s     | 6.922           | 2.35s    | 4          | 23.99ms                         |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| expand_crate                                                            | 454.17ms  | 1.385           | 477.20ms | 1          | 0.00ns                          |
+-------------------------------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
...
Total cpu time: 32.792896135s
...

So most of the time is in mir building and const qualif, not in borrowck per se.
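
As a reminder of why the "Self time" and "Time" columns can differ so much, here is a minimal sketch of how self time is usually derived from nested timed spans. The `Span` type is hypothetical and this is not measureme's implementation; the actual nesting of the queries in the tables above is not claimed here.

```rust
/// A timed region that may contain nested timed regions.
pub struct Span {
    pub name: &'static str,
    /// Wall time of this span including everything nested inside it.
    pub inclusive_ms: u64,
    pub children: Vec<Span>,
}

impl Span {
    /// Self time: inclusive time minus the inclusive time of direct children.
    pub fn self_ms(&self) -> u64 {
        let nested: u64 = self.children.iter().map(|c| c.inclusive_ms).sum();
        self.inclusive_ms.saturating_sub(nested)
    }
}

/// Collects (name, self time) for a span and all of its descendants.
pub fn self_times(root: &Span, out: &mut Vec<(&'static str, u64)>) {
    out.push((root.name, root.self_ms()));
    for child in &root.children {
        self_times(child, out);
    }
}
```

A coarse timer that only reports inclusive time charges all nested work to the outermost span, while self time shows where the work actually happens.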

@oli-obk
Contributor

oli-obk commented Dec 17, 2024

Hmm. There's def opportunity to improve const qualification

cc @RalfJung

@Manishearth
Member Author

Sweet, thanks! If it's something straightforward enough I'm happy to help out too.

Would any of these fixes help with the RAM? The performance is an issue but it's not the blocker; the RAM usage is the one that actively limits us at times.

@oli-obk
Contributor

oli-obk commented Dec 17, 2024

Hmm.. that's harder to debug. The mir const qualif query returns an always-tiny result, so that's not it. We could probably dump size changes of queries without nested queries, similar to the self time. But that's harder

We may be able to trim the borrowck result if that's the issue. Rustc may not need all the output anymore

@lqd
Member

lqd commented Dec 17, 2024

Would any of these fixes help with the RAM?

You may be in luck as that RAM usage seems to come from const qualif as well 😅. It's understandable, really, it's trying to do analyses on MIR that has 250K locals and 730K statements.
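
A back-of-the-envelope estimate shows why sizes like these hurt. It assumes, purely for illustration, that an analysis keeps one dense bitset (one bit per local) as the entry state of every basic block, and it uses the roughly 200K blocks mentioned later in this thread. Whether const qualif stores its state exactly this way is an assumption, not something confirmed here.

```rust
/// Number of MIR locals, from the comment above.
pub const LOCALS: u64 = 250_000;
/// Approximate number of basic blocks, as quoted later in this thread.
pub const BLOCKS: u64 = 200_000;

/// Bytes needed for one dense bitset over `locals` bits, rounded up to u64 words.
pub fn dense_state_bytes(locals: u64) -> u64 {
    ((locals + 63) / 64) * 8
}

/// Bytes needed if every block keeps its own dense bitset.
pub fn dense_total_bytes(locals: u64, blocks: u64) -> u64 {
    dense_state_bytes(locals) * blocks
}
```

Under these assumptions one state is about 31 KB and the per-block states total about 6.25 GB for a single bitset per block, so memory grows with locals times blocks rather than with the size of the source.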

@oli-obk
Contributor

oli-obk commented Dec 17, 2024

Ah, transient peak memory? It shouldn't generate much persistent data

@lqd
Member

lqd commented Dec 17, 2024

Seems like it, yeah, since we're talking about max-rss.

And I wonder, if you remember the recent change in dataflow removing the acyclic MIR fast path (which seems a likely shape for a constant), if the xfer function is now cloning the state (I hope it still moves it if there's a single successor; but at the same time there may be unwinding paths all around) because there are 200K blocks. I'd need to check.

Also we should try the mixed bitsets (but I haven't checked the distribution of values here, I'd believe they should be very compressible tho).
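
To illustrate why a compressed representation helps when the values are very homogeneous, here is a simplified sketch of a chunked bitset. It is not rustc's `MixedBitSet` or its chunked bitset: the type names, the 256-bit chunk size and the API are all hypothetical. Chunks that are entirely zeros or entirely ones take no heap memory; only chunks with a mix of bits allocate words.

```rust
/// Bits per chunk. A small illustrative value, not the size rustc uses.
const CHUNK_BITS: usize = 256;
const WORDS: usize = CHUNK_BITS / 64;

#[derive(Clone, Debug, PartialEq)]
enum Chunk {
    /// Every bit in the chunk is 0; nothing is allocated.
    Zeros,
    /// Every bit in the chunk is 1; nothing is allocated.
    Ones,
    /// Some bits are set; the words are stored on the heap.
    Mixed(Box<[u64; WORDS]>),
}

#[derive(Clone, Debug, PartialEq)]
pub struct ChunkedBits {
    len: usize,
    chunks: Vec<Chunk>,
}

impl ChunkedBits {
    pub fn new_empty(len: usize) -> Self {
        let n = (len + CHUNK_BITS - 1) / CHUNK_BITS;
        Self { len, chunks: vec![Chunk::Zeros; n] }
    }

    pub fn contains(&self, i: usize) -> bool {
        assert!(i < self.len);
        let bit = i % CHUNK_BITS;
        match &self.chunks[i / CHUNK_BITS] {
            Chunk::Zeros => false,
            Chunk::Ones => true,
            Chunk::Mixed(words) => words[bit / 64] & (1u64 << (bit % 64)) != 0,
        }
    }

    pub fn insert(&mut self, i: usize) {
        assert!(i < self.len);
        let bit = i % CHUNK_BITS;
        let chunk = &mut self.chunks[i / CHUNK_BITS];
        match chunk {
            Chunk::Ones => {}
            Chunk::Zeros => {
                let mut words = Box::new([0u64; WORDS]);
                words[bit / 64] |= 1u64 << (bit % 64);
                *chunk = Chunk::Mixed(words);
            }
            Chunk::Mixed(words) => {
                words[bit / 64] |= 1u64 << (bit % 64);
                // Collapse back to the compact form once the chunk is full.
                if words.iter().all(|&w| w == u64::MAX) {
                    *chunk = Chunk::Ones;
                }
            }
        }
    }

    /// Number of u64 words currently allocated for mixed chunks.
    pub fn heap_words(&self) -> usize {
        let mixed = self.chunks.iter().filter(|c| matches!(c, Chunk::Mixed(_))).count();
        mixed * WORDS
    }
}
```

With a set over 250,000 bits, an empty or mostly uniform state allocates only the chunk table, whereas a dense bitset always pays for every word.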

@jieyouxu jieyouxu added the I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. label Dec 17, 2024
@RalfJung
Member

There's def opportunity to improve const qualification

There probably is, but it's hard to say where without knowing which part is the bottleneck. Is there any way to get a profile of where const-qualif is spending its time?

We do have to iterate over that array at least once. But I doubt that alone would be so slow -- how many elements does this array have?

Maybe there's something accidentally quadratic in promotion? Not sure if that is also part of mir_const_qualif.

@Manishearth
Member Author

We do have to iterate over that array at least once. But I doubt that alone would be so slow -- how many elements does this array have?

Approximately 25,000. Each element looks the same, a nested struct constructor expression with a string at the center. The string can vary in length.

@lqd
Member

lqd commented Dec 17, 2024

Also we should try the mixed bitsets

Manishearth/icu4x_compile_sample> hyperfine -w2 -r5 --prepare "cargo clean" -L rustc 1d35638dc38dbfbf1cc2a9823135dfcf3c650169,8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 "cargo +{rustc} check"
Benchmark 1: cargo +1d35638dc38dbfbf1cc2a9823135dfcf3c650169 check
  Time (mean ± σ):     34.761 s ±  0.464 s    [User: 21.684 s, System: 13.076 s]
  Range (min … max):   34.010 s … 35.210 s    5 runs

Benchmark 2: cargo +8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 check
  Time (mean ± σ):     20.605 s ±  0.049 s    [User: 18.423 s, System: 2.183 s]
  Range (min … max):   20.533 s … 20.657 s    5 runs

Summary
  cargo +8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 check ran
    1.69 ± 0.02 times faster than cargo +1d35638dc38dbfbf1cc2a9823135dfcf3c650169 check
> summarize summarize profiles/demo2-2187676.mm_profdata | grep mir_const_qualif
| mir_const_qualif    | 14.79s    | 43.514          | 14.79s   | 3          | 4.61µs
> summarize summarize profiles/demo2-2187708.mm_profdata | grep mir_const_qualif
| mir_const_qualif    | 1.00s     | 4.931           | 1.00s    | 3          | 3.94µs

(BTW, nightly uses a bit more than 1GB)

> cargo clean -q && /usr/bin/time -v cargo +1d35638dc38dbfbf1cc2a9823135dfcf3c650169 check -q 2>&1 | grep "Maximum"
        Maximum resident set size (kbytes): 14821512
> cargo clean -q && /usr/bin/time -v cargo +8280a60cd16db5b22b55e7c7d6b9f2d8a3960a20 check -q 2>&1 | grep "Maximum"
        Maximum resident set size (kbytes): 1871272

We'll see how much it impacts regular CFGs when the perf run concludes. Worst case we tune the cutoff between dense and sparse bitsets...
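
For readers who want the headline ratios from the numbers above, a small worked calculation follows. The constants are copied from the hyperfine and /usr/bin/time output; the function names are just for this sketch, and GNU time's "kbytes" are treated as KiB.

```rust
/// Mean `cargo check` wall time before and after, from the hyperfine output above.
pub const OLD_CHECK_SECS: f64 = 34.761;
pub const NEW_CHECK_SECS: f64 = 20.605;

/// Maximum resident set size before and after, from /usr/bin/time -v.
pub const OLD_MAX_RSS_KB: u64 = 14_821_512;
pub const NEW_MAX_RSS_KB: u64 = 1_871_272;

/// Converts KiB to GiB.
pub fn kb_to_gib(kb: u64) -> f64 {
    kb as f64 / (1024.0 * 1024.0)
}

/// How many times faster the new build is.
pub fn speedup() -> f64 {
    OLD_CHECK_SECS / NEW_CHECK_SECS
}

/// Fraction of wall time saved by the new build.
pub fn walltime_reduction() -> f64 {
    1.0 - NEW_CHECK_SECS / OLD_CHECK_SECS
}

/// How many times smaller the new peak memory is.
pub fn rss_ratio() -> f64 {
    OLD_MAX_RSS_KB as f64 / NEW_MAX_RSS_KB as f64
}
```

This works out to roughly a 1.69x speedup (about a 41% wall-time reduction) and a peak memory drop from about 14.1 GiB to about 1.8 GiB, close to 8x.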

@lqd
Member

lqd commented Dec 17, 2024

@Manishearth as I'm not sure whether the CI limits you're hitting are only on max-rss or another metric, let us know how #134438 works for you when it lands in a nightly.

@Manishearth
Member Author

I don't know either, but I will observe with that nightly! We're currently not hitting limits due to some optimizations we made (reducing tokens in the const code). When we were hitting limits it was intermittent CI stuff. So for me to measure this we'd need to

  • move our CI to nightly
  • undo the optimizations
  • wait a few days

which I may not actually do. But I think max-rss is probably correct.

@nnethercote
Contributor

And I wonder, if you remember the recent change in dataflow removing the acyclic MIR fast path (which seems a likely shape for a constant), if the xfer function is now cloning the state (I hope it still moves it if there's a single successor; but at the same time there may be unwinding paths all around) because there are 200K blocks. I'd need to check.

I assume this is referring to #131481? I think the same cloning would happen before/after that change, though I could be wrong. That PR merged in 1.84.0 (currently beta) so it would be easy to check, assuming ICU4X can be built with stable.

@Manishearth
Member Author

ICU4X does build on stable, as does the reduced testcase. What versions do you want me to compare?

@Manishearth
Member Author

Let me just compare stable and beta with RUSTC_BOOTSTRAP=1 to use time-passes

@Manishearth
Member Author

Manishearth commented Dec 18, 2024

Tiny bit faster, takes up a bit more memory during expansion, a bunch more memory during typechecking but a bunch less memory during MIR. Tested with the reduced testcase, not ICU4X itself.

Stable:
time:   1.173; rss:   53MB ->  601MB ( +548MB)	expand_crate
time:   1.173; rss:   53MB ->  601MB ( +548MB)	macro_expand_crate
time:   3.917; rss:  659MB ->  776MB ( +117MB)	type_check_crate
time:  49.468; rss:  776MB -> 1098MB ( +322MB)	MIR_borrow_checking

Beta:
time:   1.243; rss:   51MB ->  601MB ( +550MB)	expand_crate
time:   1.244; rss:   51MB ->  601MB ( +551MB)	macro_expand_crate
time:   3.831; rss:  637MB ->  815MB ( +178MB)	type_check_crate
time:  47.344; rss:  815MB -> 1089MB ( +274MB)	MIR_borrow_checking
Stable (1.83)
[17:54:56] मanishearth@manishearth-glaptop2 ~/dev/Git/icu4x_compile_sample ^_^ 
$ cargo clean; RUSTC_BOOTSTRAP=1 RUSTFLAGS="-Ztime-passes" /usr/bin/time -v cargo +stable build -j1 --all-features
warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions
     Removed 18 files, 104.6MiB total
warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions
   Compiling demo2 v0.1.0 (/home/manishearth/dev/Git/icu4x_compile_sample/demo)
time:   0.001; rss:   49MB ->   50MB (   +1MB)	parse_crate
time:   0.001; rss:   51MB ->   51MB (   +0MB)	incr_comp_prepare_session_directory
time:   0.000; rss:   51MB ->   52MB (   +1MB)	setup_global_ctxt
time:   0.000; rss:   53MB ->   53MB (   +0MB)	crate_injection
time:   1.173; rss:   53MB ->  601MB ( +548MB)	expand_crate
time:   1.173; rss:   53MB ->  601MB ( +548MB)	macro_expand_crate
time:   0.014; rss:  601MB ->  601MB (   +0MB)	AST_validation
time:   0.001; rss:  601MB ->  603MB (   +2MB)	finalize_macro_resolutions
time:   0.287; rss:  603MB ->  650MB (  +47MB)	late_resolve_crate
time:   0.012; rss:  650MB ->  650MB (   +0MB)	resolve_check_unused
time:   0.023; rss:  650MB ->  651MB (   +0MB)	resolve_postprocess
time:   0.323; rss:  601MB ->  651MB (  +49MB)	resolve_crate
time:   0.015; rss:  621MB ->  621MB (   +0MB)	write_dep_info
time:   0.014; rss:  622MB ->  622MB (   +0MB)	complete_gated_feature_checking
time:   0.058; rss:  757MB ->  693MB (  -64MB)	drop_ast
time:   1.299; rss:  621MB ->  657MB (  +36MB)	looking_for_derive_registrar
time:   1.495; rss:  621MB ->  659MB (  +37MB)	misc_checking_1
time:   0.075; rss:  659MB ->  666MB (   +7MB)	coherence_checking
time:   3.917; rss:  659MB ->  776MB ( +117MB)	type_check_crate
time:  49.468; rss:  776MB -> 1098MB ( +322MB)	MIR_borrow_checking
time:   0.955; rss: 1098MB -> 1070MB (  -29MB)	MIR_effect_checking
time:   0.070; rss: 1070MB -> 1070MB (   +0MB)	module_lints
time:   0.070; rss: 1070MB -> 1070MB (   +0MB)	lint_checking
time:   0.197; rss: 1070MB -> 1070MB (   +0MB)	privacy_checking_modules
time:   0.331; rss: 1070MB -> 1070MB (   +0MB)	misc_checking_3
time:   0.005; rss: 1144MB -> 1145MB (   +1MB)	monomorphization_collector_graph_walk
time:   0.001; rss: 1145MB -> 1145MB (   +0MB)	partition_and_assert_distinct_symbols
time:   0.507; rss: 1070MB -> 1071MB (   +1MB)	generate_crate_metadata
time:   0.016; rss: 1071MB -> 1091MB (  +20MB)	codegen_to_LLVM_IR
time:   0.023; rss: 1083MB -> 1090MB (   +7MB)	LLVM_passes
time:   0.042; rss: 1071MB -> 1090MB (  +19MB)	codegen_crate
time:   0.001; rss: 1089MB -> 1083MB (   -6MB)	incr_comp_persist_dep_graph
time:   0.139; rss: 1080MB -> 1079MB (   -1MB)	encode_query_results
time:   0.147; rss: 1081MB -> 1079MB (   -2MB)	incr_comp_serialize_result_cache
time:   0.147; rss: 1083MB -> 1079MB (   -3MB)	incr_comp_persist_result_cache
time:   0.148; rss: 1090MB -> 1079MB (  -11MB)	serialize_dep_graph
time:   0.104; rss: 1079MB ->  646MB ( -434MB)	free_global_ctxt
time:   0.002; rss:  646MB ->  646MB (   +0MB)	finish_ongoing_codegen
time:   0.122; rss:  646MB ->  681MB (  +35MB)	link_rlib
time:   0.127; rss:  646MB ->  681MB (  +35MB)	link_binary
time:   0.131; rss:  646MB ->  646MB (   +0MB)	link_crate
time:   0.134; rss:  646MB ->  646MB (   +0MB)	link
time:  58.742; rss:   33MB ->  189MB ( +156MB)	total
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 58.88s
	Command being timed: "cargo +stable build -j1 --all-features"
	User time (seconds): 44.88
	System time (seconds): 13.91
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:58.90
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 14853340
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 201
	Minor (reclaiming a frame) page faults: 3835408
	Voluntary context switches: 519
	Involuntary context switches: 1050
	Swaps: 0
	File system inputs: 41128
	File system outputs: 211744
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
Beta (1.84)
[17:56:13] मanishearth@manishearth-glaptop2 ~/dev/Git/icu4x_compile_sample ^_^ 
$ cargo clean; RUSTC_BOOTSTRAP=1 RUSTFLAGS="-Ztime-passes" /usr/bin/time -v cargo +beta build -j1 --all-features
warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions
     Removed 18 files, 136.5MiB total
warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions
   Compiling demo2 v0.1.0 (/home/manishearth/dev/Git/icu4x_compile_sample/demo)
time:   0.003; rss:   47MB ->   48MB (   +1MB)	parse_crate
time:   0.001; rss:   49MB ->   49MB (   +0MB)	incr_comp_prepare_session_directory
time:   0.003; rss:   49MB ->   50MB (   +1MB)	setup_global_ctxt
time:   0.001; rss:   51MB ->   51MB (   +0MB)	crate_injection
time:   1.243; rss:   51MB ->  601MB ( +550MB)	expand_crate
time:   1.244; rss:   51MB ->  601MB ( +551MB)	macro_expand_crate
time:   0.013; rss:  601MB ->  602MB (   +0MB)	AST_validation
time:   0.001; rss:  602MB ->  602MB (   +0MB)	finalize_imports
time:   0.000; rss:  602MB ->  602MB (   +0MB)	compute_effective_visibilities
time:   0.007; rss:  602MB ->  603MB (   +2MB)	finalize_macro_resolutions
time:   0.278; rss:  603MB ->  650MB (  +47MB)	late_resolve_crate
time:   0.012; rss:  650MB ->  650MB (   +0MB)	resolve_check_unused
time:   0.027; rss:  650MB ->  650MB (   +0MB)	resolve_postprocess
time:   0.325; rss:  602MB ->  650MB (  +49MB)	resolve_crate
time:   0.015; rss:  621MB ->  621MB (   +0MB)	write_dep_info
time:   0.012; rss:  621MB ->  621MB (   +0MB)	complete_gated_feature_checking
time:   0.064; rss:  734MB ->  672MB (  -61MB)	drop_ast
time:   1.231; rss:  621MB ->  637MB (  +16MB)	looking_for_derive_registrar
time:   0.000; rss:  637MB ->  637MB (   +0MB)	unused_lib_feature_checking
time:   1.424; rss:  621MB ->  637MB (  +16MB)	misc_checking_1
time:   0.094; rss:  637MB ->  643MB (   +5MB)	coherence_checking
time:   3.831; rss:  637MB ->  815MB ( +178MB)	type_check_crate
time:  47.344; rss:  815MB -> 1089MB ( +274MB)	MIR_borrow_checking
time:   0.952; rss: 1089MB -> 1066MB (  -22MB)	MIR_effect_checking
time:   0.130; rss: 1066MB -> 1067MB (   +0MB)	module_lints
time:   0.130; rss: 1066MB -> 1067MB (   +0MB)	lint_checking
time:   0.204; rss: 1067MB -> 1067MB (   +0MB)	privacy_checking_modules
time:   0.394; rss: 1066MB -> 1067MB (   +0MB)	misc_checking_3
time:   0.002; rss: 1140MB -> 1140MB (   +0MB)	monomorphization_collector_root_collections
time:   0.004; rss: 1140MB -> 1141MB (   +1MB)	monomorphization_collector_graph_walk
time:   0.000; rss: 1141MB -> 1141MB (   +0MB)	partition_and_assert_distinct_symbols
time:   0.497; rss: 1067MB -> 1053MB (  -13MB)	generate_crate_metadata
time:   0.018; rss: 1067MB -> 1076MB (   +9MB)	LLVM_passes
time:   0.012; rss: 1053MB -> 1076MB (  +23MB)	codegen_to_LLVM_IR
time:   0.034; rss: 1053MB -> 1076MB (  +23MB)	codegen_crate
time:   0.001; rss: 1076MB -> 1076MB (   +0MB)	incr_comp_persist_dep_graph
time:   0.159; rss: 1070MB -> 1069MB (   -1MB)	encode_query_results
time:   0.169; rss: 1072MB -> 1069MB (   -3MB)	incr_comp_serialize_result_cache
time:   0.169; rss: 1075MB -> 1069MB (   -6MB)	incr_comp_persist_result_cache
time:   0.170; rss: 1076MB -> 1069MB (   -7MB)	serialize_dep_graph
time:   0.085; rss: 1069MB ->  651MB ( -418MB)	free_global_ctxt
time:   0.089; rss:  651MB ->  686MB (  +35MB)	link_rlib
time:   0.095; rss:  651MB ->  686MB (  +35MB)	link_binary
time:   0.097; rss:  651MB ->  652MB (   +0MB)	link_crate
time:   0.099; rss:  651MB ->  652MB (   +0MB)	link
time:  56.551; rss:   31MB ->  189MB ( +158MB)	total
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 56.75s
	Command being timed: "cargo +beta build -j1 --all-features"
	User time (seconds): 44.59
	System time (seconds): 11.99
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:56.86
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 14866168
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 770
	Minor (reclaiming a frame) page faults: 3674975
	Voluntary context switches: 1129
	Involuntary context switches: 1047
	Swaps: 0
	File system inputs: 300440
	File system outputs: 211624
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@saethlin saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 18, 2024
@lqd
Member

lqd commented Dec 18, 2024

It looks unrelated then, and using a mixed bitset is enough to fix the issue for me.

I haven’t looked into it either. It was a possibility because of the 200K memory allocations for dataflow in const qualif: iirc on acyclic cfgs, some analyses can be run in a single pass on RPO (again, I haven’t checked that this case is acyclic or that this analysis could do this optimization).
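
The single-pass idea can be sketched as follows. This is a generic illustration, not rustc's dataflow engine: the `Cfg` type, the `u64` bitmask state and the function names are hypothetical. On an acyclic CFG, reverse postorder (RPO) is a topological order, so every predecessor of a block is processed before the block itself and a forward analysis with a union join reaches its fixpoint in one sweep.

```rust
/// Successor lists indexed by block; block 0 is the entry block.
pub type Cfg = Vec<Vec<usize>>;

/// Blocks reachable from the entry, in reverse postorder (iterative DFS).
pub fn reverse_postorder(cfg: &Cfg) -> Vec<usize> {
    if cfg.is_empty() {
        return Vec::new();
    }
    let mut visited = vec![false; cfg.len()];
    let mut post = Vec::with_capacity(cfg.len());
    // Each stack entry is (block, index of the next successor to visit).
    let mut stack = vec![(0usize, 0usize)];
    visited[0] = true;
    while let Some(top) = stack.last_mut() {
        let (block, idx) = *top;
        if idx < cfg[block].len() {
            top.1 += 1;
            let succ = cfg[block][idx];
            if !visited[succ] {
                visited[succ] = true;
                stack.push((succ, 0));
            }
        } else {
            post.push(block);
            stack.pop();
        }
    }
    post.reverse();
    post
}

/// Forward "may" analysis on an acyclic CFG: returns the entry state of each block.
/// `generated[b]` is the set of facts (as a bitmask) that block `b` adds.
pub fn forward_single_pass(cfg: &Cfg, generated: &[u64]) -> Vec<u64> {
    let mut entry = vec![0u64; cfg.len()];
    for block in reverse_postorder(cfg) {
        let exit = entry[block] | generated[block];
        for &succ in &cfg[block] {
            // Join is set union; no block is ever revisited.
            entry[succ] |= exit;
        }
    }
    entry
}
```

For a diamond-shaped CFG (0 branches to 1 and 2, which both jump to 3), block 3's entry state is the union of what blocks 0, 1 and 2 generate, computed without any worklist iteration. With a cyclic CFG this single sweep would not be sufficient.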

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 20, 2024
…-errors

Use `MixedBitSet`s in const qualif

These analyses' domains should be very homogeneous, having compressed bitmaps on huge cfgs should make a difference (and doesn’t have an impact on the smaller / regular cfgs in our benchmarks).

This is a >40% walltime reduction on [this stress test](https://github.com/Manishearth/icu4x_compile_sample) extracted from a real world ICU case, and a 10x or so max-rss reduction.

cc `@oli-obk` `@RalfJung`

Should help with (or fix) issue rust-lang#134404.