
convenient self-profile option? #50780

Closed

nikomatsakis opened this issue May 15, 2018 · 16 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance

Comments

@nikomatsakis
Contributor

It'd be really nice if there were an easy way to break down compilation time into a few categories, and to get a summary of reuse:

  • Parsing/expansion
  • Type-checking, borrow-checking
  • Codegen (aka trans)
  • LLVM + linking

Something that we could conveniently copy and paste into bug reports to get an overall sense of what is going on. This would replace and/or build on the existing profile options. (Even better would be if we could then have an option to get per-query data, later on.)

cc @nnethercote @michaelwoerister

@nikomatsakis nikomatsakis added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance labels May 15, 2018
@nnethercote
Contributor

Excluding the reuse part, isn't this what -Ztime-passes is for?

I've been told that profiling queries are meant to replace -Ztime-passes, but when I followed the instructions at https://forge.rust-lang.org/profile-queries.html I couldn't get it to work -- it produced empty files.

Here are some ways that I think -Ztime-passes could be improved:

  • Fewer buckets. There are too many that take up almost no time, and just clutter the output.
  • Percentages. I don't like having to constantly do percentage math (this stage took 3.7s out of 11.1s) in my head. Parallelism complicates this.
  • The nesting is unclear. It took me ages to realize that the total for nested buckets shows up below them.
  • Parallelism messes up the ordering.

A lot of improvements could be made by recording data and then having a post-processing step, which could aggregate insignificant buckets, for example. I don't know whether that post-processing should happen within rustc itself or be a separate tool.

@nikomatsakis
Contributor Author

@nnethercote I think that "time-passes" and the profile-queries concepts probably ought to be merged, and renamed to something like -Zself-profile.

One problem with time-passes is that the compiler is moving towards a model where it doesn't have passes, but we're not there yet. This means that the implementation model of time-passes is pretty broken: it is based around some top-level calls, but in fact those calls can unreliably include a bunch of unrelated work. For example, invoking the MIR borrow checker may also construct MIR, but it's probably not a good idea to account for those the same way.

This is what profile-queries was meant to fix, but as you observed, the impl is broken. I think we should probably fix profile-queries and simplify it; for example, it doesn't seem like we've really had much need for the HTML-based output it produces, and we'd rather just get a flat list of which queries are taking up time. (And yes, it'd be nice to group those queries, because often we don't need that much detail.)

I think my preference would be that when we declare queries, we also declare a "phase" or something that is supposed to be used for a coarse-grained dump (and then you could use -Zfull-profile or something if you want the full details).

@wesleywiser
Member

I'd like to work on this.

@michaelwoerister
Member

It would be great if we had a solution that can deal well with parallelism, potentially also showing points of contention. We have the (possibly bitrotted) -Ztrans_time_graph option which generates a diagram that is rather crude but was indispensable for implementing the scheduler correctly. There was also some discussion about generalizing this and using a known output format that can be visualized in existing tools here: #43804

That being said, I don't think we should try to fix any of the existing solutions; we should build a new one instead. It should support at least a pure text summary format that can easily be collected by users and pasted into bug reports. An additional, more detailed output format that can be easily processed by other tools would be great too.

@nnethercote, are there any lessons to be learned from FF's profiling support? Or the valgrind-based tools?

Also cc the whole @rust-lang/wg-compiler-performance

@nnethercote
Contributor

Apologies for the slow reply.

Text-based output is indeed extremely convenient for pasting into bugs, etc., so I recommend it whenever it's appropriate. The simplest thing to do is just dump out the final text directly from rustc, as -Ztime-passes currently does, but that's inflexible.

The alternative is to write data to a file and then have a separate viewing tool. If the data has a tree-like structure (as -Ztime-passes already does) then the ability to expand/collapse subtrees is very useful. Firefox's about:memory is a good example of this. Here's some sample output:

397.53 MB (100.0%) -- explicit
├──108.74 MB (27.35%) ── heap-unclassified
├───63.97 MB (16.09%) -- js-non-window
│   ├──39.85 MB (10.02%) -- zones
│   │  ├──35.94 MB (09.04%) -- zone(0x114a80000)
│   │  │  ├──18.46 MB (04.64%) ++ (36 tiny)
│   │  │  └──17.48 MB (04.40%) -- realm([System Principal], shared JSM global)
│   │  │     ├───8.81 MB (02.22%) ++ classes
│   │  │     └───8.67 MB (02.18%) ++ (11 tiny)

It's all text so it can be copied and pasted easily. (The lines are Unicode light box-drawing chars \u2500, \u2502, \u2514, \u251c, which usually render nicely in terminals and text boxes.) But it's also interactive: when viewing it within Firefox you can click on any of those lines that have "++" or "--" to expand or collapse a sub-tree. Also, by default it only shows as expanded the sub-trees that exceed a certain threshold; this is good because you can collect arbitrarily detailed data (i.e. as many buckets, sub-buckets, sub-sub-buckets, etc., as you like) without it being overwhelming.

I would recommend using a web browser as your display tool. One possibility is to generate an HTML page (or pages) directly from rustc, as is done by profile-queries. Another possibility is to generate some other intermediate format (JSON?) and then write a webpage/website that reads that in, as is done by rustc-perf's site. The latter approach might be more flexible.

@wesleywiser
Member

I've got a very basic version of this here. Here's a sample of the output:

wesley@endurance:~/code/rust/rustc-serialize> RUSTFLAGS="-Zself-profile" cargo build
   Compiling rustc-serialize v0.3.24 (file:///home/wesley/code/rust/rustc-serialize)
Self profiling results for rustc_serialize:
Parsing         		 27             
Expansion       		 132            
TypeChecking    		 2199           
	1597268 hits 1666449 queries
BorrowChecking  		 150            
	10064 hits 14436 queries
Codegen         		 2098           
	44011 hits 54580 queries
Linking         		 17             
	4724 hits 5660 queries
Other           		 601            
	2389409 hits 2461645 queries

Optimization level: No
Incremental: on
    Finished dev [unoptimized + debuginfo] target(s) in 5.36 secs

It might be nice to have a category for "incremental overhead", such as hashing and serializing/deserializing the graph? Perhaps Parsing could also output the number of lines parsed to give a sense of the size of the crate? Are there any other rustc options that would be useful to include in the output?

@michaelwoerister
Member

That looks very cool! Some feedback:

  • It would be good to display the time unit (are these ms?).
  • Is hits cache hits? It might be more readable to display a ratio instead of absolute numbers.
  • Some other rustc options take the --verbose flag into account. This could be used here to control how much data this provides.
  • Having more data about incremental would be great. Some of the aggregate data from -Zincremental-info could also be shown here (like CGU re-use, red vs green nodes, etc).

@nikomatsakis
Contributor Author

This is awesome work @wesleywiser! 💯

I agree with @michaelwoerister's suggestion that we include ms or some other unit. I also think it'd be nice to display the data more "tabularly". Something like this:

| Phase                  | Time (ms) | Queries | Hits (%) |
| -----                  | --------- | ------- | -------- |
| Parsing                | 27        |         |          |
| Expansion              | 132       |         |          |
| TypeChecking           | 2199      | 1666449 | 95.8     |
| BorrowChecking         | 150       | 14436   | 69.7     |
| Codegen                | 2098      | 54580   | 80.6     |
| Linking                | 17        | 5660    | 83.5     |
| Other                  | 601       | 2461645 | 97.1     |

I've gone ahead and made it valid Markdown, too, so that it can easily be pasted for nice display =)


Finally, I think we should display the time as a percentage of the total for each phase.

As a last point, I think it'd be great if — to determine the total time — we do a single measurement from the "outside". This will help us to ensure we don't have "leakage" — that is, time that is not accounted for. Or at least it will give us an idea how much leakage there is.

@wesleywiser
Member

Thanks for the feedback @nikomatsakis and @michaelwoerister! Those are all great ideas and I'll start working on those.

@nnethercote
Contributor

So it looks like the output is being produced directly by rustc, is that right?

@wesleywiser
Member

@nnethercote Yes, that's correct.

@wesleywiser
Member

Progress update: I've implemented much of the above feedback. The output now looks like this:

wesley@endurance:~/code/rust/rustc-serialize> RUSTFLAGS="-Zself-profile -Zprofile-json" cargo build
   Compiling rustc-serialize v0.3.24 (file:///home/wesley/code/rust/rustc-serialize)
Self profiling results for rustc_serialize:

| Phase            | Time (ms)      | Queries        | Hits (%) |
| ---------------- | -------------- | -------------- | -------- |
| Parsing          | 27             |                |          |
| Expansion        | 132            |                |          |
| TypeChecking     | 2182           | 3321830        | 96.17    |
| BorrowChecking   | 144            | 28872          | 69.71    |
| Codegen          | 2065           | 110571         | 79.61    |
| Linking          | 17             | 11320          | 83.46    |
| Other            | 590            | 4922566        | 97.07    |

Optimization level: No
Incremental: on
    Finished dev [unoptimized + debuginfo] target(s) in 5.29 secs

In addition, -Zprofile-json causes a file containing the timing data to be created:

wesley@endurance:~/code/rust/rustc-serialize> cat self_profiler_results.json 
[
            {
                "category": "Parsing",
                "time_ms": 27
            },
            {
                "category": "Expansion",
                "time_ms": 132
            },
            {
                "category": "TypeChecking",
                "time_ms": 2182
            },
            {
                "category": "BorrowChecking",
                "time_ms": 144
            },
            {
                "category": "Codegen",
                "time_ms": 2065
            },
            {
                "category": "Linking",
                "time_ms": 17
            },
            {
                "category": "Other",
                "time_ms": 590
            }
        ]

@nnethercote
Contributor

One suggestion: the JSON file should contain enough data to exactly recreate the -Zself-profile output. Currently it appears to lack the information to recreate:

  • the "Self profiling results for rustc_serialize:" line;
  • the "Optimization level: No" line;
  • the "Incremental: on" line;
  • the "Queries" and "Hits" columns.

If it's possible to get the full command line that rustc was invoked with, I suggest including that in the JSON output too. That way if you end up with a bunch of different JSON files you have full data on how each one was created.

Finally, whenever you have coarse categories like this, the natural inclination is for users to want more detail. "Codegen takes 2065ms... can we break that down some more?" (I experienced this back in the early days of Firefox's about:memory.) This naturally takes us back to the idea of a tree-like structure, where you have categories, sub-categories, sub-sub-categories, etc.

@andjo403
Contributor

andjo403 commented Jun 6, 2018

A while ago I did some experiments with replacing time-passes with Chrome tracing events. I have updated my branch of rustc now and have some results posted on this page.

One sad thing is that the generated pages only work in Chrome.

For some of the events I include extra data describing the event, so it is possible to see why some parts take more time.

Maybe something similar could be used for the verbose mode.

@nikomatsakis
Contributor Author

@wesleywiser this looks really nice.

bors added a commit that referenced this issue Aug 3, 2018
Implement a self profiler

This is a work in progress implementation of #50780. I'd love feedback on the overall structure and code as well as some specific things:

- [The query categorization mechanism](https://github.com/rust-lang/rust/compare/master...wesleywiser:wip_profiling?expand=1#diff-19e0a69c10eff31eb2d16805e79f3437R101). This works but looks kind of ugly to me. Perhaps there's a better way?

- [The profiler assumes only one activity can run at a time](https://github.com/rust-lang/rust/compare/master...wesleywiser:wip_profiling?expand=1#diff-f8a403b2d88d873e4b27c097c614a236R177). This is obviously incompatible with the ongoing parallel queries.

- [The output code is just a bunch of `format!()`s](https://github.com/rust-lang/rust/compare/master...wesleywiser:wip_profiling?expand=1#diff-f8a403b2d88d873e4b27c097c614a236R91). Is there a better way to generate markdown or json in the compiler?

- [The query categorizations are likely wrong](https://github.com/rust-lang/rust/compare/master...wesleywiser:wip_profiling?expand=1#diff-19e0a69c10eff31eb2d16805e79f3437R101). I've marked what seemed obvious to me but I'm sure I got a lot of them wrong.

The overhead currently seems very low. Running `perf` on a sample compilation with profiling enabled reveals:
![image](https://user-images.githubusercontent.com/831192/41657821-9775efec-7462-11e8-9e5e-47ec94105d9d.png)
@wesleywiser wesleywiser self-assigned this Aug 15, 2018
@XAMPPRocky XAMPPRocky added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Oct 2, 2018
@wesleywiser
Member

I believe this can be closed now. We have the -Zself-profile option, tools to process that data and documentation on how to do so.
