Speed up parsing #2519
(Force-pushed from f96c667 to 19396e6.)
Thanks, cool ideas in there. While I didn't look into it for very long, a few remarks:
I was about to hit merge when I saw digit's comments. These are all small, self-contained commits; I thought the PR was in pretty good shape. (In particular, I don't think it necessarily needs to be split. Splitting it wouldn't hurt either, ofc, but it's more work, and I'd rather merge it in this form than not merge it at all over review back-and-forth.) The third_party suggestion is good. I think a hash change does need a manifest version bump. If the arena
Sending the hash map mingw fix upstream would be nice (but not a blocker). Thanks for the PR!
Oh, and this fails to build because I didn't add |
After fixing this manually, I am also seeing multiple failures in the BuildLog unit tests. Please take care of these as well.
src/arena.h (outdated):

```cpp
struct Arena {
 public:
  static constexpr size_t kAlignTarget = sizeof(uint64_t);
```
Regrettably, a static constexpr size_t member will need an empty definition in the corresponding .cc file, or some compilers will complain in debug mode. But this definition seems to only be used inside of Alloc(); do you really need it, i.e. could you turn it into a simple function-local variable?
src/arena.cc (outdated):

```cpp
size_t to_allocate = std::max(next_size_, num_bytes);

Block new_block;
new_block.mem.reset(new char[to_allocate]);
```
nit: Use an aligned new here, since there is technically no guarantee that this will happen (yes, I know most allocators use sizeof(size_t) as a minimum alignment). For now the arena is only used to store character sequences, so this doesn't matter, but it could become problematic if the arena is later used to allocate other things.
That requires C++17, no? Is that allowed?
Good point. In theory you should be able to use a dedicated type with an alignas() specifier, and allocate with new AlignedType[(to_allocate + sizeof(AlignedType) - 1) & ~sizeof(AlignedType)], but this is getting ugly, and frankly there is no need for this feature here. What do you think about dropping the alignment requirement entirely, and just calling this StringPieceArena for clarity?
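For the record, the alignas() workaround can be written in C++11. The following is only an illustrative sketch (AlignedChunk and AllocAligned are made-up names, not ninja code); the round-up is spelled out as a division, since the bit-mask in the expression quoted above would not round up to a multiple of the chunk size. Note that pre-C++17 array-new only respects alignments up to alignof(max_align_t), which 16 bytes typically satisfies on x86-64:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte-aligned chunk type; not part of ninja.
struct alignas(16) AlignedChunk { char bytes[16]; };

// Allocate at least `to_allocate` bytes with 16-byte alignment, C++11-only.
char* AllocAligned(size_t to_allocate) {
  // Round the byte count up to a whole number of chunks.
  size_t chunks =
      (to_allocate + sizeof(AlignedChunk) - 1) / sizeof(AlignedChunk);
  return reinterpret_cast<char*>(new AlignedChunk[chunks]);
}
```

A real arena would also have to delete[] through an AlignedChunk pointer; the sketch omits ownership handling entirely.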
The arena is taken out now, so we'll deal with it in a separate PR if we get there.
I think everything should be taken care of now, short of the alignment issue. For the memory usage, I wonder if we should consider adding a short-string optimization to StringPiece? It should be possible to hold a 15-byte string inline, without allocating anything on the arena. We could then also possibly bring back the concatenation optimization in EvalString, so that if you do AddString("foo"); AddString("bar"); you get a single short-string StringPiece with six bytes in it.
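The short-string idea could look roughly like this. SmallStringPiece is a hypothetical stand-in, not ninja's actual StringPiece: strings of up to 15 bytes are copied into the object itself, while longer strings keep an external (e.g. arena-owned) pointer that the caller must keep alive:

```cpp
#include <cassert>
#include <cstring>
#include <string>

class SmallStringPiece {
 public:
  SmallStringPiece(const char* s, size_t len) {
    if (len <= 15) {
      memcpy(rep_.in.buf, s, len);  // inline: no allocation at all
      rep_.in.len = static_cast<unsigned char>(len);
      is_inline_ = true;
    } else {
      rep_.out.ptr = s;  // external: caller owns the bytes
      rep_.out.len = len;
      is_inline_ = false;
    }
  }
  bool inlined() const { return is_inline_; }
  std::string AsString() const {
    return is_inline_ ? std::string(rep_.in.buf, rep_.in.len)
                      : std::string(rep_.out.ptr, rep_.out.len);
  }

 private:
  union Rep {
    struct { const char* ptr; size_t len; } out;  // long-string form
    struct { char buf[15]; unsigned char len; } in;  // inline form
  } rep_;
  bool is_inline_;
};
```

Concatenating two short pieces would then just be a memcpy into the inline buffer, which is what would make the EvalString concatenation optimization cheap again.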
Since we're talking about StringPiece / string lookup performance: the Android Ninja fork has introduced a HashedStringView type that embeds a (pointer + 32-bit size + 32-bit hash) to speed up hash table lookups considerably (with a corresponding HashedString type, which is an std::string + 32-bit hash). I had ported the idea on top of my fork some time ago, and using it improved things noticeably (between 5% and 11% for no-op Ninja builds, depending on the build graph). This could be another way to improve speed, at the cost of extra complexity, though, and historically Ninja maintainers have been very reluctant about such changes. It would be nice to know if @nico and @jhasse would be OK with this before trying to implement these schemes.
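The scheme could be sketched as follows. All names here are assumptions, and the FNV-1a hash is a stand-in; the Android fork's real implementation differs. The point is that the hash is computed once at construction, so the hash table's hasher never rescans the key bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>

struct HashedStringView {
  const char* data;
  uint32_t size;
  uint32_t hash;  // computed once, reused for every table probe

  explicit HashedStringView(const char* s)
      : data(s), size(static_cast<uint32_t>(strlen(s))), hash(Fnv1a(s, size)) {}

  static uint32_t Fnv1a(const char* s, uint32_t n) {
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < n; ++i)
      h = (h ^ static_cast<uint8_t>(s[i])) * 16777619u;
    return h;
  }
  bool operator==(const HashedStringView& o) const {
    return size == o.size && memcmp(data, o.data, size) == 0;
  }
};

// The table hasher just returns the stored hash instead of hashing the bytes.
struct ViewHasher {
  size_t operator()(const HashedStringView& v) const { return v.hash; }
};
```

An std::unordered_map<HashedStringView, int, ViewHasher> then only pays for byte comparisons on probes, never for re-hashing the key.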
I looked a bit more at this; the issue isn't that we have a lot of short strings (we don't). It is simply that EvalString doesn't need to live past the end of ManifestParser::ParseDefault(). So we could simply have a small arena that lives only in ManifestParser, and is cleared (with memory available for reuse) at the end of ParseEdge(). It would be a little more complex, but we would probably get rid of all of the memory bloat. What do you think?
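A minimal sketch of such a parser-scoped arena (illustrative only; a single fixed block, where a real implementation would chain blocks): Clear() rewinds the bump pointer without freeing anything, so the same memory is reused by the next ParseEdge() call.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

class ScopedArena {
 public:
  explicit ScopedArena(size_t capacity)
      : buf_(new char[capacity]), cap_(capacity), used_(0) {}

  // Bump allocation; returns nullptr when the block is exhausted.
  char* Alloc(size_t n) {
    if (used_ + n > cap_) return nullptr;
    char* p = buf_.get() + used_;
    used_ += n;
    return p;
  }

  // Rewind: the backing block stays allocated and is handed out again
  // by subsequent Alloc() calls.
  void Clear() { used_ = 0; }
  size_t used() const { return used_; }

 private:
  std::unique_ptr<char[]> buf_;
  size_t cap_;
  size_t used_;
};
```

ManifestParser would own one of these and call Clear() at the end of each ParseEdge(), so peak memory stays at roughly one edge's worth of strings.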
I see some autobuilders are failing, too:
Thanks a ton @sesse. I have tried your latest patch with a moderate Fuchsia build plan. I wanted to see how much the non-arena-related improvements impacted performance, and after a small rebase I got a newer version of your patchset (see attached patch) that actually runs slightly faster, without increasing memory at all. So it looks like the arena is not helping after all.
The macOS compilation issue seems to be a bug; this should definitely compile as C++11. I think this is entirely unrelated to this CL. This is bad :-(
For the spelling issue, the codespell invocation in
For linting,
I definitely see differences between the arena and non-arena versions; there's a measurement right at the first patch. But like I said, maybe the arena can do with a smaller scope/lifetime. We can review the non-arena parts first, and then come back to it after the other stuff is in? I'm not that keen on all the re-measuring to get the commit messages right, though.
Looking at the MacOS issue, it looks like the |
Let's try to fix the |
FYI: I have uploaded my rebase at https://github.com/digit-google/ninja/pull/new/sesse-ninja-pr2519-790f571-without-arena if that helps (you do not have to use it, and I want you to get full credit for this work, just to be clear).
For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.76 to 5.48 seconds.
I took out the arena (I intend to re-measure it after we have all of the other stuff in). It was my own rebase, but I looked to yours for confirmation in a couple of places, so that was useful; thanks.
Arrgh, modifying |
You don't know until you try :) Where do the changes in lexer.cc come from? Technically (unlike with e.g. the zlib license) the MIT license is infectious, meaning that we would have to distribute the copyright and the license terms with binary distributions of ninja after this PR. 99% of people aren't aware of this and break this rule all the time though, not sure if it really matters.
I don't know what changed lexer.cc, I guess something in the build system does? I haven't modified it by hand. :-) |
Ah yes, I think building via Python changes it in-place using re2c. Can you remove that change from your commit? Only needed when modifying lexer.in.cc. |
I don't really trust hyperfine's statistics, but since you wanted comparative measurements:
This is on a 5950X (Zen 3), with a fairly normal NVMe SSD and Debian unstable.
This very often holds only a single RAW token, so we do not need to allocate elements in an std::vector for it in the common case. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.48 to 5.14 seconds. Note that this opens the door to a potential optimization where EvalString::Evaluate() could just return a StringPiece, without making an std::string out of it (which requires allocation; this is about 5% of remaining runtime). However, this would also require that CanonicalizePath() somehow learned to work with StringPiece (presumably allocating a new StringPiece if and only if changes were needed).
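The single-token fast path from this commit message could be sketched like so. TokenList is a made-up stand-in for EvalString, and only illustrates the shape of the optimization: the first token lives inline, and the vector is touched only when a second token arrives.

```cpp
#include <cassert>
#include <string>
#include <vector>

class TokenList {
 public:
  void Add(const std::string& t) {
    if (!has_first_) {  // common case: single token, no vector growth
      first_ = t;
      has_first_ = true;
      return;
    }
    rest_.push_back(t);  // rare case: fall back to the vector
  }
  std::string Evaluate() const {
    std::string out = has_first_ ? first_ : std::string();
    for (const auto& t : rest_) out += t;
    return out;
  }
  bool vector_used() const { return !rest_.empty(); }

 private:
  std::string first_;
  bool has_first_ = false;
  std::vector<std::string> rest_;
};
```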
This is much faster than std::unordered_map, and also slightly faster than phmap::flat_hash_map that was included in PR ninja-build#2468. It is MIT-licensed, and we just include the .h file wholesale. I haven't done a detailed test of all the various unordered_maps out there, but this is the overall highest-ranking contender on https://martin.ankerl.com/2022/08/27/hashmap-bench-01/ except for ankerl::unordered_dense::map, which requires C++17. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.14 to 4.62 seconds.
I took out the lexer.cc changes.
Can you benchmark only the hash map change against MSVC's std::unordered_map and libc++ (e.g. on macOS)?
I haven't really had a Windows machine since 2001 or so, so that's a bit tricky :-) I have a Mac at work, so I can make the test there next week. Or I can probably make a test with libc++ on Linux (with Clang) if that works?
Yes, would be even more interesting to have a direct comparison on the same system between libstdc++ and libc++ :) |
Linux, still 5950X:
@jhasse Is there anything missing here for this to be merged?
Not from my side, I just remembered that I hate Windows anyway :D Maybe @nico and @digit-google have some comments left though :) Let's wait a bit, then I'll merge.
Great :-) I still have the arena PR, but we'll deal with that after this.
This is a great PR, please ship it asap :-) For the record, I have added below my own Linux benchmarks for three versions of the Ninja sources, and several compiler + c++ runtime and memory allocator combos, run against a small Fuchsia build plan, where:
This shows that
Friendly ping for @jhasse.
This is the currently fastest hash that passes SMHasher and does not require special instructions (e.g. SIMD). Like emhash8, it is liberally licensed (2-clause BSD), and we include the .h file directly. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 4.62 to 4.22 seconds. (NOTE: This is a more difficult measurement than the previous ones, as it necessarily involves removing the entire build log and doing a clean build. However, just switching the HashMap hash takes it to 4.47 seconds or so.)
ftell() must go ask the kernel for the file offset, in case someone knew the underlying file descriptor number and seeked it. Thus, we can save a couple hundred thousand syscalls by just caching the offset and maintaining it ourselves. This cuts another ~170ms off a no-op Chromium build.
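The offset-caching idea from this commit message might be sketched as a thin wrapper (OffsetCachedFile is a hypothetical name, not ninja's actual code): since we are the only ones reading through this FILE*, simple bookkeeping can replace the ftell() syscall entirely.

```cpp
#include <cassert>
#include <cstdio>

class OffsetCachedFile {
 public:
  explicit OffsetCachedFile(FILE* f) : f_(f), offset_(0) {}

  size_t Read(void* buf, size_t n) {
    size_t got = fread(buf, 1, n, f_);
    offset_ += got;  // bookkeeping stands in for a later ftell() call
    return got;
  }

  // No kernel round-trip: valid as long as nobody else seeks the stream.
  long Tell() const { return static_cast<long>(offset_); }

 private:
  FILE* f_;
  size_t offset_;
};
```

The invariant only holds if no one seeks the underlying descriptor behind our back, which is exactly the edge case the commit message says ftell() exists to handle.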
This cuts off another ~100 ms, most likely because the compiler doesn't have smart enough alias analysis to do the same (trivial) transformation.
This patch series speeds up ninja parsing (as measured by a no-op Chromium build) by about 40–50%. The main win is reducing the allocation rate by punting StringPiece allocation to an arena/bump allocator, but we also switch out the hash table and hash functions in use.
The series might seem large, but a) most of it is vendoring EmHash8 and rapidhash, and b) most of the rest is just piping an arena pointer through to the various functions and tests.