
Speed up parsing #2519

Merged: 8 commits merged into ninja-build:master on Nov 23, 2024
Conversation

@sesse commented Nov 1, 2024

This patch series speeds up ninja parsing (as measured by a no-op Chromium build) by about 40–50%. The main win is reducing the allocation rate by punting StringPiece allocation to an arena/bump allocator, but we also switch out the hash table and hash functions in use.

The series might seem large, but a) most of it is vendoring EmHash8 and rapidhash, and b) most of the rest is just piping an arena pointer through to the various functions and tests.
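To make the bump-allocation idea concrete, here is a minimal sketch of an arena allocator in the spirit described above. This is illustrative only, assuming a block-based design; it is not the PR's actual Arena class.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Minimal bump allocator: an allocation is usually just a pointer increment
// into a large block, and everything is freed at once with the arena.
struct Arena {
  char* Alloc(size_t size) {
    if (size > remaining_)
      NewBlock(size);
    char* p = pos_;
    pos_ += size;
    remaining_ -= size;
    return p;
  }

  // Copy a string into arena-owned storage, so callers can keep cheap
  // (pointer, length) views without one heap allocation per string.
  const char* CopyString(const char* s, size_t len) {
    char* p = Alloc(len);
    memcpy(p, s, len);
    return p;
  }

 private:
  void NewBlock(size_t min_size) {
    size_t size = std::max(min_size, next_size_);
    next_size_ *= 2;  // Grow geometrically to amortize block allocations.
    blocks_.emplace_back(new char[size]);
    pos_ = blocks_.back().get();
    remaining_ = size;
  }

  std::vector<std::unique_ptr<char[]>> blocks_;
  char* pos_ = nullptr;
  size_t remaining_ = 0;
  size_t next_size_ = 4096;
};
```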

@sesse force-pushed the master branch 2 times, most recently from f96c667 to 19396e6 on November 1, 2024 12:40
@digit-google (Contributor)

Thanks, cool ideas in there. While I didn't look into it for very long, a few remarks:

  • When adding new classes or new methods, provide unit tests for them.

  • Please use a "_" suffix when adding new class members (e.g. ins_, outs_, validations_ instead of ins, outs, validations).

  • Put third-party headers under src/third_party/xxx to make their origin clearer. Add a README.ninja file as well indicating the original URL, SPDX license tag, etc. If patches were applied, there must be some way to show that, to make future updates easier (e.g. Chromium usually provides a patches/ sub-directory that contains the patches applied to the local copy).

  • Separating these into multiple PRs will help them get accepted quicker.

  • StringPiece::Persist() should be Arena::PersistStringPiece(). The StringPiece class should not know or depend on Arena at all.

  • I don't think the log version needs to be changed if the hash is computed differently. @jhasse, what do you think?

  • Make the arena a member of the State class, which will remove 90% of the code you added to send its pointer to various layers of the code, and make everything much easier to understand (see the sketch after this list). It looks like there is no benefit to using an Arena whose lifecycle exceeds that of State, but let me know if I'm wrong.

  • Would appreciate measurements related to peak RAM usage before/after the patch is applied for large build graphs. I'll try to get these numbers myself for the Fuchsia build (which is about 10x larger than the Chromium one) next week.
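As a hedged sketch of how the PersistStringPiece and arena-in-State suggestions above could fit together (the method name comes from the bullet; the trivial Alloc() body and the State layout are placeholders of mine, not the final code):

```cpp
#include <cstddef>
#include <cstring>

struct StringPiece {
  const char* str_;
  size_t len_;
};

struct Arena {
  // Copies the bytes behind |piece| into arena-owned storage and returns a
  // StringPiece viewing the copy; StringPiece itself never sees Arena.
  StringPiece PersistStringPiece(StringPiece piece) {
    char* copy = Alloc(piece.len_);
    memcpy(copy, piece.str_, piece.len_);
    StringPiece result = { copy, piece.len_ };
    return result;
  }

  // Stand-in body; a real arena would bump-allocate out of large blocks.
  char* Alloc(size_t n) { return new char[n]; }
};

struct State {
  // Owning the arena here ties its lifetime to the build state, so no
  // arena pointer has to be threaded through the parser layers.
  Arena arena_;
};
```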

@nico (Collaborator) commented Nov 1, 2024

I was about to hit merge when I saw digit's comments. These are all small self-contained commits; I think the PR is in pretty good shape. (In particular, I don't think it necessarily needs to be split. Splitting it wouldn't hurt either, ofc, but it's more work, and I'd rather merge it in this form than not merge it at all over review back-and-forth.)

The third_party suggestion is good.

I think a hash change does need a manifest version bump.

The arena-in-State suggestion, if it works out, is also a good one.

Sending the hash map mingw fix upstream would be nice (but not a blocker).

Thanks for the PR!

@digit-google (Contributor)

Oh, and this fails to build because arena.cc was not added to either configure.py or CMakeLists.txt!

@digit-google (Contributor)

After fixing this manually, I am also seeing multiple failures in BuildLog unit-tests. Please take care of these as well.

src/arena.h (outdated):

struct Arena {
public:
static constexpr size_t kAlignTarget = sizeof(uint64_t);
@digit-google (Contributor), Nov 1, 2024

Regrettably, a static constexpr size_t member will need an out-of-line definition in the corresponding .cc file, or some compilers will complain in debug mode. But this definition seems to be used only inside Alloc(); do you really need it, i.e. could you turn it into a simple function-local variable?
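If I read this suggestion right, the alternative would look roughly like this (illustrative only, not the actual patch):

```cpp
#include <cstddef>
#include <cstdint>

struct Arena {
  char* Alloc(size_t num_bytes) {
    // A function-local constant needs no out-of-line definition, unlike a
    // static constexpr data member under pre-C++17 rules.
    const size_t kAlignTarget = sizeof(uint64_t);
    num_bytes = (num_bytes + kAlignTarget - 1) & ~(kAlignTarget - 1);
    return new char[num_bytes];  // Placeholder; the real Alloc() bump-allocates.
  }
};
```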

src/arena.cc (outdated):
size_t to_allocate = std::max(next_size_, num_bytes);

Block new_block;
new_block.mem.reset(new char[to_allocate]);
@digit-google (Contributor)

nit: Use an aligned new here, since there is technically no guarantee that this will happen (yes, I know most allocators would align to at least sizeof(size_t)). For now the arena is only used to store character sequences, so this doesn't matter, but it could become problematic if the arena is later used to allocate other things.

@sesse (Author)

That requires C++17, no? Is that allowed?

@digit-google (Contributor)

Good point. In theory you should be able to use a dedicated type with an alignas() specifier and allocate it with something like new AlignedType[(to_allocate + sizeof(AlignedType) - 1) / sizeof(AlignedType)], but this is getting ugly, and frankly there is no need for this feature here. What do you think about dropping the alignment requirement entirely and just calling this StringPieceArena for clarity?
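For reference, a C++11-compatible version of that alignas() workaround might look like the following (hypothetical; the thread ends up dropping the alignment requirement instead):

```cpp
#include <cstddef>
#include <cstdint>

// A unit type whose alignment matches the target; allocating an array of it
// yields suitably aligned raw storage without C++17's aligned new.
struct alignas(alignof(uint64_t)) AlignedUnit {
  char bytes[sizeof(uint64_t)];
};

char* AllocAligned(size_t to_allocate) {
  // Round up to a whole number of units. The caller must delete[] the result
  // as an AlignedUnit*, which is part of why this gets ugly.
  size_t units = (to_allocate + sizeof(AlignedUnit) - 1) / sizeof(AlignedUnit);
  return reinterpret_cast<char*>(new AlignedUnit[units]);
}
```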

@sesse (Author)

The arena is taken out now, so we'll deal with it in a separate PR if we get there.

(Further review threads on src/arena.cc, src/arena.h, and src/eval_env.cc were marked resolved.)
@sesse (Author) commented Nov 4, 2024

I think everything should be taken care of now, short of the alignment issue.

For the memory usage, I wonder if we should consider adding a short-string optimization to StringPiece? It should be possible to store a 15-byte string inline, without allocating anything on the arena. We could then also possibly bring back the concatenation optimization in EvalString, so that if you do AddString("foo"); AddString("bar"); you get a single short-string StringPiece with 6 bytes in it.
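For concreteness, the inline-short-string idea might look something like this (a hypothetical type, not the PR's StringPiece):

```cpp
#include <cstddef>
#include <cstring>

// Strings of up to 15 bytes are stored inside the object itself; longer
// strings stay external views (e.g. into an arena).
struct ShortOrPiece {
  ShortOrPiece(const char* s, size_t len)
      : external_(nullptr), len_(len), is_inline_(len <= sizeof(inline_buf_)) {
    if (is_inline_)
      memcpy(inline_buf_, s, len);
    else
      external_ = s;  // Caller must keep |s| alive, e.g. via an arena.
  }
  const char* data() const { return is_inline_ ? inline_buf_ : external_; }
  size_t size() const { return len_; }

 private:
  const char* external_;
  size_t len_;
  bool is_inline_;
  char inline_buf_[15];
};
```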

@digit-google (Contributor)

I think StringPiece is really designed to be a view over content (just like std::string_view). What you describe would be a completely different type, but that's an interesting idea. Maybe for another PR?

Since we're talking about StringPiece / string lookup performance, the Android Ninja fork has introduced a HashedStringView type that embeds a (pointer + 32-bit size + 32-bit hash) to speed up hash table lookups considerably (with a corresponding HashedString type, which is an std::string + 32-bit hash).

I had ported the idea on top of my fork some time ago, and using this improved things noticeably (between 5% and 11% for no-op Ninja builds, depending on the build graph). This could be another way to improve speed, at the cost of extra complexity, though, and historically Ninja maintainers have been very reluctant to accept such changes.
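A sketch of what such a type could look like, with the field layout guessed from the description above rather than taken from the Android fork's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Carrying a precomputed hash next to the view lets hash tables skip
// rehashing on lookup and reject most mismatches by comparing hashes first.
struct HashedStringView {
  const char* str;
  uint32_t size;
  uint32_t hash;

  bool operator==(const HashedStringView& other) const {
    return hash == other.hash && size == other.size &&
           memcmp(str, other.str, size) == 0;
  }
};

// Hash functor that simply returns the stored hash.
struct HashedStringViewHash {
  size_t operator()(const HashedStringView& v) const { return v.hash; }
};
```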

It would be nice to know if @nico and @jhasse would be ok with this before trying to implement these schemes.

@sesse (Author) commented Nov 4, 2024

I looked a bit more at this; the issue isn't that we have a lot of short strings (we don't). It is simply that EvalString doesn't need to live past the end of ManifestParser::ParseDefault(). So we could simply have a small arena that lives only in ManifestParser and is cleared (with memory available for reuse) at the end of ParseEdge(). It would be a little more complex, but we would probably get rid of all of the memory bloat. What do you think?
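A sketch of the narrower lifetime being proposed, under the assumption of a block-based arena whose Clear() keeps its memory for reuse (names are hypothetical):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct ReusableArena {
  char* Alloc(size_t n) {
    while (cur_ < blocks_.size()) {
      Block& b = blocks_[cur_];
      if (b.size - used_ >= n) {
        char* p = b.mem.get() + used_;
        used_ += n;
        return p;
      }
      ++cur_;     // Current block exhausted: try the next retained block.
      used_ = 0;
    }
    const size_t kBlockSize = 65536;
    size_t size = n > kBlockSize ? n : kBlockSize;
    blocks_.push_back(Block());
    blocks_.back().mem.reset(new char[size]);
    blocks_.back().size = size;
    used_ = n;
    return blocks_.back().mem.get();
  }

  // Called at the end of ParseEdge(): rewind, keeping all blocks for reuse,
  // so steady-state parsing stops growing the heap.
  void Clear() {
    cur_ = 0;
    used_ = 0;
  }

 private:
  struct Block {
    std::unique_ptr<char[]> mem;
    size_t size;
  };
  std::vector<Block> blocks_;
  size_t cur_ = 0;
  size_t used_ = 0;
};
```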

@sesse (Author) commented Nov 4, 2024

I see some autobuilders are failing, too:

  • Linux / Fedora complains that rapidhash.h has trailing spaces. Should I fix this?
  • macOS seems to build in C++03 mode, so e.g. constexpr is not allowed. This seems to conflict with the desire to use C++11 range-based for loops in this review?
  • Linux / build complains about spelling errors in third-party code. What about this one?

@digit-google (Contributor)

Thanks a ton @sesse. I have tried your latest patch with a moderate Fuchsia build plan (e.g. fx set core.x64 --with //bundles/tests) and there is a great improvement in no-op time (e.g. from 11.6s to 10.8s, i.e. -7%). On the other hand, peak RAM usage (measured with /usr/bin/time) goes from 1.7 GiB to 2.4 GiB, which is really considerable (+40%).

I wanted to see how much the non-arena-related improvements impacted performance. After a small rebase, I got a newer version of your patchset (see attached patch) that actually runs slightly faster, without increasing memory at all. So it looks like the arena is not helping after all.

0001-Remove-arena-from-pr-2519-commit-790f571.patch.zip

@digit-google (Contributor)

The macOS compilation issue seems to be a bug; this should definitely compile as C++11. I think it is entirely unrelated to this CL. This is bad :-(

For the spelling issue, the codespell invocation in .github/workflows/linux.yml needs to be updated to add a command-line option, probably something like --skip="src/third_party/*" or similar.

For linting, misc/ci.py needs to be modified to skip src/third_party as well.

@sesse (Author) commented Nov 4, 2024

I definitely see differences between the arena and non-arena versions; there's a measurement right in the first patch. But like I said, maybe the arena can do with a smaller scope/lifetime.

We can review the non-arena parts first, and then come back to it after the other stuff is in? I'm not that keen on all the re-measuring to get the commit messages right, though.

@digit-google (Contributor)

Looking at the macOS issue, it looks like CMakeLists.txt only forces C++11 when compiling the libninja sources. For anything else, e.g. tests and binaries, the compiler's defaults are used. I guess adding -DCMAKE_CXX_STANDARD=11 to the CMake configuration command might help.

@digit-google (Contributor)

Let's try to fix the third_party/ and MacOS issues in a separate PR to get it approved more quickly... I'll prepare one.

@digit-google (Contributor)

FYI: I have uploaded my rebase at https://github.com/digit-google/ninja/pull/new/sesse-ninja-pr2519-790f571-without-arena if that can help you (you do not have to use it, and I want you to get full credit for this work, just to be clear).

(Commit note:) For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.76 to 5.48 seconds.
@sesse (Author) commented Nov 4, 2024

I took out the arena (I intend to re-measure it after we have all of the other stuff in). It was my own rebase, but I looked to yours for confirmation in a couple of places, so that was useful; thanks.

@digit-google (Contributor)

Arrgh, modifying misc/ci.py is quite easy, but it seems there is no way to tell codespell to skip a given directory; only filename globs are supported in the skip: section (or the --skip command-line argument).

@jhasse (Collaborator) commented Nov 5, 2024

> one of them was Windows line endings, and I doubt rapidhash would take in such a change.

You don't know until you try :)

Where do the changes in lexer.cc come from?

Technically (unlike with e.g. the zlib license) the MIT license is infectious, meaning that we would have to distribute the copyright notice and license terms with binary distributions of ninja after this PR. 99% of people aren't aware of this and break this rule all the time, though; not sure if it really matters.
Also not sure about BSD-2-Clause.

@sesse (Author) commented Nov 5, 2024

I don't know what changed lexer.cc; I guess something in the build system does? I haven't modified it by hand. :-)

(A review thread on src/deps_log.cc was marked resolved.)
@jhasse (Collaborator) commented Nov 5, 2024

> I don't know what changed lexer.cc; I guess something in the build system does? I haven't modified it by hand. :-)

Ah yes, I think building via Python changes it in place using re2c. Can you remove that change from your commit? It's only needed when modifying lexer.in.cc.

@sesse (Author) commented Nov 5, 2024

I don't really trust hyperfine's statistics, but since you wanted comparative measurements:

Benchmark 1: ./ninja-old -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      3.097 s ±  0.010 s    [User: 2.745 s, System: 0.351 s]
  Range (min … max):    3.089 s …  3.113 s    5 runs
 
Benchmark 2: ./ninja -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      2.122 s ±  0.009 s    [User: 1.730 s, System: 0.391 s]
  Range (min … max):    2.115 s …  2.136 s    5 runs
 
Summary
  ./ninja -C ~/chromium/src/out/Default -n --quiet ran
    1.46 ± 0.01 times faster than ./ninja-old -C ~/chromium/src/out/Default -n --quiet

This is on a 5950X (Zen 3), with a fairly normal NVMe SSD and Debian unstable.

Steinar H. Gunderson added 4 commits November 5, 2024 18:11
This very often holds only a single RAW token, so we do not
need to allocate elements on an std::vector for it in the
common case.

For a no-op build of Chromium (Linux, Zen 2),
this reduces time spent from 5.48 to 5.14 seconds.

Note that this opens up a potential optimization where
EvalString::Evaluate() could just return a StringPiece, without
making a std::string out of it (which requires allocation; this is
about 5% of remaining runtime). However, this would also require
that CanonicalizePath() somehow learned to work with StringPiece
(presumably allocating a new StringPiece if and only if changes
were needed).
This is much faster than std::unordered_map, and also slightly faster
than phmap::flat_hash_map that was included in PR ninja-build#2468.
It is MIT-licensed, and we just include the .h file wholesale.

I haven't done a detailed test of all the various unordered_maps
out there, but this is the overall highest-ranking contender on

  https://martin.ankerl.com/2022/08/27/hashmap-bench-01/

except for ankerl::unordered_dense::map, which requires C++17.

For a no-op build of Chromium (Linux, Zen 2),
this reduces time spent from 5.14 to 4.62 seconds.
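To give a sense of the scale of such a swap: with a drop-in-compatible interface, it can come down to a type-alias change at the point where the map type is chosen. A hypothetical sketch (the header path and template arguments are assumptions, not the PR's literal diff):

```cpp
#include <string>
#include <unordered_map>
// #include "third_party/emhash/hash_table8.hpp"  // would provide emhash8::HashMap

// Before: the standard container.
template <typename V>
using StringHashMap = std::unordered_map<std::string, V>;

// After (hypothetically), with everything at the call sites unchanged:
// template <typename V>
// using StringHashMap = emhash8::HashMap<std::string, V>;
```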
@sesse (Author) commented Nov 5, 2024

I took out the lexer.cc changes.

@jhasse (Collaborator) commented Nov 7, 2024

Can you benchmark only the hashmap change against MSVC's std::unordered_map and libc++ (e.g. on macOS)?

@sesse (Author) commented Nov 7, 2024

I haven't really had a Windows machine since 2001 or so, so that's a bit tricky :-) I have a Mac at work, so I can run the test there next week. Or I can probably do a test with libc++ on Linux (with Clang) if that works?

@jhasse (Collaborator) commented Nov 7, 2024

Yes, and it would be even more interesting to have a direct comparison on the same system between libstdc++ and libc++ :)

@sesse (Author) commented Nov 7, 2024

Linux, still 5950X:

Benchmark 1: ./ninja-libstdc++ -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      5.262 s ±  0.095 s    [User: 4.153 s, System: 1.108 s]
  Range (min … max):    5.094 s …  5.317 s    5 runs
 
Benchmark 2: ./ninja-libc++ -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      5.192 s ±  0.101 s    [User: 4.097 s, System: 1.093 s]
  Range (min … max):    5.080 s …  5.275 s    5 runs
 
Benchmark 3: ./ninja-libstdc++-emhash8 -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      4.913 s ±  0.066 s    [User: 3.808 s, System: 1.104 s]
  Range (min … max):    4.800 s …  4.973 s    5 runs
 
Benchmark 4: ./ninja-libc++-emhash8 -C ~/chromium/src/out/Default -n --quiet
  Time (mean ± σ):      4.815 s ±  0.101 s    [User: 3.720 s, System: 1.094 s]
  Range (min … max):    4.752 s …  4.990 s    5 runs
 
Summary
  ./ninja-libc++-emhash8 -C ~/chromium/src/out/Default -n --quiet ran
    1.02 ± 0.03 times faster than ./ninja-libstdc++-emhash8 -C ~/chromium/src/out/Default -n --quiet
    1.08 ± 0.03 times faster than ./ninja-libc++ -C ~/chromium/src/out/Default -n --quiet
    1.09 ± 0.03 times faster than ./ninja-libstdc++ -C ~/chromium/src/out/Default -n --quiet

@sesse (Author) commented Nov 13, 2024

@jhasse Is there anything missing here for this to be merged?

@jhasse (Collaborator) commented Nov 15, 2024

Not from my side; I just remembered that I hate Windows anyway :D
I'm also not using ninja on such big projects, so I'm personally completely unaffected by this ...

Maybe @nico and @digit-google have some comments left though :) Let's wait a bit, then I'll merge.

@jhasse added this to the 1.13.0 milestone Nov 15, 2024
@sesse (Author) commented Nov 15, 2024

Great :-) I still have the arena PR, but we'll deal with that after this.

@digit-google (Contributor)

This is a great PR, please ship it asap :-)

For the record, I have added below my own Linux benchmarks for three versions of the Ninja sources and several compiler + C++ runtime + memory allocator combos, run against a small Fuchsia build plan, where:

  • ninja-upstream-* is the current Ninja sources.
  • ninja-sesse-* is the changes in this PR without the emhash8 changes.
  • ninja-emhash8-* is everything in this PR.

This shows that ninja-sesse-* is consistently better than ninja-upstream-*, and ninja-emhash8-* is consistently better than ninja-sesse-*, even when using custom allocators.

Benchmark 1: /tmp/ninja-upstream-clang16-glibc-stdlibc++ -C out/default -n --quiet
  Time (mean ± σ):      4.862 s ±  0.022 s    [User: 2.727 s, System: 2.124 s]
  Range (min … max):    4.834 s …  4.895 s    10 runs
 
Benchmark 2: /tmp/ninja-sesse-clang16-glibc-stdlibc++ -C out/default -n --quiet
  Time (mean ± σ):      4.503 s ±  0.018 s    [User: 2.391 s, System: 2.100 s]
  Range (min … max):    4.465 s …  4.524 s    10 runs
 
Benchmark 3: /tmp/ninja-emhash8-clang16-glibc-libstdc++ -C out/default -n --quiet
  Time (mean ± σ):      4.407 s ±  0.021 s    [User: 2.267 s, System: 2.129 s]
  Range (min … max):    4.358 s …  4.437 s    10 runs
 
Benchmark 4: /tmp/ninja-upstream-clang20-glibc-stdlibc++ -C out/default -n --quiet
  Time (mean ± σ):      4.786 s ±  0.026 s    [User: 2.665 s, System: 2.109 s]
  Range (min … max):    4.725 s …  4.808 s    10 runs
 
Benchmark 5: /tmp/ninja-sesse-clang20-glibc-stdlibc++ -C out/default -n --quiet
  Time (mean ± σ):      4.439 s ±  0.029 s    [User: 2.320 s, System: 2.107 s]
  Range (min … max):    4.398 s …  4.485 s    10 runs
 
Benchmark 6: /tmp/ninja-emhash8-clang20-glibc-libstdc++ -C out/default -n --quiet
  Time (mean ± σ):      4.307 s ±  0.018 s    [User: 2.208 s, System: 2.088 s]
  Range (min … max):    4.277 s …  4.330 s    10 runs
 
Benchmark 7: /tmp/ninja-emhash8-clang20-glibc-libc++ -C out/default -n --quiet
  Time (mean ± σ):      4.319 s ±  0.015 s    [User: 2.200 s, System: 2.107 s]
  Range (min … max):    4.287 s …  4.336 s    10 runs
 
Benchmark 8: /tmp/ninja-upstream-clang20-jemalloc-libc++ -C out/default -n --quiet
  Time (mean ± σ):      4.021 s ±  0.020 s    [User: 2.071 s, System: 1.940 s]
  Range (min … max):    3.988 s …  4.046 s    10 runs
 
Benchmark 9: /tmp/ninja-sesse-clang20-jemalloc-libc++ -C out/default -n --quiet
  Time (mean ± σ):      3.801 s ±  0.019 s    [User: 1.859 s, System: 1.931 s]
  Range (min … max):    3.772 s …  3.837 s    10 runs
 
Benchmark 10: /tmp/ninja-emhash8-clang20-jemalloc-libc++ -C out/default -n --quiet
  Time (mean ± σ):      3.733 s ±  0.015 s    [User: 1.796 s, System: 1.927 s]
  Range (min … max):    3.706 s …  3.754 s    10 runs
 
Benchmark 11: /tmp/ninja-upstream-clang20-rpmalloc-libc++ -C out/default -n --quiet
  Time (mean ± σ):      4.068 s ±  0.022 s    [User: 2.120 s, System: 1.937 s]
  Range (min … max):    4.033 s …  4.111 s    10 runs
 
Benchmark 12: /tmp/ninja-sesse-clang20-rpmalloc-libc++ -C out/default -n --quiet
  Time (mean ± σ):      3.827 s ±  0.011 s    [User: 1.882 s, System: 1.932 s]
  Range (min … max):    3.802 s …  3.843 s    10 runs
 
Benchmark 13: /tmp/ninja-emhash8-clang20-rpmalloc-libc++ -C out/default -n --quiet
  Time (mean ± σ):      3.781 s ±  0.015 s    [User: 1.820 s, System: 1.948 s]
  Range (min … max):    3.759 s …  3.799 s    10 runs
 
Benchmark 14: /tmp/ninja-upstream-gcc13-glibc-stdlibc++ -C out/default -n --quiet
  Time (mean ± σ):      4.823 s ±  0.035 s    [User: 2.666 s, System: 2.145 s]
  Range (min … max):    4.759 s …  4.865 s    10 runs
 
Benchmark 15: /tmp/ninja-sesse-gcc13-glibc-stdlibc++ -C out/default -n --quiet
  Time (mean ± σ):      4.488 s ±  0.021 s    [User: 2.305 s, System: 2.172 s]
  Range (min … max):    4.450 s …  4.513 s    10 runs
 
Benchmark 16: /tmp/ninja-emhash8-gcc13-glibc-libstdc++ -C out/default -n --quiet
  Time (mean ± σ):      4.383 s ±  0.030 s    [User: 2.203 s, System: 2.168 s]
  Range (min … max):    4.341 s …  4.424 s    10 runs
 
Summary
  /tmp/ninja-emhash8-clang20-jemalloc-libc++ -C out/default -n --quiet ran
    1.01 ± 0.01 times faster than /tmp/ninja-emhash8-clang20-rpmalloc-libc++ -C out/default -n --quiet
    1.02 ± 0.01 times faster than /tmp/ninja-sesse-clang20-jemalloc-libc++ -C out/default -n --quiet
    1.03 ± 0.01 times faster than /tmp/ninja-sesse-clang20-rpmalloc-libc++ -C out/default -n --quiet
    1.08 ± 0.01 times faster than /tmp/ninja-upstream-clang20-jemalloc-libc++ -C out/default -n --quiet
    1.09 ± 0.01 times faster than /tmp/ninja-upstream-clang20-rpmalloc-libc++ -C out/default -n --quiet
    1.15 ± 0.01 times faster than /tmp/ninja-emhash8-clang20-glibc-libstdc++ -C out/default -n --quiet
    1.16 ± 0.01 times faster than /tmp/ninja-emhash8-clang20-glibc-libc++ -C out/default -n --quiet
    1.17 ± 0.01 times faster than /tmp/ninja-emhash8-gcc13-glibc-libstdc++ -C out/default -n --quiet
    1.18 ± 0.01 times faster than /tmp/ninja-emhash8-clang16-glibc-libstdc++ -C out/default -n --quiet
    1.19 ± 0.01 times faster than /tmp/ninja-sesse-clang20-glibc-stdlibc++ -C out/default -n --quiet
    1.20 ± 0.01 times faster than /tmp/ninja-sesse-gcc13-glibc-stdlibc++ -C out/default -n --quiet
    1.21 ± 0.01 times faster than /tmp/ninja-sesse-clang16-glibc-stdlibc++ -C out/default -n --quiet
    1.28 ± 0.01 times faster than /tmp/ninja-upstream-clang20-glibc-stdlibc++ -C out/default -n --quiet
    1.29 ± 0.01 times faster than /tmp/ninja-upstream-gcc13-glibc-stdlibc++ -C out/default -n --quiet
    1.30 ± 0.01 times faster than /tmp/ninja-upstream-clang16-glibc-stdlibc++ -C out/default -n --quiet

@digit-google (Contributor)

Friendly ping for @jhasse.

Steinar H. Gunderson added 3 commits November 21, 2024 10:04
This is the currently fastest hash that passes SMHasher and does not
require special instructions (e.g. SIMD). Like emhash8, it is
liberally licensed (2-clause BSD), and we include the .h file directly.

For a no-op build of Chromium (Linux, Zen 2),
this reduces time spent from 4.62 to 4.22 seconds.
(NOTE: This is a more difficult measurement than the previous ones,
as it necessarily involves removing the entire build log and doing
a clean build. However, just switching the HashMap hash takes it
to 4.47 seconds or so.)
ftell() must go ask the kernel for the file offset, in case
someone knew the underlying file descriptor number and seeked it.
Thus, we can save a couple hundred thousand syscalls by just
caching the offset and maintaining it ourselves.

This cuts another ~170ms off a no-op Chromium build.
This cuts off another ~100 ms, most likely because the compiler
doesn't have smart enough alias analysis to do the same (trivial)
transformation.
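The ftell() caching described in the middle commit message above can be sketched as follows; the class and method names are mine, not the patch's:

```cpp
#include <cstdio>

// Track the file offset in userspace so querying the position is free; only
// the constructor pays for a real ftell(). Valid as long as nobody seeks the
// underlying descriptor behind our back, which is the caveat the commit notes.
class OffsetTrackingFile {
 public:
  explicit OffsetTrackingFile(FILE* f) : file_(f), offset_(ftell(f)) {}

  size_t Write(const void* data, size_t len) {
    size_t written = fwrite(data, 1, len, file_);
    offset_ += static_cast<long>(written);  // Maintain the offset ourselves.
    return written;
  }

  long Tell() const { return offset_; }  // Replaces repeated ftell() syscalls.

 private:
  FILE* file_;
  long offset_;
};
```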
@jhasse merged commit c7ad730 into ninja-build:master on Nov 23, 2024. 11 checks passed.