Speed up parsing #2519
(Force-pushed from f96c667 to 19396e6.)
Thanks, cool ideas in there. While I didn't look into it for very long, a few remarks:
I was about to hit merge when I saw digit's comments. These are all small, self-contained commits; I thought the PR was in pretty good shape. (In particular, I don't think it necessarily needs to be split. Splitting it wouldn't hurt either, ofc, but it's more work, and I'd rather merge it in this form than not merge it at all over review back-and-forth.) The third_party suggestion is good. I think a hash change does need a manifest version bump. If the arena
Sending the hash map mingw fix upstream would be nice (but not a blocker). Thanks for the PR!
Oh, and this fails to build because I didn't add |
After fixing this manually, I am also seeing multiple failures in the BuildLog unit tests. Please take care of these as well.
src/arena.h (outdated):

```cpp
struct Arena {
 public:
  static constexpr size_t kAlignTarget = sizeof(uint64_t);
```
Regrettably, a static constexpr size_t member will need an empty definition in the corresponding .cc file, or some compilers will complain in debug mode. But this definition seems to only be used inside of Alloc(); do you really need it, i.e. could you turn it into a simple function-local variable?
src/arena.cc (outdated):

```cpp
size_t to_allocate = std::max(next_size_, num_bytes);

Block new_block;
new_block.mem.reset(new char[to_allocate]);
```
nit: Use an aligned new here, since there is technically no guarantee that this will happen (yes, I know most allocators use sizeof(size_t) as a minimum alignment). For now the arena is only used to store character sequences, so this doesn't matter, but it could become problematic if the arena is later used to allocate other things.
That requires C++17, no? Is that allowed?
Good point. In theory you should be able to use a dedicated type with an alignas() specifier, and allocate with new AlignedType[(to_allocate + sizeof(AlignedType) - 1) & ~sizeof(AlignedType)], but this is getting ugly, and frankly there is no need for this feature here. What do you think about dropping the alignment requirement entirely, and just calling this StringPieceArena for clarity?
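For the record, the alignas() workaround can be written in C++11. The following is only an illustrative sketch (AlignedChunk and AllocAligned are made-up names, not ninja code); the round-up is spelled out as a division, since the bit-mask in the expression quoted above would not round up to a multiple of the chunk size. Note that pre-C++17 array-new only respects alignments up to alignof(max_align_t), which 16 bytes typically satisfies on x86-64:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte-aligned chunk type; not part of ninja.
struct alignas(16) AlignedChunk { char bytes[16]; };

// Allocate at least `to_allocate` bytes with 16-byte alignment, C++11-only.
char* AllocAligned(size_t to_allocate) {
  // Round the byte count up to a whole number of chunks.
  size_t chunks =
      (to_allocate + sizeof(AlignedChunk) - 1) / sizeof(AlignedChunk);
  return reinterpret_cast<char*>(new AlignedChunk[chunks]);
}
```

A real arena would also have to delete[] through an AlignedChunk pointer; the sketch omits ownership handling entirely.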
The arena is taken out now, so we'll deal with it in a separate PR if we get there.
I think everything should be taken care of now, short of the alignment issue. For the memory usage, I wonder if we should consider adding a short-string optimization to StringPiece? It should be possible to hold a 15-byte string inline, without allocating anything on the arena. We could then also possibly bring back the concatenation optimization in EvalString, so that if you do AddString("foo"); AddString("bar"); you get a single short-string StringPiece with six bytes in it.
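The short-string idea could look roughly like this. SmallStringPiece is a hypothetical stand-in, not ninja's actual StringPiece: strings of up to 15 bytes are copied into the object itself, while longer strings keep an external (e.g. arena-owned) pointer that the caller must keep alive:

```cpp
#include <cassert>
#include <cstring>
#include <string>

class SmallStringPiece {
 public:
  SmallStringPiece(const char* s, size_t len) {
    if (len <= 15) {
      memcpy(rep_.in.buf, s, len);  // inline: no allocation at all
      rep_.in.len = static_cast<unsigned char>(len);
      is_inline_ = true;
    } else {
      rep_.out.ptr = s;  // external: caller owns the bytes
      rep_.out.len = len;
      is_inline_ = false;
    }
  }
  bool inlined() const { return is_inline_; }
  std::string AsString() const {
    return is_inline_ ? std::string(rep_.in.buf, rep_.in.len)
                      : std::string(rep_.out.ptr, rep_.out.len);
  }

 private:
  union Rep {
    struct { const char* ptr; size_t len; } out;  // long-string form
    struct { char buf[15]; unsigned char len; } in;  // inline form
  } rep_;
  bool is_inline_;
};
```

Concatenating two short pieces would then just be a memcpy into the inline buffer, which is what would make the EvalString concatenation optimization cheap again.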
Since we're talking about StringPiece / string lookup performance: the Android Ninja fork has introduced a HashedStringView type that embeds a (pointer + 32-bit size + 32-bit hash) to speed up hash table lookups considerably (with a corresponding HashedString type, which is an std::string + 32-bit hash). I had ported the idea on top of my fork some time ago, and using it improved things noticeably (between 5% and 11% for no-op Ninja builds, depending on the build graph). This could be another way to improve speed, at the cost of extra complexity, though, and historically Ninja maintainers have been very reluctant about such changes. It would be nice to know if @nico and @jhasse would be OK with this before trying to implement these schemes.
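The scheme could be sketched as follows. All names here are assumptions, and the FNV-1a hash is a stand-in; the Android fork's real implementation differs. The point is that the hash is computed once at construction, so the hash table's hasher never rescans the key bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>

struct HashedStringView {
  const char* data;
  uint32_t size;
  uint32_t hash;  // computed once, reused for every table probe

  explicit HashedStringView(const char* s)
      : data(s), size(static_cast<uint32_t>(strlen(s))), hash(Fnv1a(s, size)) {}

  static uint32_t Fnv1a(const char* s, uint32_t n) {
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < n; ++i)
      h = (h ^ static_cast<uint8_t>(s[i])) * 16777619u;
    return h;
  }
  bool operator==(const HashedStringView& o) const {
    return size == o.size && memcmp(data, o.data, size) == 0;
  }
};

// The table hasher just returns the stored hash instead of hashing the bytes.
struct ViewHasher {
  size_t operator()(const HashedStringView& v) const { return v.hash; }
};
```

An std::unordered_map<HashedStringView, int, ViewHasher> then only pays for byte comparisons on probes, never for re-hashing the key.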
I looked a bit more at this; the issue isn't that we have a lot of short strings (we don't). It is simply that EvalString doesn't need to live past the end of ManifestParser::ParseDefault(). So we could simply have a small arena that lives only in ManifestParser, and is cleared (with memory available for reuse) at the end of ParseEdge(). It would be a little more complex, but we would probably get rid of all of the memory bloat. What do you think?
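A minimal sketch of such a parser-scoped arena (illustrative only; a single fixed block, where a real implementation would chain blocks): Clear() rewinds the bump pointer without freeing anything, so the same memory is reused by the next ParseEdge() call.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

class ScopedArena {
 public:
  explicit ScopedArena(size_t capacity)
      : buf_(new char[capacity]), cap_(capacity), used_(0) {}

  // Bump allocation; returns nullptr when the block is exhausted.
  char* Alloc(size_t n) {
    if (used_ + n > cap_) return nullptr;
    char* p = buf_.get() + used_;
    used_ += n;
    return p;
  }

  // Rewind: the backing block stays allocated and is handed out again
  // by subsequent Alloc() calls.
  void Clear() { used_ = 0; }
  size_t used() const { return used_; }

 private:
  std::unique_ptr<char[]> buf_;
  size_t cap_;
  size_t used_;
};
```

ManifestParser would own one of these and call Clear() at the end of each ParseEdge(), so peak memory stays at roughly one edge's worth of strings.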
I see some autobuilders are failing, too:
Thanks a ton @sesse. I have tried your latest patch with a moderate Fuchsia build plan. I wanted to see how much the non-arena-related improvements impacted performance, and after a small rebase I got a newer version of your patchset (see attached patch) that actually runs slightly faster, without increasing memory at all. So it looks like the arena is not helping after all.
The macOS compilation issue seems to be a bug; this should definitely compile as C++11. I think this is entirely unrelated to this CL. This is bad :-(
For the spelling issue, the codespell invocation in
For linting,
I definitely see differences between the arena and non-arena versions; there's a measurement right at the first patch. But like I said, maybe the arena can do with a smaller scope/lifetime. We can review the non-arena parts first, and then come back to it after the other stuff is in? I'm not that keen on all the re-measuring to get the commit messages right, though.
Looking at the MacOS issue, it looks like the |
Let's try to fix the |
FYI: I have uploaded my rebase at https://github.com/digit-google/ninja/pull/new/sesse-ninja-pr2519-790f571-without-arena if that helps (you do not have to use it, and I want you to get full credit for this work, just to be clear).
For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.76 to 5.48 seconds.
I took out the arena (I intend to re-measure it after we have all of the other stuff in). It was my own rebase, but I looked to yours for confirmation in a couple of places, so that was useful; thanks.
Arrgh, modifying |
You don't know until you try :) Where do the changes in lexer.cc come from? Technically (unlike with e.g. the zlib license) the MIT license is infectious, meaning that we would have to distribute the copyright and the license terms with binary distributions of ninja after this PR. 99% of people aren't aware of this and break this rule all the time though, not sure if it really matters.
I don't know what changed lexer.cc, I guess something in the build system does? I haven't modified it by hand. :-) |
Ah yes, I think building via Python changes it in-place using re2c. Can you remove that change from your commit? Only needed when modifying lexer.in.cc. |
I don't really trust hyperfine's statistics, but since you wanted comparative measurements:
This is on a 5950X (Zen 3), with a fairly normal NVMe SSD and Debian unstable.
This very often holds only a single RAW token, so we do not need to allocate elements in an std::vector for it in the common case. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.48 to 5.14 seconds. Note that this opens the door to a potential optimization where EvalString::Evaluate() could just return a StringPiece, without making an std::string out of it (which requires allocation; this is about 5% of remaining runtime). However, this would also require that CanonicalizePath() somehow learned to work with StringPiece (presumably allocating a new StringPiece if and only if changes were needed).
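The single-token fast path from this commit message could be sketched like so. TokenList is a made-up stand-in for EvalString, and only illustrates the shape of the optimization: the first token lives inline, and the vector is touched only when a second token arrives.

```cpp
#include <cassert>
#include <string>
#include <vector>

class TokenList {
 public:
  void Add(const std::string& t) {
    if (!has_first_) {  // common case: single token, no vector growth
      first_ = t;
      has_first_ = true;
      return;
    }
    rest_.push_back(t);  // rare case: fall back to the vector
  }
  std::string Evaluate() const {
    std::string out = has_first_ ? first_ : std::string();
    for (const auto& t : rest_) out += t;
    return out;
  }
  bool vector_used() const { return !rest_.empty(); }

 private:
  std::string first_;
  bool has_first_ = false;
  std::vector<std::string> rest_;
};
```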
This is much faster than std::unordered_map, and also slightly faster than phmap::flat_hash_map that was included in PR ninja-build#2468. It is MIT-licensed, and we just include the .h file wholesale. I haven't done a detailed test of all the various unordered_maps out there, but this is the overall highest-ranking contender on https://martin.ankerl.com/2022/08/27/hashmap-bench-01/ except for ankerl::unordered_dense::map, which requires C++17. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 5.14 to 4.62 seconds.
I took out the lexer.cc changes.
Can you benchmark only the hash map change against MSVC's std::unordered_map and libc++ (e.g. on macOS)?
I haven't really had a Windows machine since 2001 or so, so that's a bit tricky :-) I have a Mac at work, so I can make the test there next week. Or I can probably make a test with libc++ on Linux (with Clang) if that works?
Yes, would be even more interesting to have a direct comparison on the same system between libstdc++ and libc++ :) |
Linux, still 5950X:
@jhasse Is there anything missing here for this to be merged?
Not from my side, I just remembered that I hate Windows anyway :D Maybe @nico and @digit-google have some comments left though :) Let's wait a bit, then I'll merge.
Great :-) I still have the arena PR, but we'll deal with that after this.
This is a great PR, please ship it asap :-) For the record, I have added below my own Linux benchmarks for three versions of the Ninja sources, and several compiler + c++ runtime and memory allocator combos, run against a small Fuchsia build plan, where:
This shows that
Friendly ping for @jhasse.
This is the currently fastest hash that passes SMHasher and does not require special instructions (e.g. SIMD). Like emhash8, it is liberally licensed (2-clause BSD), and we include the .h file directly. For a no-op build of Chromium (Linux, Zen 2), this reduces time spent from 4.62 to 4.22 seconds. (NOTE: This is a more difficult measurement than the previous ones, as it necessarily involves removing the entire build log and doing a clean build. However, just switching the HashMap hash takes it to 4.47 seconds or so.)
ftell() must go ask the kernel for the file offset, in case someone knew the underlying file descriptor number and seeked it. Thus, we can save a couple hundred thousand syscalls by just caching the offset and maintaining it ourselves. This cuts another ~170ms off a no-op Chromium build.
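The offset-caching idea from this commit message might be sketched as a thin wrapper (OffsetCachedFile is a hypothetical name, not ninja's actual code): since we are the only ones reading through this FILE*, simple bookkeeping can replace the ftell() syscall entirely.

```cpp
#include <cassert>
#include <cstdio>

class OffsetCachedFile {
 public:
  explicit OffsetCachedFile(FILE* f) : f_(f), offset_(0) {}

  size_t Read(void* buf, size_t n) {
    size_t got = fread(buf, 1, n, f_);
    offset_ += got;  // bookkeeping stands in for a later ftell() call
    return got;
  }

  // No kernel round-trip: valid as long as nobody else seeks the stream.
  long Tell() const { return static_cast<long>(offset_); }

 private:
  FILE* f_;
  size_t offset_;
};
```

The invariant only holds if no one seeks the underlying descriptor behind our back, which is exactly the edge case the commit message says ftell() exists to handle.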
This cuts off another ~100 ms, most likely because the compiler doesn't have smart enough alias analysis to do the same (trivial) transformation.
This patch series speeds up ninja parsing (as measured by a no-op Chromium build) by about 40–50%. The main win is reducing the allocation rate by punting StringPiece allocation to an arena/bump allocator, but we also switch out the hash table and hash functions in use.
The series might seem large, but a) most of it is vendoring EmHash8 and rapidhash, and b) most of the rest is just piping an arena pointer through to the various functions and tests.