Tree artifact up-to-dateness check can be very slow for large tree artifacts #17009
Comments
@coeuvre Might this be a possible application of Loom?
Perhaps io_uring might also help here?
Probably, by visiting the tree concurrently using virtual threads. It's not easy to do today because we have already created many platform threads.
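As a rough illustration of "visiting the tree concurrently", here is a minimal sketch that schedules one task per directory on a caller-supplied executor. It is not Bazel code: the class `ConcurrentTreeVisitor` and its `listFiles` method are hypothetical names chosen for this example. The executor is left pluggable on purpose; on Java 21 or later one could pass `Executors.newVirtualThreadPerTaskExecutor()`, while a fixed pool of platform threads works on older JDKs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Bazel): collects all non-directory entries under a root. */
final class ConcurrentTreeVisitor {
  private final ExecutorService executor;
  private final Set<Path> files = ConcurrentHashMap.newKeySet();
  // Number of directory tasks that were submitted but have not finished yet.
  private final AtomicInteger pending = new AtomicInteger();
  private final CompletableFuture<Void> done = new CompletableFuture<>();

  private ConcurrentTreeVisitor(ExecutorService executor) {
    this.executor = executor;
  }

  /** Blocks until the whole tree is listed; I/O failures surface as CompletionException. */
  static Set<Path> listFiles(Path root, ExecutorService executor) {
    ConcurrentTreeVisitor visitor = new ConcurrentTreeVisitor(executor);
    visitor.submit(root);
    visitor.done.join();
    return visitor.files;
  }

  private void submit(Path dir) {
    // Increment before scheduling so the counter cannot reach zero while work remains.
    pending.incrementAndGet();
    executor.execute(() -> {
      try {
        visit(dir);
      } catch (Throwable t) {
        done.completeExceptionally(t);
      } finally {
        if (pending.decrementAndGet() == 0) {
          done.complete(null); // no-op if already completed exceptionally
        }
      }
    });
  }

  private void visit(Path dir) {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
          submit(entry); // child directories are visited by their own task
        } else {
          files.add(entry);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Tasks never wait for their children, so a small fixed pool cannot deadlock; completion is detected through the pending-task counter instead.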
Definitely, but integrating io_uring into Bazel is another big project.
Created #17943 as a prototype to fix this.
One for getting metadata for the inputs of the action and one for its outputs.

The principal difference between the two is that when the action starts executing, the metadata for its inputs should be known, so no expensive I/O should be needed. This would in principle allow one to not throw IOException there. It remains for two reasons:
* Some conditions are signaled by throwing FileNotFoundException (I haven't investigated very deeply)
* Some implementations still do I/O for reasons unknown

At the very least, this will allow one to not throw InterruptedException in getInputMetadata().

Work towards fixing #17009.

RELNOTES: None.
PiperOrigin-RevId: 523318267
Change-Id: Ib3af4c099a0faafb9a8b6d75d04928af47bfbcd1
It does not throw it yet, but this also requires declaring that exception in a lot of places, so it's better to do this separately. Work towards #17009. RELNOTES: None. PiperOrigin-RevId: 523388258 Change-Id: Idf50f39f3cfcca83aa0a8ff5b5395f80dfa26b69
This makes it possible to use every core available for checksumming, which makes a huge difference for large tree artifacts. Fixes bazelbuild#17009. RELNOTES: None. PiperOrigin-RevId: 525085502 Change-Id: I2a995d3445940333c21eeb89b4ba60887f99e51b (cherry picked from commit 368bf11) # Conflicts: # src/main/java/com/google/devtools/build/lib/skyframe/TreeArtifactValue.java
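To make "use every core available for checksumming" concrete, the following is a minimal sketch, not the code from the commit. It assumes SHA-256 as the digest and reads each file fully into memory for brevity; `ParallelDigester` and `digestAll` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper (not part of Bazel): digests many files on all available cores. */
final class ParallelDigester {
  static Map<Path, byte[]> digestAll(List<Path> files) throws IOException, InterruptedException {
    // One worker per core, so checksumming is no longer bound to a single thread.
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
      List<Future<byte[]>> futures = new ArrayList<>();
      for (Path file : files) {
        futures.add(pool.submit(() -> sha256(file)));
      }
      // Collect results in input order so the returned map is deterministic.
      Map<Path, byte[]> result = new LinkedHashMap<>();
      for (int i = 0; i < files.size(); i++) {
        try {
          result.put(files.get(i), futures.get(i).get());
        } catch (ExecutionException e) {
          throw new IOException("Failed to digest " + files.get(i), e.getCause());
        }
      }
      return result;
    } finally {
      pool.shutdownNow();
    }
  }

  // Assumption for this sketch: SHA-256 over the whole file content, read in one go.
  private static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
    return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
  }
}
```

For very large files a streaming digest would be preferable to `Files.readAllBytes`; the sketch only shows how the per-file work is spread across threads.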
@bazel-io fork 6.2.0
This makes it possible to use every core available for checksumming, which makes a huge difference for large tree artifacts. Fixes #17009. RELNOTES: None. PiperOrigin-RevId: 525085502 Change-Id: I2a995d3445940333c21eeb89b4ba60887f99e51b (cherry picked from commit 368bf11) # Conflicts: # src/main/java/com/google/devtools/build/lib/skyframe/TreeArtifactValue.java Co-authored-by: Googler <lberki@google.com>
We ran into this again. This time, the stack traces show:
Which is indicative of a
I believe there are two distinct problems here:
I recently did some measurements on a large tree artifact containing 240k files totaling 12GB and large subdirectories with ~4k files (internal b/323077002 has the details) and I determined the following:
After applying both optimizations, the trace profile shows that
This is on a 32-core machine with an SSD, so numbers may vary depending on available parallelism and I/O performance.
…mediate results in a trie. When scanning a filesystem tree, resolveSymbolicLinks does O(M*N) work, where M is the number of components in a file path and N is the number of files. This CL makes it O(M+N) instead. This makes large output tree artifacts (~250k files) much more efficient to scan (from ~45s to ~9s in a particular example; additional optimizations are possible and will be made in a followup CL). Related to #17009. PiperOrigin-RevId: 606259673 Change-Id: Icf781a78b3271196e0029e3049d969a9e6073906
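The idea of caching intermediate results in a trie can be sketched as follows. This is an illustration under stated assumptions, not the actual resolveSymbolicLinks change: `CachingSymlinkResolver` is a hypothetical class, paths are assumed to be relative and free of ".." components, and the `filesystemLookups` counter exists only to make the saving visible. Each trie node stores the resolved location of one path prefix, so files sharing a directory trigger a filesystem check for that directory only once.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Bazel): resolves symlinks with a per-prefix cache. */
final class CachingSymlinkResolver {
  /** One trie node per path component; it remembers where that prefix really lives. */
  private static final class Node {
    final Path resolved;
    final Map<String, Node> children = new HashMap<>();

    Node(Path resolved) {
      this.resolved = resolved;
    }
  }

  private final Node root;
  int filesystemLookups = 0; // illustration only: counts uncached component checks

  CachingSymlinkResolver(Path root) throws IOException {
    this.root = new Node(root.toRealPath());
  }

  /** Resolves a relative path (assumed to contain no "..") below the root. */
  Path resolve(Path relative) throws IOException {
    Node node = root;
    for (Path component : relative) {
      String name = component.toString();
      Node child = node.children.get(name);
      if (child == null) {
        // The parent is already resolved, so only this last component can be a symlink.
        filesystemLookups++;
        Path candidate = node.resolved.resolve(name);
        Path resolved = Files.isSymbolicLink(candidate) ? candidate.toRealPath() : candidate;
        child = new Node(resolved);
        node.children.put(name, child);
      }
      node = child;
    }
    return node.resolved;
  }
}
```

Without the cache, every file would re-examine each of its ancestor directories; with it, a second file in an already-visited directory costs a single additional check.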
…ng intermediate results in a trie. (#21333) When scanning a filesystem tree, resolveSymbolicLinks does O(M*N) work, where M is the number of components in a file path and N is the number of files. This CL makes it O(M+N) instead. This makes large output tree artifacts (~250k files) much more efficient to scan (from ~45s to ~9s in a particular example; additional optimizations are possible and will be made in a followup CL). Related to #17009. PiperOrigin-RevId: 606259673 Change-Id: Icf781a78b3271196e0029e3049d969a9e6073906
…rectories. This performs better when the subdirectories are unbalanced (and doesn't degrade catastrophically for a flat hierarchy). Most tree artifacts are too small for this to matter, but some users have very large ones (with hundreds of thousands of files) for which this can reduce the overall traversal time by 30% or more (after other, more important optimizations such as f2512a0 have been made). Also remove the edge case for the root directory; the code is cleaner that way. Related to #17009. PiperOrigin-RevId: 606897861 Change-Id: I143d55a844ac191543a856f73849a95560199468
…of subdirectories. (#21347) This performs better when the subdirectories are unbalanced (and doesn't degrade catastrophically for a flat hierarchy). Most tree artifacts are too small for this to matter, but some users have very large ones (with hundreds of thousands of files) for which this can reduce the overall traversal time by 30% or more (after other, more important optimizations such as f2512a0 have been made). Also remove the edge case for the root directory; the code is cleaner that way. Related to #17009. PiperOrigin-RevId: 606897861 Change-Id: I143d55a844ac191543a856f73849a95560199468
I think this has now been optimized about as far as is reasonable.
Description of the bug:
When checking whether a local action cache entry is up-to-date, it takes a long time to check actions that have large tree artifacts among their inputs. The stack trace when Bazel is working on this is:
My theory is that this is because the visitation happens on a single thread in TreeArtifactValue.visitTree() when called from ActionMetadataHandler.constructTreeArtifactValueFromFilesystem().
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Take this BUILD file:
Which operating system are you running Bazel on?
Linux @ Google
What is the output of bazel info release?
development version
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
From git commit de4746d.
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response