Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree artifact up-to-dateness check can be very slow for large tree artifacts #17009

Closed
lberki opened this issue Dec 13, 2022 · 8 comments
Closed
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@lberki
Copy link
Contributor

lberki commented Dec 13, 2022

Description of the bug:

When checking whether a local action cache entry is up-to-date, it takes a long time to check actions that have large tree artifacts on their inputs. The stack trace when Bazel is working on this is:

   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(java.base@11.0.6/Native Method)
        at java.io.FileInputStream.read(java.base@11.0.6/Unknown Source)
        at com.google.common.io.ByteStreams.copy(ByteStreams.java:114)
        at com.google.common.io.ByteSource.copyTo(ByteSource.java:257)
        at com.google.common.io.ByteSource.hash(ByteSource.java:340)
        at com.google.devtools.build.lib.vfs.FileSystem.getDigest(FileSystem.java:339)
        at com.google.devtools.build.lib.unix.UnixFileSystem.getDigest(UnixFileSystem.java:452)
        at com.google.devtools.build.lib.vfs.Path.getDigest(Path.java:690)
        at com.google.devtools.build.lib.vfs.DigestUtils.manuallyComputeDigest(DigestUtils.java:194)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructFileArtifactValue(ActionMetada
taHandler.java:564)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructFileArtifactValueFromFilesyste
m(ActionMetadataHandler.java:496)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.lambda$constructTreeArtifactValueFromFi
lesystem$0(ActionMetadataHandler.java:354)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler$$Lambda$1121/0x0000000800857040.visit(Unknown Source)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:411)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:393)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructTreeArtifactValueFromFilesystem(ActionMetadataHandler.java:342)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.getTreeArtifactValue(ActionMetadataHandler.java:317)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.getMetadata(ActionMetadataHandler.java:265)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.getMetadataOrConstant(ActionCacheChecker.java:566)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.getMetadataMaybe(ActionCacheChecker.java:579)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.validateArtifacts(ActionCacheChecker.java:207)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.mustExecute(ActionCacheChecker.java:541)

My theory is that this is because the visitation happens on a single thread in TreeArtifactValue.visitTree() when called from ActionMetadataHandler.constructTreeArtifactValueFromFilesystem().

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Take this BUILD file:

touch WORKSPACE
mkdir -p r
cat > r/BUILD <<'EOF'
load(":r.bzl", "r")

r(name = "ta")

genrule(
    name = "c",
    srcs = [":ta"],
    outs = ["co"],
    cmd = "find $(location :ta) > $@",
)

sh_binary(
    name = "gen",
    srcs = ["gen.sh"],
)
EOF

cat > r/r.bzl << 'EOF'
def _r_impl(ctx):
    ta = ctx.actions.declare_directory("d")
    ctx.actions.run(
        outputs = [ta],
        inputs = [],
        executable = ctx.executable._gen,
        arguments = [ta.path],
    )
    return [DefaultInfo(files = depset([ta]))]

r = rule(
    implementation = _r_impl,
    attrs = {
        "_gen": attr.label(default = "//r:gen", executable = True, cfg = "exec"),
    },
)
EOF

cat > r/gen.sh <<'EOF'
#!/bin/bash

OUT="$1"
mkdir -p "$OUT"

for i in $(seq 1 10); do
  for j in $(seq 1 10); do
    for k in $(seq 1 100); do
      mkdir -p "$OUT/$i/$j"
      #echo "$i $j $k" > "$OUT/$i/$j/$k"
      dd if=/dev/random of="$OUT/$i/$j/$k" bs=1024 count=1024
    done
  done
done
echo hello > "$OUT/hello"
EOF

chmod +x r/gen.sh

bazel build //r:c
bazel shutdown
bazel build //r:c  # This is slow

Which operating system are you running Bazel on?

Linux @ Google

What is the output of bazel info release?

development version

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

From git commit de4746d .

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@lberki lberki added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Dec 13, 2022
@zhengwei143 zhengwei143 added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged labels Dec 20, 2022
@zhengwei143
Copy link
Contributor

@coeuvre Might this be a possible application of Loom?

@zhengwei143
Copy link
Contributor

Perhaps io_uring might also help here?

@coeuvre
Copy link
Member

coeuvre commented Jan 2, 2023

@coeuvre Might this be a possible application of Loom?

Probably, by visiting the tree concurrently using virtual threads. It's not easy to do today because we have already created many platform threads.

Perhaps io_uring might also help here?

Definitely, but integrating io_uring into Bazel is another big project.

@lberki
Copy link
Contributor Author

lberki commented Mar 31, 2023

Created #17943 as a prototype to fix this.

copybara-service bot pushed a commit that referenced this issue Apr 11, 2023
One for getting metadata for the inputs of the action and one for its outputs.

The principal difference between the two is that when the action starts
executing, the metadata for its inputs should be known, so no expensive I/O
should be needed. This would in principle allow one to not throw IOException
there. It remains for two reasons:

* Some conditions are signaled by throwing FileNotFoundException (I haven't
  investigated very deeply)
* Some implementations still do I/O for reasons unknown

At the very least, this will allow one to not throw InterruptedException in
getInputMetadata().

Work towards fixing #17009.

RELNOTES: None.
PiperOrigin-RevId: 523318267
Change-Id: Ib3af4c099a0faafb9a8b6d75d04928af47bfbcd1
copybara-service bot pushed a commit that referenced this issue Apr 11, 2023
It does not throw it yet, but this also requires declaring that exception
in a lot of places, so it's better to do this separately.

Work towards #17009.

RELNOTES: None.
PiperOrigin-RevId: 523388258
Change-Id: Idf50f39f3cfcca83aa0a8ff5b5395f80dfa26b69
BalestraPatrick pushed a commit to BalestraPatrick/bazel that referenced this issue Apr 24, 2023
This makes it possible to use every core available for checksumming, which
makes a huge difference for large tree artifacts.

Fixes bazelbuild#17009.

RELNOTES: None.
PiperOrigin-RevId: 525085502
Change-Id: I2a995d3445940333c21eeb89b4ba60887f99e51b
(cherry picked from commit 368bf11)

# Conflicts:
#	src/main/java/com/google/devtools/build/lib/skyframe/TreeArtifactValue.java
@keertk
Copy link
Member

keertk commented Apr 25, 2023

@bazel-io fork 6.2.0

keertk pushed a commit that referenced this issue Apr 25, 2023
This makes it possible to use every core available for checksumming, which
makes a huge difference for large tree artifacts.

Fixes #17009.

RELNOTES: None.
PiperOrigin-RevId: 525085502
Change-Id: I2a995d3445940333c21eeb89b4ba60887f99e51b
(cherry picked from commit 368bf11)

# Conflicts:
#	src/main/java/com/google/devtools/build/lib/skyframe/TreeArtifactValue.java

Co-authored-by: Googler <lberki@google.com>
fweikert pushed a commit to fweikert/bazel that referenced this issue May 25, 2023
One for getting metadata for the inputs of the action and one for its outputs.

The principal difference between the two is that when the action starts
executing, the metadata for its inputs should be known, so no expensive I/O
should be needed. This would in principle allow one to not throw IOException
there. It remains for two reasons:

* Some conditions are signaled by throwing FileNotFoundException (I haven't
  investigated very deeply)
* Some implementations still do I/O for reasons unknown

At the very least, this will allow one to not throw InterruptedException in
getInputMetadata().

Work towards fixing bazelbuild#17009.

RELNOTES: None.
PiperOrigin-RevId: 523318267
Change-Id: Ib3af4c099a0faafb9a8b6d75d04928af47bfbcd1
fweikert pushed a commit to fweikert/bazel that referenced this issue May 25, 2023
It does not throw it yet, but this also requires declaring that exception
in a lot of places, so it's better to do this separately.

Work towards bazelbuild#17009.

RELNOTES: None.
PiperOrigin-RevId: 523388258
Change-Id: Idf50f39f3cfcca83aa0a8ff5b5395f80dfa26b69
fweikert pushed a commit to fweikert/bazel that referenced this issue May 25, 2023
This makes it possible to use every core available for checksumming, which
makes a huge difference for large tree artifacts.

Fixes bazelbuild#17009.

RELNOTES: None.
PiperOrigin-RevId: 525085502
Change-Id: I2a995d3445940333c21eeb89b4ba60887f99e51b
@lberki
Copy link
Contributor Author

lberki commented Feb 5, 2024

We ran into this again. This time, the stack traces show:

"tree-artifact-visitor-46" #1800 [144416] daemon prio=5 os_prio=0 cpu=28027.44ms elapsed=12463.90s tid=0x00007f
b740001e70 nid=144416 runnable  [0x00007fbb95efe000]
   java.lang.Thread.State: RUNNABLE
        at com.google.devtools.build.lib.vfs.inmemoryfs.InMemoryFileSystem.stat(InMemoryFileSystem.java:360)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.statUnchecked(RemoteActionFileSystem.java:612)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.resolveOneLink(RemoteActionFileSystem.java:533)
        at com.google.devtools.build.lib.vfs.FileSystem.appendSegment(FileSystem.java:377)
        at com.google.devtools.build.lib.vfs.FileSystem.resolveSymbolicLinks(FileSystem.java:445)
<the same like ten times>
        at com.google.devtools.build.lib.vfs.FileSystem.resolveSymbolicLinks(FileSystem.java:445)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.statUnchecked(RemoteActionFileSystem.java:594)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.stat(RemoteActionFileSystem.java:582)
        at com.google.devtools.build.lib.remote.RemoteActionFileSystem.statIfFound(RemoteActionFileSystem.java:564)
        at com.google.devtools.build.lib.vfs.Path.statIfFound(Path.java:322)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.fileArtifactValueFromArtifact(ActionOutputMetadataStore.java:599)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.constructFileArtifactValue(ActionOutputMetadataStore.java:461)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.constructFileArtifactValueFromFilesystem(ActionOutputMetadataStore.java:449)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore.lambda$constructTreeArtifactValueFromFilesystem$0(ActionOutputMetadataStore.java:298)
        at com.google.devtools.build.lib.skyframe.ActionOutputMetadataStore$$Lambda/0x00000008009c39f0.visit(Unknown Source)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue$Visitor.visitTree(TreeArtifactValue.java:548)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue$Visitor.lambda$visitTree$1(TreeArtifactValue.java:551)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue$Visitor$$Lambda/0x00000008009c4508.run(Unknown Source)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)

Which is indicative of a RemoteActionFileSystem being slow, even when called on multiple threads.

@lberki lberki reopened this Feb 5, 2024
@tjgq
Copy link
Contributor

tjgq commented Feb 5, 2024

I believe there are two distinct problems here:

  1. RemoteActionFileSystem#resolveSymbolicLinks is slow because it stats paths on the underlying filesystem one component at a time and there's no cache. So if you have a tree artifact with N files and every file has M path components, collecting the TreeArtifactValue incurs in O(N*M) stats. Adding a cache is possible, but the invalidation logic is nontrivial (we cannot assume that a path statted in the past won't change in the future, because files can be deleted or moved around within an action execution).
  2. TreeArtifactValue.visitTree parallelizes over subdirectories, not individual files. So if the tree artifact has a very large subdirectory with N files (or a single flat directory, in the worst case), collecting the TreeArtifactValue is still as slow as O(N) sequential stats.

I recently did some measurements on a large tree artifact containing 240k files totaling 12GB and large subdirectories with ~4k files (internal b/323077002 has the details) and I determined the following:

  • Currently, TreeArtifactValue.visitTree takes ~40s (~160us/file)
  • Adding a "dentry cache" for resolveSymbolicLinks brings it down to ~9s (disclaimer: prototype quality, doing it properly with correct invalidation might be slower)
  • Parallelizing across files instead of subdirectories (in addition to the dentry cache) brings it down to ~6s (~25us/file)
  • However, parallelizing across files does nothing on its own (it only compounds with the other optimization)

After applying both optimizations, the trace profile shows that visitTree is able to use all of the available cores during most of its execution, suggesting that they're sufficient to avoid blocking on I/O; further gains might be possible on the CPU side, but I haven't looked into it closely.

This is on a 32-core machine with an SSD, so numbers may vary depending on available parallelism and I/O performance.

copybara-service bot pushed a commit that referenced this issue Feb 12, 2024
…mediate results in a trie.

When scanning a filesystem tree, resolveSymbolicLinks does O(M*N) work, where M is the number of components in a file path and N is the number of files. This CL makes it O(M+N) instead. This makes large output tree artifacts (~250k files) much more efficient to scan (from ~45s to ~9s in a particular example; additional optimizations are possible and will be made in a followup CL).

Related to #17009.

PiperOrigin-RevId: 606259673
Change-Id: Icf781a78b3271196e0029e3049d969a9e6073906
tjgq added a commit to tjgq/bazel that referenced this issue Feb 13, 2024
…ng intermediate results in a trie.

When scanning a filesystem tree, resolveSymbolicLinks does O(M*N) work, where M is the number of components in a file path and N is the number of files. This CL makes it O(M+N) instead. This makes large output tree artifacts (~250k files) much more efficient to scan (from ~45s to ~9s in a particular example; additional optimizations are possible and will be made in a followup CL).

Related to bazelbuild#17009.

PiperOrigin-RevId: 606259673
Change-Id: Icf781a78b3271196e0029e3049d969a9e6073906
github-merge-queue bot pushed a commit that referenced this issue Feb 13, 2024
…ng intermediate results in a trie. (#21333)

When scanning a filesystem tree, resolveSymbolicLinks does O(M*N) work,
where M is the number of components in a file path and N is the number
of files. This CL makes it O(M+N) instead. This makes large output tree
artifacts (~250k files) much more efficient to scan (from ~45s to ~9s in
a particular example; additional optimizations are possible and will be
made in a followup CL).

Related to #17009.

PiperOrigin-RevId: 606259673
Change-Id: Icf781a78b3271196e0029e3049d969a9e6073906
copybara-service bot pushed a commit that referenced this issue Feb 14, 2024
…rectories.

This performs better when the subdirectories are unbalanced (and doesn't degrade catastrophically for a flat hierarchy). Most tree artifacts are too small for this to matter, but some users have very large ones (with hundreds of thousands of files) for which this can reduce the overall traversal time by 30% or more (after other, more important optimizations such as f2512a0 have been made).

Also remove the edge case for the root directory; the code is cleaner that way.

Related to #17009.

PiperOrigin-RevId: 606897861
Change-Id: I143d55a844ac191543a856f73849a95560199468
tjgq added a commit to tjgq/bazel that referenced this issue Feb 14, 2024
…of subdirectories.

This performs better when the subdirectories are unbalanced (and doesn't degrade catastrophically for a flat hierarchy). Most tree artifacts are too small for this to matter, but some users have very large ones (with hundreds of thousands of files) for which this can reduce the overall traversal time by 30% or more (after other, more important optimizations such as bazelbuild@f2512a0 have been made).

Also remove the edge case for the root directory; the code is cleaner that way.

Related to bazelbuild#17009.

PiperOrigin-RevId: 606897861
Change-Id: I143d55a844ac191543a856f73849a95560199468
github-merge-queue bot pushed a commit that referenced this issue Feb 14, 2024
…of subdirectories. (#21347)

This performs better when the subdirectories are unbalanced (and doesn't
degrade catastrophically for a flat hierarchy). Most tree artifacts are
too small for this to matter, but some users have very large ones (with
hundreds of thousands of files) for which this can reduce the overall
traversal time by 30% or more (after other, more important optimizations
such as
f2512a0
have been made).

Also remove the edge case for the root directory; the code is cleaner
that way.

Related to #17009.

PiperOrigin-RevId: 606897861
Change-Id: I143d55a844ac191543a856f73849a95560199468
@tjgq
Copy link
Contributor

tjgq commented Feb 14, 2024

I think this has now been optimized as reasonably as it can.

@tjgq tjgq closed this as completed Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

6 participants