Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pants run and pants package failing with backtrack due to missing digest #19748

Open
engnatha opened this issue Sep 1, 2023 · 20 comments
Open
Assignees
Milestone

Comments

@engnatha
Copy link

engnatha commented Sep 1, 2023

Describe the bug
A little hard to capture exactly what was happening. In our CI environment, we were getting a lot of intermittent failures with a similar traceback about a failure to find a process to backtrack for a "missing digest". A repro is elusive here as we couldn't replicate on user accounts.

Here's a sample traceback from when we were had things configured to package a pex binary and then run that binary.

23:36:29.59 [WARN] Making attempt 1 to backtrack and retry `Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<5,>=4.4.0, pika<2,>=1.2.1, protobuf==3.19.5, psutil<6,>=5.8.0, pyjwt<3,>=2, pyserial<4,>=3.4, python-can<4,>=3.1.1, pyudev<1,>=0.24.0, pyyaml<6,>=5.4.1, pyzmq<26,>=25.1, requests<3,>=2.28, setuptools<66,>=65.5.1, tabulate<1,>=0.8.1, typeguard<3,>=2.12.1, types-PyYAML<=6.0.12.6,>=6.0.0, types-click<8,>=7.0.0, typing-extensions<5,>=4.1.1, yubihsm[http]<3,>=2.1`, due to missing digest Digest { hash: Fingerprint<2427f1221b498b758272f1493ffcd7750962e363d32f933b3b80cddcd3d33107>, size_bytes: 14208017 }.
23:36:29.59 [INFO] Dirtying @rule(pants.engine.process.fallible_to_exec_result_or_raise())
23:36:29.59 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex.build_pex(astranis.sidekick.automated_remote_io_test/buildbot_robin.pex))
23:36:29.59 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex.create_pex())
23:36:29.59 [INFO] Dirtying @rule(pants.backend.python.goals.package_pex_binary.built_pacakge_for_pex_from_targets_request(astranis/sidekick/automated_remote_io_test:buildbot_robin))
23:36:29.59 [INFO] Dirtying @rule(pants.core.goals.package.environment_aware_package())
23:36:29.59 [INFO] Dirtying @rule(pants.core.goals.package.package_asset())
23:36:29.59 [INFO] Dirtying Package
23:36:29.59 [INFO] Filesystem changed during run: retrying `@rule(pants.core.goals.package.package_asset())` in 500ms...
23:36:30.09 [INFO] Starting: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.09 [INFO] Completed: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.11 [INFO] Starting: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.16 [INFO] Completed: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.18 [WARN] Making attempt 1 to backtrack and retry `Building local_dists.pex`, due to missing digest Digest { hash: Fingerprint<fe8697b96cf36d668c6c41fa27b4f06f1202579e313cac5de6cae1a3040572d4>, size_bytes: 607533 }.
23:36:30.18 [INFO] Dirtying @rule(pants.engine.process.fallible_to_exec_result_or_raise())
23:36:30.18 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex.build_pex(local_dists.pex))
23:36:30.18 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex.create_pex(local_dists.pex))
23:36:30.18 [INFO] Dirtying @rule(pants.backend.python.util_rules.local_dists.build_local_dists())
23:36:30.18 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex_from_targets.create_pex_from_targets())
23:36:30.18 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex.create_pex())
23:36:30.18 [INFO] Dirtying @rule(pants.backend.python.goals.package_pex_binary.built_pacakge_for_pex_from_targets_request(astranis/sidekick/automated_remote_io_test:buildbot_robin))
23:36:30.18 [INFO] Dirtying @rule(pants.core.goals.package.environment_aware_package())
23:36:30.18 [INFO] Dirtying @rule(pants.core.goals.package.package_asset())
23:36:30.18 [INFO] Dirtying Package
23:36:30.18 [INFO] Filesystem changed during run: retrying `Process(Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<5,>=4.4.0, pika<2,>=1.2.1, protobuf==3.19.5, psutil<6,>=5.8.0, pyjwt<3,>=2, pyserial<4,>=3.4, python-can<4,>=3.1.1, pyudev<1,>=0.24.0, pyyaml<6,>=5.4.1, pyzmq<26,>=25.1, requests<3,>=2.28, setuptools<66,>=65.5.1, tabulate<1,>=0.8.1, typeguard<3,>=2.12.1, types-PyYAML<=6.0.12.6,>=6.0.0, types-click<8,>=7.0.0, typing-extensions<5,>=4.1.1, yubihsm[http]<3,>=2.1)` in 500ms...
23:36:30.69 [INFO] Starting: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.69 [INFO] Completed: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.73 [INFO] Starting: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.73 [INFO] Completed: Building 23 requirements for astranis.sidekick.automated_remote_io_test/buildbot_robin.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,... (393 characters truncated)
23:36:30.74 [INFO] Starting: Building local_dists.pex
23:36:30.74 [INFO] Completed: Building local_dists.pex
23:36:30.75 [INFO] Starting: Building local_dists.pex
23:36:30.75 [INFO] Completed: Building local_dists.pex
23:36:30.76 [INFO] Starting: Building local_dists.pex
23:36:30.76 [INFO] Completed: Building local_dists.pex
23:36:30.76 [INFO] Starting: Building local_dists.pex
23:36:30.76 [INFO] Completed: Building local_dists.pex
23:36:30.77 [INFO] Starting: Building local_dists.pex
23:36:30.77 [INFO] Completed: Building local_dists.pex
23:36:30.79 [INFO] Starting: Building local_dists.pex
23:36:30.79 [INFO] Completed: Building local_dists.pex
23:36:30.80 [INFO] Starting: Building local_dists.pex
23:36:30.80 [INFO] Completed: Building local_dists.pex
23:36:30.81 [INFO] Starting: Building local_dists.pex
23:36:30.81 [INFO] Completed: Building local_dists.pex
23:36:30.82 [ERROR] 1 Exception encountered:
Engine traceback:
  in `package` goal
IntrinsicError: Could not identify a process to backtrack to for: Missing digest: Was not present in the local store: Digest { hash: Fingerprint<faad51a6a108fba9d40b2a10e82a2646fccbaf8c3d9be47818f4bffae02d94b8>, size_bytes: 4098329 }

An identical error was raised when we ran things with just a pants run command and avoided package altogether.

pants run astranis/sidekick/automated_remote_io_test/buildbot_robin.py -- --sk-dir /buildbot/astranis-monorepo-sk-integration/build/sk
16:01:19.44 [WARN] Making attempt 1 to backtrack and retry `Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<5,>=4.4.0, pika<2,>=1.2.1, protobuf==3.19.5, psutil<6,>=5.8.0, pyjwt<3,>=2, pyserial<4,>=3.4, python-can<4,>=3.1.1, pyudev<1,>=0.24.0, pyyaml<6,>=5.4.1, pyzmq<26,>=25.1, requests<3,>=2.28, setuptools<66,>=65.5.1, tabulate<1,>=0.8.1, typeguard<3,>=2.12.1, types-PyYAML<=6.0.12.6,>=6.0.0, types-click<8,>=7.0.0, typing-extensions<5,>=4.1.1, yubihsm[http]<3,>=2.1`, due to missing digest Digest { hash: Fingerprint<839dc98b8dde47328cbdf3989d52c7e70cc352c955e4a343988e26d60e97fa5f>, size_bytes: 1132584 }.
16:01:19.44 [INFO] Dirtying @rule(pants.engine.process.fallible_to_exec_result_or_raise())
16:01:19.44 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex.build_pex(automated_remote_io_test.pex))
16:01:19.44 [INFO] Dirtying @rule(pants.backend.python.util_rules.pex.create_venv_pex())
16:01:19.44 [INFO] Dirtying @rule(pants.backend.python.goals.run_python_source.create_python_source_run_request(astranis/sidekick/automated_remote_io_test/buildbot_robin.py))
16:01:19.44 [INFO] Dirtying @rule(pants.core.goals.run.run())
16:01:19.44 [INFO] Dirtying @rule(pants.core.goals.run.run())
16:01:19.44 [INFO] Dirtying Run
16:01:19.44 [INFO] Dirtying Run
16:01:19.44 [INFO] Filesystem changed during run: retrying `@rule(pants.core.goals.run.run())` in 500ms...
16:01:19.94 [INFO] Starting: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:19.94 [INFO] Completed: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:19.97 [INFO] Starting: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:19.97 [INFO] Completed: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:20.01 [INFO] Starting: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:20.01 [INFO] Completed: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:20.03 [INFO] Starting: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:20.03 [INFO] Completed: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:20.05 [INFO] Starting: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:20.05 [INFO] Completed: Building 23 requirements for automated_remote_io_test.pex from the constraints.txt resolve: aiohttp<4,>=3.8.3, click<9.0,>=8.0, coloredlogs<16,>=15, cryptography<42.0,>=41.0, hvac<1,>=0.11.2, pexpect<... (360 characters truncated)
16:01:20.06 [ERROR] 1 Exception encountered:
Engine traceback:
  in `run` goal
IntrinsicError: Could not identify a process to backtrack to for: Missing digest: Was not present in the local store: Digest { hash: Fingerprint<b88f117ae5cf142f938fdf85baf88ea3862c73a0913ebbec8de24dc07706daff>, size_bytes: 639039 }

Pants version
2.17.0 with the 0.10.0 launcher binary

OS
Ubuntu 20.04

Additional info
Interestingly, at the recommendation of @huonw , we tried prepending PANTS_LOCAL_STORE_DIR=$(mktemp -d) <cmd> and things worked repeatedly enough when force scheduling builds.

This led us to believe that wiping the cache on that runner would make things happy. Unfortunately, the next build after resetting the cache resulted in the same errors above.

@engnatha engnatha added the bug label Sep 1, 2023
@huonw huonw added this to the 2.17.x milestone Sep 1, 2023
@engnatha
Copy link
Author

engnatha commented Sep 1, 2023

Not sure if #16667 is related, but we're using fetch for cache_content_behavior. I'd like to avoid uploading the whole config file to avoid looping in security, but let me know if there's any other values that would be interesting to know about.

@huonw
Copy link
Contributor

huonw commented Sep 1, 2023

@engnatha confirmed on slack (https://pantsbuild.slack.com/archives/C046T6T9U/p1693541874862319?thread_ts=1693439136.489919&cid=C046T6T9U) they use a remote cache (bazel-remote-cache).


I looked over the diff between 2.16.0 and 2.17.0 https://github.com/pantsbuild/pants/compare/release_2.16.0..release_2.17.0 , and particularly the changes to the "core" caching code: git diff --stat release_2.16.0 release_2.17.0 -- src/rust/engine/process_execution/ src/rust/engine/fs/store/ src/rust/engine/sharded_lmdb

From my skim, it looks like the vaguely-interesting changes are:

  1. Materialize "large" files in a new store location and hardlink them in sandboxes #18153, and associated work like Clear FSDB materialization cache when removeing a file #18747, Flush metadata to disk before renaming immutable store files #18768, Don't try and garbage collect tempfiles (Cherry-pick of #19103) #19113
  2. Expose environment information on process metadata #18458

As a weak bit of evidence, the digests mentioned in the logs have size 14208017, 607533, 4098329, 1132584, 639039, all of which are larger than the 524288 (512KiB) LARGE_FILE_SIZE_LIMIT limit above which the "FSDB" code in 1 kicks in. I'm going to assign @thejcannon based on that, but feel free to assign onwards if this is incorrect.

Full log of changes to those files
git log --oneline release_2.16.0a0..release_2.17.0 -- src/rust/engine/process_execution/ src/rust/engine/fs/store/ src/rust/engine/sharded_lmdb/

Some of these were likely cherry-picked to 2.16.0. I haven't filtered them out.

8da4c8d Have Pants create the immutable_inputs base parent(s) (Cherry-pick of #19646) (#19652)
1abb154 Don't try and garbage collect tempfiles (Cherry-pick of #19103) (#19113)
b8f0ed0 Replace Snapshot::_unsafe_create with create_for_testing. (#18995)
bee75a9 Make DigestSubset symlink-aware (#18963)
07dbcdb Remove existing entry if any when materialising symlink (#18873)
4e39b72 Update to tokio 1.28.0 (#18874)
dcdd1bc Upgrade tempfile and time crates (#18846)
7e6b5bc Use a released version of bollard (#18777)
1aca349 Get rid of spectral rust crate. (#18774)
e87c3c6 Flush metadata to disk before renaming immutable store files (#18768)
9616025 Clear FSDB materialization cache when removeing a file (#18747)
cb01cbe Remove implicit conversion from std::io::Error to StoreError (#18696)
317a09d Add timeout to remote store calls, and adjust name of cache timeout. (#18695)
1462bef Ensure that sandboxed processes exit before their sandboxes are cleaned up (#18632)
c398899 Only materialize immutable files once per process (#18600)
194862a Remove safe_create_dir(_all) in favor of stdlib implementation. (#18605)
cd832f2 Improve debug information for docker auth, and expose an environment variable. (#18599)
e69beff Set the entire env in __run.sh (#18582)
c36a988 Materialize "large" files in a new store location and hardlink them in sandboxes (#18153)
b23df09 Add support for Docker registry auth (#18541)
5ce027b Bump env_logger from 0.9.0 to 0.10.0 in /src/rust/engine (#17688)
81c2fa3 Re-enable hardlinking using fclonefileat on macOS. (#18501)
3236b4e Disable hardlinking of immutable inputs (#18487)
f8cc00b Finish parallelizing materialization of Process inputs (#18469)
15ecb9e Expose environment information on process metadata (#18458)
d88e258 Removing more unwraps from FFI layer (#18370)

@thejcannon
Copy link
Member

The FSDB changes might be a red herring, FWIW https://pantsbuild.slack.com/archives/C046T6T9U/p1672247391875929 was in Dec 2022 with a file of 35 bytes. And another in December: https://pantsbuild.slack.com/archives/C046T6T9U/p1671161642895819?thread_ts=1666215970.400249&cid=C046T6T9U

The FSDB changes were merged 5 months ago.

This particular thread from October seems relevant: https://pantsbuild.slack.com/archives/C0D7TNJHL/p1665440874737089?thread_ts=1665440874.737089&cid=C0D7TNJHL and that led to #17204

I'll CC @stuhood and it'll be a waiting game to see who has time first 😈

@thejcannon
Copy link
Member

The fact that this fix went into 2.14, and for 2.15 and 2.16 we don't see it again does hint to me that the FSDB likely either unearthed another bug, or somehow re-introduces the previous one

@stuhood
Copy link
Member

stuhood commented Sep 1, 2023

Does PANTS_LOCAL_STORE_DIR point to the default location in your config? Do you have any temp-file cleaners running on that directory, or anything that might be deleting files?

With cache_content_behavior=fetch (please definitely confirm that that is what you are using!), the only way for this to occur should be for a digest which one codepath claimed existed on disk to later (within the same process) not actually exist. The potential relation to the FSDB changes is that they write different sized blobs to different locations, and so if all codepaths aren't in alignment about where blobs go, one might see it on disk, while another looks in the wrong place and does not.

@engnatha
Copy link
Author

engnatha commented Sep 5, 2023

We do not override PANTS_LOCAL_STORE_DIR. All of this runs in github actions where we preserve the contents of ~/.pex and ~/.cache/pants.

Confirmed that we have fetch set. No other places I see in our repo. I don't see any environment variables configured to override it either.
image

@huonw huonw added the remote label Sep 6, 2023
@huonw
Copy link
Contributor

huonw commented Sep 9, 2023

so if all codepaths aren't in alignment about where blobs go

I have two theories:

  1. there might be codepath that's missing a should_use_fsdb check somewhere (I haven't tried reading the code to check this, yet)
  2. there's a codepath that adds additional conditions to the FSDB size-check one, potentially forcing a too-large digest to end up in LMDB "incorrectly":
    f_remote.is_none() && ByteStore::should_use_fsdb(entry_type, digest.size_bytes);

2 is most concrete. However, I believe the check only kicks in for loading a directory from the remote, and while it doesn't seem impossible that Directory could end up larger than 512KiB, it seems kinda implausible? (A single directory seems like it would to have tens of thousands of children, to exceed that?)

pub struct Directory {
     pub files: Vec<FileNode>,
     pub directories: Vec<DirectoryNode>,
     pub symlinks: Vec<SymlinkNode>,
     pub node_properties: Option<NodeProperties>, 
}

(That said, even if 2 is unlikely, it definitely seems like a latent bug, and so should be fixed.)

@huonw
Copy link
Contributor

huonw commented Sep 10, 2023

However, I believe the check only kicks in for loading a directory from the remote

Hm, I tried constructing a failing example for this, and couldn't: it seems like Pants will only upload ActionResult, and they only store directory Trees (via OutputDirectory). A Tree incorporates all of its children, and is downloaded but never persisted into the local store by Store::load_tree_from_remote, and there's no further downloading of Directorys... I'm a bit surprised by this, though.

(Structs are references to protobufs in protos::gen::build::bazel::remote::execution::v2)

The example I was working with which creates a 1MB+ "directory" using some really long file names, which I ran with bazel-remote running in the background:

[GLOBAL]
pants_version = "2.17.0"
backend_packages = ["pants.backend.shell"]

remote_store_address = "grpc://localhost:9092"
remote_cache_read = true
remote_cache_write = true

pantsd = false
# bump this to force a fresh local cache:
local_store_dir = "/tmp/19748-backtrack-8"
shell_command(
    name="generate",
    command="""
    mkdir -p foo/bar
    for i in $(seq 1 5000); do
       touch foo/bar/long-12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890-$i
    done
    """,
    tools=["mkdir", "touch", "seq"],
    output_directories=["foo/"],
)

shell_command(
    name="invalidator",
    # bump this to force the archive process to rebuild
    command="echo 5 > file.txt",
    output_files=["file.txt"]
)

archive(name="consume", files=[":generate", ":invalidator"], format="zip")

@huonw
Copy link
Contributor

huonw commented Sep 10, 2023

(Turns out directories are never stored in FSDB, so they're probably not a cause of the issue here, i.e. theory 2 in #19748 (comment) is wrong, and my reproduction attempt above was never going to fail. #19815 makes this more obvious.)

huonw added a commit that referenced this issue Sep 11, 2023
This adjusts `RemoteStore::download_digest_to_local` to enforce that the
`f_remote` validation function is never provided for digests that should
be stored in FSDB. This is no-functionality-change, just refactoring
things to be slightly clearer and more future-proof.

Previously, this code was conditionalising whether something should be
stored in FSDB on _more_ than just the type/size, which _could_ have led
to bugs: if we were downloading a digest that should be stored in FSDB,
but pass `f_remote = Some(...)`, `download_digest_to_local` would
instead store it in LMDB, and future look-ups of the digest (which don't
know about `f_remote`) would look in FSDB and fail to find it.

However, in practice, we only ever pass `f_remote` in
`Store::load_directory`, and directories are never stored in the FSDB,
even if they're huge, so the extra condition was not changing behaviour.
This PR is making the assumption/behaviour/luck here more explicit, and
adds a test validating the behaviour of huge directory digests too.

This came out of my investigation of #19748, and, I think, eliminates
directories as potential cause of the issues there.
@huonw
Copy link
Contributor

huonw commented Sep 20, 2023

Pants' CI has seen (at least) two instances of a similar error recently, but both with small digests (a few KB):

Traceback (most recent call last):
  File "/Users/runner/work/pants/pants/src/python/pants/bin/daemon_pants_runner.py", line 134, in single_daemonized_run
    runner = LocalPantsRunner.create(
  File "/Users/runner/work/pants/pants/src/python/pants/bin/local_pants_runner.py", line 93, in create
    build_config = options_initializer.build_config(options_bootstrapper, env)
  File "/Users/runner/work/pants/pants/src/python/pants/init/options_initializer.py", line 153, in build_config
    return _initialize_build_configuration(self._plugin_resolver, options_bootstrapper, env)
  File "/Users/runner/work/pants/pants/src/python/pants/init/options_initializer.py", line 52, in _initialize_build_configuration
    working_set = plugin_resolver.resolve(options_bootstrapper, env, backends_requirements)
  File "/Users/runner/work/pants/pants/src/python/pants/init/plugin_resolver.py", line 141, in resolve
    for resolved_plugin_location in self._resolve_plugins(options_bootstrapper, env, request):
  File "/Users/runner/work/pants/pants/src/python/pants/init/plugin_resolver.py", line 166, in _resolve_plugins
    session.product_request(ResolvedPluginDistributions, [params])[0],
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/scheduler.py", line 579, in product_request
    return self.execute(request)
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/scheduler.py", line 520, in execute
    self._raise_on_error([t for _, t in throws])
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/scheduler.py", line 504, in _raise_on_error
    raise ExecutionError(
pants.engine.internals.scheduler.ExecutionError: 1 Exception encountered:

Engine traceback:
  in select
    ..
  in pants.init.plugin_resolver.resolve_plugins
    ..

Traceback (most recent call last):
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 638, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/init/plugin_resolver.py", line 77, in resolve_plugins
    plugins_pex = await Get(
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 130, in __await__
    result = yield self
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 638, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/util_rules/pex.py", line 1041, in create_venv_pex
    venv_pex_result = await Get(BuildPexResult, PexRequest, seeded_venv_request)
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 130, in __await__
    result = yield self
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 638, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/util_rules/pex.py", line 720, in build_pex
    result = await Get(
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 130, in __await__
    result = yield self
native_engine.IntrinsicError: Could not identify a process to backtrack to for: Missing digest: Was not present in the local store: Digest { hash: Fingerprint<b772cbdee71ef0c558056bc50c9cd0a076567cdc883aede5aa156d7ad6f180d4>, size_bytes: 1287 }
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 626, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/core/goals/test.py", line 874, in run_tests
    results = await MultiGet(
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 361, in MultiGet
    return await _MultiGet(tuple(__arg0))
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 168, in __await__
    result = yield self.gets
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 626, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/goals/pytest_runner.py", line 505, in run_python_tests
    setup = await Get(
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 118, in __await__
    result = yield self
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 626, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/goals/pytest_runner.py", line 256, in setup_pytest_for_target
    transitive_targets, plugin_setups = await MultiGet(
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 512, in MultiGet
    return await _MultiGet((__arg0, __arg1))
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 168, in __await__
    result = yield self.gets
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 626, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/graph.py", line 749, in transitive_targets
    dependency_mapping = await Get(_DependencyMapping, _DependencyMappingRequest(request, True))
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 118, in __await__
    result = yield self
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 626, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/graph.py", line 699, in transitive_dependency_mapping
    direct_dependencies = await MultiGet(  # noqa: PNT30: this is inherently sequential
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 361, in MultiGet
    return await _MultiGet(tuple(__arg0))
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 168, in __await__
    result = yield self.gets
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 626, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/graph.py", line 1326, in resolve_dependencies
    inferred = await MultiGet(
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 361, in MultiGet
    return await _MultiGet(tuple(__arg0))
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 168, in __await__
    result = yield self.gets
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 626, in native_engine_generator_send
    res = rule.send(arg) if err is None else rule.throw(throw or err)
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/dependency_inference/rules.py", line 460, in infer_python_dependencies_via_source
    parsed_dependencies = await _exec_parse_deps(request.field_set, python_setup)
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/dependency_inference/rules.py", line 365, in _exec_parse_deps
    resp = await Get(
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 118, in __await__
    result = yield self
native_engine.IntrinsicError: Could not identify a process to backtrack to for: Missing digest: Was not present in the local store: Digest { hash: Fingerprint<610fcb1e611f8f633f3f7dd85df614162b7ed3c767b06b0b3fd2fa698dcd4a9b>, size_bytes: 7902 }

@engnatha
Copy link
Author

Just reporting that we got this again on 2.18.0

@huonw huonw modified the milestones: 2.17.x, 2.18.x Nov 26, 2023
@trhodeos
Copy link
Contributor

I was seeing this on 2.18.0, but someone at our company realized that we were seeing a lot of "quarantines" by SentinelOne, which our company pre-installs on macs (not linux). I suspect it was seeing pants forking a lot of subprocesses, and thought it was a virus? Not 100% sure why it picked up on pants as a virus. It was doing this quarantining quietly which made it annoying to debug.

We "fixed" it by adding pants + related files to the exclusion list in SentinelOne about a week ago. I've haven't seen this since the "fix" went in, whereas before it'd happen pretty frequently on pants lint ::

@engnatha
Copy link
Author

Interesting. We also use sentinelone internally.

@benjyw
Copy link
Contributor

benjyw commented Jan 12, 2024

Aha! @engnatha can you report back here if this solves things?

@huonw huonw modified the milestones: 2.18.x, 2.19.x Mar 4, 2024
@huonw huonw modified the milestones: 2.19.x, 2.20.x Jun 4, 2024
@benjyw
Copy link
Contributor

benjyw commented Jun 4, 2024

Hi @engnatha , wondering if you have an update on this? I'm seeing something similar locally now and definitely not related to sentinelone...

@engnatha
Copy link
Author

engnatha commented Jun 6, 2024

We actually haven't seen this in a while. I don't believe our security team ever got around to enabling allow list exceptions for pants-managed files.

@huonw huonw removed this from the 2.20.x milestone Aug 1, 2024
@huonw huonw added this to the 2.21.x milestone Aug 1, 2024
@kuza55
Copy link

kuza55 commented Sep 10, 2024

I am also still seeing this on 2.22.0a0

12:48:30.91 [INFO] Starting: Building 50 requirements for requirements.pex from the 3rdparty/python/default.lock resolve: absl-py==2.1.0; python_version >= "3.7", aiohttp==3.10.5; python_version >= "3.8", altair==5.4.1; python_ve... (2515 characters truncated)
12:48:47.34 [INFO] Completed: Building 50 requirements for requirements.pex from the 3rdparty/python/default.lock resolve: absl-py==2.1.0; python_version >= "3.7", aiohttp==3.10.5; python_version >= "3.8", altair==5.4.1; python_ve... (2515 characters truncated)
12:48:47.39 [INFO] Starting: Building requirements_venv.pex
12:48:47.41 [INFO] Starting: Run setuptools_scm for libs:_version
12:48:48.05 [INFO] Completed: Run setuptools_scm for libs:_version
12:48:57.86 [INFO] Completed: Building requirements_venv.pex
12:48:57.94 �[31m[ERROR]�[0m 1 Exception encountered:

Engine traceback:
  in `check` goal

IntrinsicError: Could not identify a process to backtrack to for: Missing digest: Was not present in the local store: Digest { hash: Fingerprint<a8ad0b3f6dacaab4532190f101b3fb679ee128480ec8f22139e922461dc8b189>, size_bytes: 36 }

@kuza55
Copy link

kuza55 commented Sep 11, 2024

FYI pkill pantsd resolves this.

@tgolsson
Copy link
Contributor

Just observed this failure on CI on 2.22.0. Using remote cache. I wonder if build cancellation is a trigger here where we end up with a partial write of something/things?

@huonw huonw modified the milestones: 2.21.x, 2.22.x Nov 2, 2024
tgolsson added a commit that referenced this issue Nov 14, 2024
I don't think there's much information to be had here, fundamentally,
but on the off-chance there is it would be interesting.

Relates to: #19748, which
we've started seeing quite regularly at work.
@tgolsson
Copy link
Contributor

tgolsson commented Nov 19, 2024

Quite interesting stats for a failure on our CI... we made one cache check; missed, and then exploded.

native_engine.IntrinsicError: Could not identify a process to backtrack to for: Missing digest: Was not present in the local store: Digest { hash: Fingerprint<56d1d0fd32193264fdb2593cd72e46018c42a6de1a6b9e678b010dd6a94ace83>, size_bytes: 1169 }
 backtrack_attempts: 0
  docker_execution_errors: 0
  docker_execution_requests: 0
  docker_execution_successes: 0
  local_cache_read_errors: 0
  local_cache_requests: 1
  local_cache_requests_cached: 0
  local_cache_requests_uncached: 1
  local_cache_total_time_saved_ms: 0
  local_cache_write_errors: 0
  local_execution_requests: 0
  local_process_total_time_run_ms: 0
  remote_cache_read_errors: 0
  remote_cache_request_timeouts: 0
  remote_cache_requests: 0
  remote_cache_requests_cached: 0
  remote_cache_requests_uncached: 0
  remote_cache_speculation_local_completed_first: 0
  remote_cache_speculation_remote_completed_first: 0
  remote_cache_total_time_saved_ms: 0
  remote_cache_write_attempts: 0
  remote_cache_write_errors: 0
  remote_cache_write_successes: 0
  remote_execution_errors: 0
  remote_execution_requests: 0
  remote_execution_rpc_errors: 0
  remote_execution_rpc_execute: 0
  remote_execution_rpc_retries: 0
  remote_execution_rpc_wait_execution: 0
  remote_execution_success: 0
  remote_execution_timeouts: 0
  remote_process_total_time_run_ms: 0
  remote_store_exists_attempts: 0
  remote_store_exists_errors: 0
  remote_store_exists_successes: 0
  remote_store_missing_digest: 0
  remote_store_read_attempts: 0
  remote_store_read_cached: 0
  remote_store_read_errors: 0
  remote_store_read_uncached: 0
  remote_store_request_timeouts: 0
  remote_store_write_attempts: 0
  remote_store_write_errors: 0
  remote_store_write_successes: 0

And:

Summary of `local_store_read_blob_size` observation histogram:
  min: 1169
  max: 1169
  mean: 1169.000
  std dev: 0.000
  total observations: 1
  sum: 1169
  p25: 1169
  p50: 1169
  p75: 1169
  p90: 1169
  p95: 1169
  p99: 1169
Summary of `local_store_read_blob_time_micros` observation histogram:
  min: 60
  max: 60
  mean: 60.000
  std dev: 0.000
  total observations: 1
  sum: 60
  p25: 60
  p50: 60
  p75: 60
  p90: 60
  p95: 60
  p99: 60

So it seems to indicate that we did read 1169 bytes in 60 us, but we failed somewhere else? Also, this error message implies that we don't have a remote store; but we definitely have one configured...

      let remote = &self.remote.clone().ok_or_else(|| {
            StoreError::MissingDigest(
                "Was not present in the local store".to_owned(),
                *missing_file_digests.iter().next().unwrap(),
            )
        })?;

This seems like a smoking gun to me. @huonw you did a bunch of refactors of this code a while back (around the same time I added mtls support) - any obvious things to check from your end? Or do you think this is an expected error/red herring?

@huonw huonw modified the milestones: 2.22.x, 2.23.x Dec 17, 2024
@huonw huonw modified the milestones: 2.23.x, 2.24.x Feb 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants