`pants run` and `pants package` failing with backtrack due to missing digest #19748
Not sure if #16667 is related, but we're using …
@engnatha confirmed on Slack (https://pantsbuild.slack.com/archives/C046T6T9U/p1693541874862319?thread_ts=1693439136.489919&cid=C046T6T9U) that they use a remote cache (…).

I looked over the diff between 2.16.0 and 2.17.0 (https://github.com/pantsbuild/pants/compare/release_2.16.0..release_2.17.0), and particularly the changes to the "core" caching code. From my skim, the vaguely interesting changes are: …

As a weak bit of evidence, the digests mentioned in the logs have sizes 14208017, 607533, 4098329, 1132584 and 639039 bytes, all of which are larger than the 524288-byte (512KiB) threshold (see the sketch below).

Full log of changes to those files (some of these were likely cherry-picked to 2.16.0; I haven't filtered them out): 8da4c8d Have Pants create the …
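As a rough illustration of why those sizes matter, here is a minimal, hypothetical sketch (the constant, enum, and function names are made up; this is not Pants' actual code) of a purely size-based routing rule between the LMDB and FSDB stores, applied to the digest sizes quoted above:

```rust
// Hypothetical sketch: Pants 2.17 added a file-system-backed blob store (FSDB)
// for "large" blobs; 512KiB is the threshold figure discussed in this issue.
const LARGE_FILE_THRESHOLD: usize = 512 * 1024; // 524288 bytes

#[derive(Debug, PartialEq)]
enum BlobStore {
    Lmdb, // small blobs live in the LMDB database
    Fsdb, // large blobs live as individual files on disk
}

// Assumed rule: route purely on size.
fn destination(size_bytes: usize) -> BlobStore {
    if size_bytes >= LARGE_FILE_THRESHOLD {
        BlobStore::Fsdb
    } else {
        BlobStore::Lmdb
    }
}

fn main() {
    // Digest sizes from the failing logs in this issue: all land in FSDB.
    for size in [14208017usize, 607533, 4098329, 1132584, 639039] {
        println!("{size} bytes -> {:?}", destination(size));
    }
}
```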
The FSDB changes might be a red herring, FWIW: https://pantsbuild.slack.com/archives/C046T6T9U/p1672247391875929 was in Dec 2022 with a file of 35 bytes, and there was another in December (https://pantsbuild.slack.com/archives/C046T6T9U/p1671161642895819?thread_ts=1666215970.400249&cid=C046T6T9U). The FSDB changes were merged 5 months ago.

This particular thread from October seems relevant: https://pantsbuild.slack.com/archives/C0D7TNJHL/p1665440874737089?thread_ts=1665440874.737089&cid=C0D7TNJHL and that led to #17204.

I'll CC @stuhood and it'll be a waiting game to see who has time first 😈
The fact that this fix went into 2.14, and that we don't see the issue again in 2.15 and 2.16, does hint to me that the FSDB work likely either unearthed another bug or somehow re-introduces the previous one.
Does … With …
I have two theories:

…

2 is the most concrete. However, I believe the check only kicks in for loading a directory from the remote, and while it doesn't seem impossible that …

```rust
pub struct Directory {
    pub files: Vec<FileNode>,
    pub directories: Vec<DirectoryNode>,
    pub symlinks: Vec<SymlinkNode>,
    pub node_properties: Option<NodeProperties>,
}
```

(That said, even if 2 is unlikely, it definitely seems like a latent bug, and so should be fixed.)
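To get a rough sense of whether theory 2 is even numerically plausible, here is a back-of-the-envelope sketch (the per-node overhead is an assumption, not the real protobuf encoding) of how large a `Directory` message with thousands of long-named files could get:

```rust
fn main() {
    // All figures are assumptions for illustration only.
    let files: usize = 5_000;           // number of files (matches the repro attempt below)
    let name_len: usize = 210;          // ~210-character file names
    let per_node_overhead: usize = 70;  // guessed bytes per FileNode for digest hash, size, fields

    let approx_directory_bytes = files * (name_len + per_node_overhead);
    println!(
        "~{} bytes (~{} KiB), vs the 524288-byte threshold",
        approx_directory_bytes,
        approx_directory_bytes / 1024
    );
    // Prints roughly 1.4 million bytes: comfortably above 512KiB.
}
```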
Hm, I tried constructing a failing example for this, and couldn't: it seems like Pants will only upload … (Structs are references to protobufs in ….)

The example I was working with creates a 1MB+ "directory" using some really long file names. I ran it with this `pants.toml`:

```toml
[GLOBAL]
pants_version = "2.17.0"
backend_packages = ["pants.backend.shell"]
remote_store_address = "grpc://localhost:9092"
remote_cache_read = true
remote_cache_write = true
pantsd = false
# bump this to force a fresh local cache:
local_store_dir = "/tmp/19748-backtrack-8"
```

and this `BUILD` file:

```python
shell_command(
    name="generate",
    command="""
mkdir -p foo/bar
for i in $(seq 1 5000); do
  touch foo/bar/long-12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890-$i
done
""",
    tools=["mkdir", "touch", "seq"],
    output_directories=["foo/"],
)

shell_command(
    name="invalidator",
    # bump this to force the archive process to rebuild
    command="echo 5 > file.txt",
    output_files=["file.txt"],
)

archive(name="consume", files=[":generate", ":invalidator"], format="zip")
```
(Turns out directories are never stored in FSDB, so they're probably not a cause of the issue here, i.e. theory 2 in #19748 (comment) is wrong, and my reproduction attempt above was never going to fail. #19815 makes this more obvious.)
This adjusts `RemoteStore::download_digest_to_local` to enforce that the `f_remote` validation function is never provided for digests that should be stored in FSDB. This is a no-functionality-change PR, just refactoring things to be slightly clearer and more future-proof.

Previously, this code was conditioning whether something should be stored in FSDB on _more_ than just the type/size, which _could_ have led to bugs: if we were downloading a digest that should be stored in FSDB but passed `f_remote = Some(...)`, `download_digest_to_local` would instead store it in LMDB, and future look-ups of the digest (which don't know about `f_remote`) would look in FSDB and fail to find it. However, in practice we only ever pass `f_remote` in `Store::load_directory`, and directories are never stored in FSDB even if they're huge, so the extra condition was not changing behaviour.

This PR makes the assumption/behaviour/luck here more explicit, and adds a test validating the behaviour of huge directory digests too. It came out of my investigation of #19748 and, I think, eliminates directories as a potential cause of the issues there.
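To make the invariant described above concrete, here is a hedged sketch with made-up names and signatures (the real logic lives in Pants' Rust engine and differs in detail): the FSDB decision depends only on entry type and size, and passing `f_remote` for an FSDB-bound digest is rejected up front rather than silently redirecting the blob to LMDB.

```rust
#[derive(Clone, Copy, PartialEq)]
enum EntryType {
    File,
    Directory,
}

const LARGE_FILE_THRESHOLD: u64 = 512 * 1024;

// Assumed rule: only large *file* blobs go to FSDB; directories never do.
fn is_fsdb(entry_type: EntryType, size: u64) -> bool {
    entry_type == EntryType::File && size >= LARGE_FILE_THRESHOLD
}

// Stand-in for `download_digest_to_local`: reject the ambiguous combination up
// front rather than storing an FSDB-bound blob in LMDB, where later look-ups
// (which only consult FSDB for such digests) would miss it.
fn download_digest_to_local(
    entry_type: EntryType,
    size: u64,
    f_remote: Option<fn(&[u8]) -> Result<(), String>>,
) -> Result<(), String> {
    if f_remote.is_some() && is_fsdb(entry_type, size) {
        return Err("f_remote must not be provided for FSDB-bound digests".to_string());
    }
    // ... fetch from the remote store and write to LMDB or FSDB based on `is_fsdb` ...
    Ok(())
}

fn main() {
    // A made-up validator, standing in for the directory-parsing check that
    // `Store::load_directory` performs on fetched bytes.
    let validate: fn(&[u8]) -> Result<(), String> = |_| Ok(());

    // A huge directory proto still belongs in LMDB, so validating it is fine:
    assert!(download_digest_to_local(EntryType::Directory, 1_400_000, Some(validate)).is_ok());
    // A huge file blob plus a validator would trip the guard:
    assert!(download_digest_to_local(EntryType::File, 1_400_000, Some(validate)).is_err());
}
```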
Pants' CI has seen (at least) two instances of a similar error recently, but both with small digests (a few KB): …
Just reporting that we got this again on …
I was seeing this on … We "fixed" it by adding pants + related files to the exclusion list in SentinelOne about a week ago. I haven't seen this since the "fix" went in, whereas before it'd happen pretty frequently on …
Interesting. We also use SentinelOne internally.
Aha! @engnatha can you report back here if this solves things?
Hi @engnatha, wondering if you have an update on this? I'm seeing something similar locally now, and it's definitely not related to SentinelOne...
We actually haven't seen this in a while. I don't believe our security team ever got around to enabling allow-list exceptions for pants-managed files.
I am also still seeing this on 2.22.0a0
FYI …
Just observed this failure on CI on 2.22.0, using a remote cache. I wonder if build cancellation is a trigger here, where we end up with a partial write of something/things?
I don't think there's much information to be had here, fundamentally, but on the off chance there is, it would be interesting. Relates to: #19748, which we've started seeing quite regularly at work.
Quite interesting stats for a failure on our CI... we made one cache check; missed, and then exploded.
And:
So it seems to indicate that we did read 1169 bytes in 60µs, but we failed somewhere else? Also, this error message implies that we don't have a remote store, but we definitely have one configured...
This seems like a smoking gun to me. @huonw you did a bunch of refactors of this code a while back (around the same time I added mTLS support) - any obvious things to check from your end? Or do you think this is an expected error/red herring?
Describe the bug
A little hard to capture exactly what was happening. In our CI environment, we were getting a lot of intermittent failures with a similar traceback about a failure to find a process to backtrack for a "missing digest". A repro is elusive here as we couldn't replicate on user accounts.
Here's a sample traceback from when we had things configured to package a pex binary and then run that binary.
An identical error was raised when we ran things with just a `pants run` command and avoided `package` altogether.

Pants version
2.17.0 with the 0.10.0 launcher binary
OS
Ubuntu 20.04
Additional info
Interestingly, at the recommendation of @huonw, we tried prepending `PANTS_LOCAL_STORE_DIR=$(mktemp -d) <cmd>` and things worked repeatedly enough when force-scheduling builds. This led us to believe that wiping the cache on that runner would make things happy. Unfortunately, the next build after resetting the cache resulted in the same errors above.