Protect against partial actions in the cache #6
Comments
To clarify, this was an RBE bug which, IMO, is against the API spec. The API states that if an action was not executed, the server returns a failure status in the ExecuteResponse field. Obviously, any status other than OK is considered an action failure and should not populate the remote cache (unlike what we accidentally did). But amending the API to additionally consider output-less actions as failures may be a good idea regardless. |
Adding to list of issues to consider for v3. Not clear what the right answer is, but we can consider it at that time. |
I can think of one scenario where an ActionResult with no output apart from the exit code might be useful: sanity checks. Maybe in v3 it would be sufficient to require that the ActionResult's execution_metadata field has at least one non-default field value? Setting the worker name would be the simplest way to implement this. |
Actually, maybe we can advise that the backend implement this behaviour in REAPIv2: when receiving an ActionResult that does not have the worker set in the metadata, set it to some non-default value (such as the IP address of the client)? This allows for non-zero filesize checks on the backend, and we can consider making it compulsory for the client to set this field in REAPIv3 (which would then allow us to catch some client and transmission errors). (I'm trialing such a workaround in bazel-remote at the moment.) |
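The backend-side workaround suggested in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not bazel-remote's actual code: the dataclasses are simplified stand-ins for the REAPI ActionResult and ExecutedActionMetadata messages, and fill_worker_if_missing and client_address are hypothetical names. The idea leans on proto3 encoding, where a message whose fields all hold default values serializes to zero bytes, so a non-empty worker string guarantees a non-empty blob.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutedActionMetadata:
    """Simplified stand-in for the REAPI message; only `worker` is modelled."""
    worker: str = ""


@dataclass
class ActionResult:
    """Simplified stand-in for the REAPI ActionResult message."""
    exit_code: int = 0
    output_files: List[str] = field(default_factory=list)
    execution_metadata: ExecutedActionMetadata = field(
        default_factory=ExecutedActionMetadata
    )


def fill_worker_if_missing(result: ActionResult, client_address: str) -> ActionResult:
    """Hypothetical helper: make sure the metadata carries a non-default worker.

    If the uploader left `worker` empty, substitute an identifier for the
    client (the comment suggests its IP address).
    """
    if not result.execution_metadata.worker:
        result.execution_metadata.worker = client_address
    return result
```

With this in place, a backend could reject zero-length ActionResult blobs outright, since every legitimately stored result would carry at least the worker field.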
One of our customers ran into this situation recently and it's unclear to me where the problem should be fixed: on the client side (Bazel), in the API spec (this repo), on the server side, or a combination of the above. The problematic code in Bazel is over at https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java;l=1266-1288;drc=3e79472690126689304c714711d911395db3a278. In most action implementations in Bazel, all declared outputs in Command would be treated as mandatory. But in some special native actions (CppCompile, JavaCompile, etc.) some output paths would be considered "optional". This "optional/mandatory" property is definitely not described in the current API spec and thus is not known to the server side, but the client side does apply a validation over the ActionResult and yell Our customers have some internal setup where the remote action is executed via multiple layers of wrappers. In some infrastructure changes (e.g. server rotation), one of these wrappers handles SIGTERM in a way that it would Our current solution is to advise the customer to fix their wrapper to produce a non-zero exit code in case of SIGTERM. However, I think we should discuss what the right solution is moving forward. A couple of options to consider:
IMO (1) will provide a better user experience, as the server could provide verbose errors for the client regarding missing outputs. But it will take a bit more effort to implement and enforce. (2) is quick and easy, as it's closer to the current state. But it delegates more responsibilities toward build rules authors and end-users, causing worse UX. cc: @EdSchouten for V3 consideration. |
@sluongng As far as I can tell, there's no distinction between "mandatory" and "optional" outputs in Bazel. The declared outputs of a spawn are exactly the ones we populate I do agree that, if we are to always treat outputs as mandatory, it's problematic for a remote implementation to cache action results where some of the declared outputs are missing. Unless my analysis above is wrong and we do have a use for optional outputs in Bazel, I'd strongly prefer (1). |
I don't really understand what the issue is here.
The ActionResult you get back simply describes which files were present after the action ran. It is up to the client to determine whether those results pass any client-imposed restrictions. |
@tjgq The default implementation for
I think this is the key question here. From our server's perspective, there was no infra failure. Our workers were gracefully shutting down and the action on said worker exited with code zero after being sent SIGTERM. We interpreted a zero exit code, generated by a user-provided wrapper, as a successful action. Hence we wrote the Action Result to the AC. If we could agree that a user action that exits with code zero, but does not produce all the outputs listed in Command, is a failed action, then we could implement that check (for missing outputs) on the worker side and avoid writing that Action Result to the AC. However, this is not how some Bazel native actions currently interpret things, so we would need to fix those native actions accordingly before rolling out such a validation scheme on the worker side. |
If workers are shutting down gracefully, why are they sending SIGTERM to the build action? That's not graceful. If workers send SIGTERM to an action, they should already record some state in their bookkeeping that they did not allow the action to run to completion, and should therefore not write anything to the AC, regardless of the exit code returned by the build action. |
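The bookkeeping described in this comment might be sketched as follows. ActionRun, interrupted_by_worker and may_write_to_action_cache are hypothetical names, not taken from any real worker implementation, and the sketch assumes the policy mentioned earlier in the thread of only caching zero-exit results.

```python
from dataclasses import dataclass


@dataclass
class ActionRun:
    """Hypothetical worker-side record for one execution of an action."""
    exit_code: int
    # Set by the worker itself when it signalled the action (for example
    # with SIGTERM during shutdown) before the action finished on its own.
    interrupted_by_worker: bool = False


def may_write_to_action_cache(run: ActionRun) -> bool:
    """Decide whether the result of `run` may be stored in the AC."""
    if run.interrupted_by_worker:
        # The action was not allowed to run to completion, so its exit
        # code says nothing about success, even if it happens to be zero.
        return False
    # Assumed policy for this sketch: only zero-exit results are cached.
    return run.exit_code == 0
```

The point of the design is that the decision depends on state the worker already owns, rather than on whatever exit code a user-provided wrapper chooses to return after being signalled.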
@sluongng Apologies; you're completely right. We do have a notion of optional outputs (for Java and C++). I don't feel like I have a solid understanding of why they're necessary at this time, but in the absence of one, I'm going to assume they exist for $important_reasons, and we can't (easily) get rid of them. Still, even if we establish that an action is allowed to produce partial outputs, it's unclear to me that we need to revise the definition of success: an action should deterministically produce the same subset of the outputs when presented with the same inputs. A client might later interpret a missing output as an error, but that shouldn't compromise the ability to reuse the action result (it would just deterministically lead to the same error at the client level). But, to Ed's point, under the current definition of success, you do need to ensure that infrastructure failures always result in a nonzero exit code or a non-success status (which seems completely orthogonal to the discussion of mandatory vs. optional outputs). |
This is a good point. I guess we could consider all commands that are interrupted by infrastructure reasons to be retriable failures, regardless of the exit code. However, this is only one of the possible causes that could yield this "Invalid action cache entry" error.
I think we should codify the definition of success in the proto documentation:
Right now, I would assume that most server implementations are going with (1). If we were to agree upon (1), which is what I am picking up from @tjgq's message, then we could have a separate discussion in Bazel regarding guidance on user-provided build rules. |
Again, under the assumption of determinism, I don't necessarily see this as a problem: Bazel would consider the action result invalid, but that's fine because you'd have to change something in the
To be completely clear, what "success" means here is whether the action result is allowed to be cached, right? (i.e., a nonzero exit would still result in an For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the |
Agree. Option 3 seems like a great solution :) Is it possible to include it in REv2 or is it disruptive and have to wait for REv3 (as proposed in some earlier comments in this issue a few years ago)? |
I personally don't see the point in doing that. Clients already can't necessarily trust the ActionResult they receive. Even if the server does some checking on its end, there is no absolute guarantee that a call to GetActionResult() will yield an ActionResult message that contains entries for all the paths you're interested in. Clients need to do some form of error handling based on that anyway. Otherwise they would crash on a null pointer dereference/KeyError exception/... |
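As a rough illustration of the client-side error handling mentioned here, the check can be as small as a set difference between the paths the client needs and the paths the ActionResult actually lists. The function name and the plain string paths are assumptions made for this sketch; it is not Bazel's implementation.

```python
from typing import Iterable, List


def missing_outputs(
    expected_paths: Iterable[str], returned_paths: Iterable[str]
) -> List[str]:
    """Return the expected paths that the ActionResult does not contain.

    `expected_paths` are the outputs the client treats as mandatory;
    `returned_paths` are the paths listed in the ActionResult it received.
    """
    returned = set(returned_paths)
    return sorted(p for p in set(expected_paths) if p not in returned)
```

A client would report a non-empty return value as a build error naming the missing files, instead of failing later on a lookup for a path that was never there.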
How about simply codifying that an implementation MUST NOT cache an |
The key differentiation here is for the server to know how the client would validate the ActionResult. If the only validation happens on the client side, then by that time, the AC entry might have already been written. We don't really provide an API in the spec for the user to purge a specific AC entry then, thus poorer UX. |
The issue is that it's virtually impossible for the server to validate the ActionResult. File existence may not be sufficient. For example, I've seen cases where programs terminate with exit zero, even though they were only halfway done. Even worse: I have seen workers with memory corruption, yielding output files containing bit flips. In order for a server to detect those cases based on generated outputs, you need something far stronger than simply checking whether a file exists. Regex matching of file contents? Invocation of a separate validation tool? It all leads to a path with no clear outcome.
With the use case discussed above, there is no need to actually purge the AC entry. The client can just ignore it and rerun the action. This causes the AC entry to be overwritten. That said, if we really want to offer some kind of special API for removing AC entries, this is worth discussing. |
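The "ignore it and rerun" flow can be sketched like this. A plain dict stands in for the Action Cache and a list of paths stands in for an ActionResult; get_or_rerun, required_paths and execute are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Set


def get_or_rerun(
    action_cache: Dict[str, List[str]],
    action_digest: str,
    required_paths: Set[str],
    execute: Callable[[], List[str]],
) -> List[str]:
    """Use the cached result if it has every required path, else rerun.

    A rerun overwrites the existing entry, so no explicit purge is needed.
    """
    cached = action_cache.get(action_digest)
    if cached is not None and required_paths <= set(cached):
        return cached
    fresh = execute()
    action_cache[action_digest] = fresh  # overwrite the unusable entry
    return fresh
```

Note that this only helps when the rerun can produce a different result, as with the interrupted action discussed above. For an action that deterministically omits an output, the rerun writes back the same entry.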
For the sake of easier demonstration of the issue, I have created an example repo here https://github.com/sluongng/bb-rbe-test/blob/sluongng/invalid-ac/BUILD
Building this remotely with our server currently will make Bazel spew out
However, this ActionResult is written into our AC for subsequent builds to re-use because the exit code was zero.
Wouldn't a rerun yield the same (erroneous) cached result from AC? Or are you suggesting that the client (user) should fix their build rules, generate a new |
This is the point I'm making above. Fixing the But |
Exactly what @tjgq says. :-) The AC entry that is created in case of the erroneous action is when taken in isolation valid. You're only touching 1.txt, so you get an ActionResult that only contains 1.txt. The AC entry, though not very useful, will in no way conflict with newer versions of the rule that do touch the correct set of files. |
This is valuable feedback, delivered with great clarity. 🙏 Do you think it's worth keeping this issue open for a V3 discussion, or should we close it? |
All things considered, I'm comfortable with the status quo of allowing partial outputs and still caching the respective results. But I do think we need to codify that results with |
@tjgq Then we're actually going back and forth on this. In the past it wasn't permitted to write AC entries with Buildbarn still uses the historical behaviour of only writing AC entries in the case of |
Oh, I wasn't aware of that bit of history. I suppose we could at best make it configurable, then? (e.g. an |
…an ActionResult. The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java). The current error message makes it sound like a programmer error rather than a user or rules author error. See also bazelbuild/remote-apis#6 for the discussion that prompted this fix. PiperOrigin-RevId: 586625265 Change-Id: I8846614917c82eff87c8495696e55b80c096c02c
…ng from an ActionResult. (#20380) The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java). The current error message makes it sound like a programmer error rather than a user or rules author error. See also bazelbuild/remote-apis#6 for the discussion that prompted this fix. PiperOrigin-RevId: 586625265 Change-Id: I8846614917c82eff87c8495696e55b80c096c02c
Currently, when an execution fails, we allow returning a partially-populated ActionResult message. RBE experienced a failure where these partial results were making it into the cache, and because the exit_code field was unpopulated, it was interpreted as 0. Subsequent builds read the result from the cache and interpreted it as successful, but since it had no output files the build failed later when the requisite file was not present. It's certainly believable that a failure like this could reoccur in RBE or in other implementations, so we would like to put protections in place against it.
As I understand it, Bazel's architecture does not easily lend itself to having outputs be mandatory, which is why it can only detect the failure downstream. This is why all outputs are considered optional at the API level; even trying to separate out optional and mandatory outputs on the Bazel side might prove difficult.
One suggestion was to require that all action results have at least one output file or directory to be considered valid; an action that has no meaningful output files could add a dummy output and touch it on the bot side (or even include it as an input) to ensure that the empty ActionResult is not propagated.
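The "at least one output" rule suggested above could be checked along these lines. The dataclass is a simplified stand-in for the REAPI ActionResult message (only the two output lists are modelled) and has_any_output is a hypothetical name. This sketches the proposal only and does not describe the behaviour of any existing server.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionResult:
    """Simplified stand-in for the REAPI ActionResult message."""
    exit_code: int = 0
    output_files: List[str] = field(default_factory=list)
    output_directories: List[str] = field(default_factory=list)


def has_any_output(result: ActionResult) -> bool:
    """True if the result lists at least one output file or directory."""
    return bool(result.output_files or result.output_directories)
```

Under this rule a default-constructed ActionResult, which is what an unpopulated message looks like on the reading side, would be rejected. An action with no meaningful outputs would need the dummy output described above to pass the check.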