Protect against partial actions in the cache #6

alercah · 2018-06-05T15:36:41Z

Currently, when an execution fails, we allow returning a partially-populated ActionResult message. RBE experienced a failure where these partial results were making it into the cache, and because the exit_code field was unpopulated, it was interpreted as 0. Subsequent builds read the result from the cache and interpreted it as successful, but since it had no output files the build failed later when the requisite file was not present. It's certainly believable that a failure like this could reoccur in RBE or in other implementations, so we would like to put protections in place against it.

As I understand it, Bazel's architecture does easily lend itself to having outputs be mandatory, which is why it can only detect the failure downstream. This is why all outputs are considered optional at the API level; even trying to separate out optional and mandatory outputs on the Bazel side might prove difficult.

One suggestion was to require that all action results have at least one output file or directory to be considered valid; an action that has no meaningful output files could add a dummy output and touch it on the bot side (or even include it as an input) to ensure that the empty ActionResult is not propagated.

The text was updated successfully, but these errors were encountered:

ola-rozenfeld · 2018-06-11T15:41:58Z

To clarify, this was an RBE bug which imo is against the API spec. The API states that if an action was not executed, the server returns a failure status in the ExecuteResponse field. Obviously, whenever this status is not OK is considered an action failure and should not populate the remote cache (unlike what we accidentally did).

But amending the API to additionally consider output-less actions as failures may be a good idea regardless.

bergsieker · 2019-06-03T20:36:29Z

Adding to list of issues to consider for v3. Not clear what the right answer is, but we can consider it at that time.

mostynb · 2019-11-08T20:16:16Z

I can think of one scenario where an ActionResult with no output apart from the exit code might be useful: sanity checks.

Maybe in v3 it would be sufficient to require that the ActionResult's execution_metadata field has at least one non-default field value? Setting the worker name would be the simplest way to implement this.

mostynb · 2019-11-12T13:33:30Z

Actually, maybe we can advise that the backend implement this behaviour in REAPIv2: when receiving an ActionResult that does not have the worker set in the metadata, set it to some non-default value (such as the IP address of the client)? This allows for non-zero filesize checks on the backend, and we can consider making it compulsory for the client to set this field in REAPIv3 (which would then allow us to catch some client and transmission errors).

(I'm trialing such a workaround in bazel-remote at the moment.)

* Add a starlark deps function that can be used by repos depending on this to load the necessary dependencies.

sluongng · 2023-11-28T10:19:39Z

One of our customers run into this situation recently and it's unclear to me where the problem should be fixed: on the client side (Bazel), on the API spec (this repo), on the Server side, or a combination of the above.

The problematic code in Bazel is over at https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java;l=1266-1288;drc=3e79472690126689304c714711d911395db3a278. In most action implementation in Bazel, all declared outputs in Command would be treated as mandatory. But in some special native actions (CppCompile, JavaCompile, etc..) some output paths would be considered as "optional". This "optional/mandatory" property is definitely not described in the current API spec and thus, is not known to the server side, but the client side does apply a validation over the ActionResult and yell Invalid action cache entry if some of the "mandatory" were missing.

Our customers have some internal setup where the remote action is executed via multiple layers of wrappers. In some infrastructure changes (i.e. server rotation), one of these wrappers handles SIGTERM in a way that it would exit 0, signaling a successful action, without producing the outputs that the client deemed to be mandatory. Because of exit code zero, our server treated this as a valid ActionResult and recorded the partial outputs into AC. However, the users would experience confusing errors from Bazel as they would see it as Invalid action cache entry and question the cache integrity.

Our current solution is to advise the customer to fix their wrapper to produce a non-zero exit code in case of SIGTERM. However, I think we should discuss what is the right solution here moving forward. A couple of options to consider:

Client and Server both agree that all outputs listed in Command are Mandatory when exit code zero is given. This way, the server could perform extra validation over the missing outputs and avoid invalid writing ActionResults into AC that the client won't be able to accept. This means that the client is responsible for ensuring all actions need to create corresponding outputs (i.e. create an empty file if the file is previously "optional").
The client and Server both agree that optional outputs are possible. ActionResult validity is judged solely based on the exit code equal to zero. The client needs to make sure that when exit code zero, the output files are correct. Subsequent downstream action failures for missing inputs (as previous outputs were not given) should be handled on the client side.

IMO (1) will provide a better user experience, as the server could provide verbose errors for the client regarding missing outputs. But it will take a bit more effort to implement and enforce. (2) is quick and easy, as it's closer to the current state. But it delegates more responsibilities toward build rules authors and end-users, causing worse UX.

cc: @EdSchouten for V3 consideration.

tjgq · 2023-11-28T19:40:45Z

@sluongng As far as I can tell, there's no distinction between "mandatory" and "optional" outputs in Bazel. The declared outputs of a spawn are exactly the ones we populate Command.output_paths with (see here). My reading of the spec is that the ActionResult.output_* fields are required to be a subset of Command.output_paths. Therefore, the code you linked to would seem to imply that we treat every output as mandatory. I don't expect native actions to invalidate this, because the implementation is the same as far as remote caching/execution is concerned; can you clarify in what way C++/Java actions differ?

I do agree that, if we are to always treat outputs as mandatory, it's problematic for a remote implementation to cache action results where some of the declared outputs are missing. Unless my analysis above is wrong and we do have a use for optional outputs in Bazel, I'd strongly prefer (1).

EdSchouten · 2023-11-28T19:44:47Z

I don't really understand what the issue is here.

If an infrastructure failure occurred while running the action, Execute() will return an ExecuteResponse that has a non-null status. Even if exit_code is set to zero in that case, clients can distinguish that from normal process termination.
In the case of GetActionResult() there is no ExecuteResponse. You get the bare ActionResult back. But why would a remote execution system write anything into the Action Cache in case of an infrastructure failure? It is completely valid to just drop it on the floor. My assumption was that any reasonable implementation would already do that.

The ActionResult you get back simply describes which files were present after the action ran. It is up to the client to determine whether those results pass any client-imposed restrictions.

sluongng · 2023-11-28T19:56:32Z

@tjgq The default implementation for isMandatoryOutput in Spawn.java could be overridden by classes that implement that interface. For example: SimpleSpawn.java (which is used in CcCompile?) and JavaCompileAction.java (used in JavaCompile?) both override it to set some output paths to be optional.

@EdSchouten

But why would a remote execution system write anything into the Action Cache in case of an infrastructure failure?

I think this is the key question here.

From our server perspective, there was no infra failure. Our workers were gracefully shutting down and the action on said worker exited with code zero after given SIGTERM. We interpreted a zero exit code, generated by a user's provided wrapper, as a successful action. Hence we wrote the Action Result to AC.

If we could agree that a user action exits with code zero, but does not produce all the listed outputs in Command, is a fail action, then we could implement that check (for missing outputs) on the worker side and avoid writing that Action Result to AC. However, this is not how some Bazel native actions are currently interpreting things, so we would need to fix those native actions accordingly before rolling out such a validation scheme on the worker side.

EdSchouten · 2023-11-28T20:20:42Z

If workers are shutting down gracefully, why are they sending SIGTERM to the build action? That's not graceful.

If workers send SIGTERM to an action, they should already record some state in their bookkeeping that they did not allow the action to run to completion, and should therefore not write anything to the AC, regardless of the exit code returned by the build action.

tjgq · 2023-11-28T21:17:31Z

@sluongng Apologies; you're completely right. We do have a notion of optional outputs (for Java and C++). I don't feel like I have a solid understanding of why they're necessary at this time, but in the absence of one, I'm going to assume they exist for $important_reasons, and we can't (easily) get rid of them.

Still, even if we establish that an action is allowed to produce partial outputs, it's unclear to me that we need to revise the definition of success: an action should deterministically produce the same subset of the outputs when presented with the same inputs. A client might later interpret a missing output as an error, but that shouldn't compromise the ability to reuse the action result (it would just deterministically lead to the same error at the client level).

But, to Ed's point, under the current definition of success, you do need to ensure that infrastructure failures always result in a nonzero exit code or a non-success status (which seems completely orthogonal to the discussion of mandatory vs. optional outputs).

sluongng · 2023-11-29T09:29:37Z

If workers are shutting down gracefully, why are they sending SIGTERM to the build action? That's not graceful.

This is a good point. I guess we could consider all commands that are interrupted by infrastructure reasons to be retriable failures, regardless of the exit code.

However, this is only one of the possible causes that could yield this "Invalid action cache entry" error.
A user-provided command could fail to produce some requested outputs because of other reasons and then exit with code zero as well. Such "Action Cache Entry" would be considered invalid by some clients (i.e. Bazel).

Still, even if we establish that an action is allowed to produce partial outputs, it's unclear to me that we need to revise the definition of success: an action should deterministically produce the same subset of the outputs when presented with the same inputs.

I think we should codify the definition of success in the proto documentation:

A zero exit code, regardless of actual outputs matching requested outputs in Command.
OR a zero exit code AND all requested outputs in Command are available.

Right now, I would assume that most server implementations are going with (1).
However, Bazel is performing client-side validation for (2) in most cases.

If we were to agree upon (1), which is what I am picking up from @tjgq's message, then we could have a separate discussion in Bazel regarding guidance on user-provided build rules.

tjgq · 2023-11-29T09:56:17Z

A user-provided command could fail to produce some requested outputs because of other reasons and then exit with code zero as well. Such "Action Cache Entry" would be considered invalid by some clients (i.e. Bazel).

Again, under the assumption of determinism, I don't necessarily see this as a problem: Bazel would consider the action result invalid, but that's fine because you'd have to change something in the Command to produce a valid result (as opposed to just rerunning it and expecting a different result). It does place a burden on users (which in the Bazel case means not just Bazel itself, but also rule authors) to ensure all error conditions produce a nonzero exit, but that burden already exists irrespective of mandatory outputs.

I think we should codify the definition of success in the proto documentation:

To be completely clear, what "success" means here is whether the action result is allowed to be cached, right? (i.e., a nonzero exit would still result in an ExecutionResponse.status == OK?)

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

sluongng · 2023-11-29T09:59:08Z

Agree. Option 3 seems like a great solution :)

Is it possible to include it in REv2 or is it disruptive and have to wait for REv3 (as proposed in some earlier comments in this issue a few years ago)?

EdSchouten · 2023-11-29T10:02:36Z

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

I personally don't see the point in doing that. Clients already can't necessarily trust the ActionResult they receive. Even if the server does some checking on its end, there is no absolute guarantee that a call to GetActionResult() will yield an ActionResult message that contains entries for all the paths you're interested in. Clients need to do some form of error handling based on that anyway. Otherwise they would crash on a null pointer dereference/KeyError exception/...

tjgq · 2023-11-29T10:05:30Z

How about simply codifying that an implementation MUST NOT cache an ActionResult with exit_code != 0? Would that still be too disruptive for v2? (I feel like this is a pretty glaring omission - caching nonzero exits makes it very easy for a flaky test to poison a cache!)

sluongng · 2023-11-29T10:07:43Z

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

I personally don't see the point in doing that. Clients already can't necessarily trust the ActionResult they receive. Even if the server does some checking on its end, there is no absolute guarantee that a call to GetActionResult() will yield an ActionResult message that contains entries for all the paths you're interested in. Clients need to do some form of error handling based on that anyway.

The key differentiation here is for the server to know how the client would validate the ActionResult.
If server could validate the result in the same way client does, then storing the result in AC would be relatively safe.

If the only validation happens on the client side, then by that time, the AC entry might have already been written. We don't really provide an API in the spec for the user to purge a specific AC entry then, thus poorer UX.

EdSchouten · 2023-11-29T10:17:08Z

The issue is that it's virtually impossible for the server to validate the ActionResult. File existence may not be sufficient. For example, I've seen cases where programs terminate with exit zero, even though they were only halfway done. Even worse: I have seen workers with memory corruption, yielding output files containing bit flips. In order for a server to detect those cases based on generated outputs, you need something far stronger than simply checking whether a file exists. Regex matching of file contents? Invocation of a separate validation tool? It all leads to a path with no clear outcome.

If the only validation happens on the client side, then by that time, the AC entry might have already been written. We don't really provide an API in the spec for the user to purge a specific AC entry then, thus poorer UX.

With the use case discussed above, there is no need to actually purge the AC entry. The client can just ignore it and rerun the action. This causes the AC entry to be overwritten. That said, if we really want to offer some kind of special API for removing AC entries, this is worth discussing.

sluongng · 2023-11-29T10:43:18Z

For the sake of easier demonstration of the issue, I have created an example repo here https://github.com/sluongng/bb-rbe-test/blob/sluongng/invalid-ac/BUILD

genrule(
    name = "temp",
    outs = [
        "1.txt",
        "2.txt",
    ],
    cmd = """
    touch $(execpath 1.txt)
    exit 0
    """,
    exec_properties = {
        # "workload-isolation-type": "firecracker",
    },
)

Building this remotely with our server currently will make Bazel pew out

Invalid action cache entry ec4b980964439d5ca71081dc87a30e7a881e66e7af9bb9d6924cbf012edcf1dc: expected output 2.txt does not exist.

However, this ActionResult is written into our AC for subsequent builds to re-use because the exit code was zero.

With the use case discussed above, there is no need to actually purge the AC entry. The client can just ignore it and rerun the action. This causes the AC entry to be overwritten.

Wouldn't a rerun yield the same (error-nous) cached result from AC? Or are you suggesting that the client (user) should fix their build rules, generate a new Command, and then execute it in a new Action? Perhaps I am missing something here.

tjgq · 2023-11-29T10:51:58Z

Or are you suggesting that the client (user) should fix their build rules, generate a new Command, and then execute it in a new Action?

This is the point I'm making above. Fixing the genrule to produce 2.txt would result in a distinct Command and would not hit the existing AC entry.

But Invalid action cache entry is a really bad error message (it makes it sound like a programmer error, but it's actually a rule author error). I think it should say something along the lines of mandatory output 2.txt was not produced by action execution.

EdSchouten · 2023-11-29T11:10:23Z

Exactly what @tjgq says. :-)

The AC entry that is created in case of the erroneous action is when taken in isolation valid. You're only touching 1.txt, so you get an ActionResult that only contains 1.txt. The AC entry, though not very useful, will in no way conflict with newer versions of the rule that do touch the correct set of files.

sluongng · 2023-11-29T11:14:12Z

These are valuable feedbacks with great clarity. 🙏
Much appreciated.

Do you think it's worth keeping this issue open for a V3 discussion? or should we close it?

tjgq · 2023-11-29T11:16:09Z

All things considered, I'm comfortable with the status quo of allowing partial outputs and still caching the respective results. But I do think we need to codify that results with exit_code != 0 must not be cached (otherwise, a lot of things would probably break). I'm hoping that's small enough for v2, and added it to the agenda for the next REAPI meeting.

EdSchouten · 2023-11-29T11:20:57Z

@tjgq Then we're actually going back and forth on this. In the past it wasn't permitted to write AC entries with exit_code != 0, but some people outside of Bazel explicitly asked for this. (Buildstream? Not sure.)

Buildbarn still uses the historical behaviour of only writing AC entries in the case of exit_code == 0. The reason being that I write those AC entries into the CAS instead, inside a message named HistoricalExecuteResponse. This allows me to obtain a stable link to a historical build failure for people to inspect.

tjgq · 2023-11-29T12:17:34Z

@tjgq Then we're actually going back and forth on this. In the past it wasn't permitted to write AC entries with exit_code != 0, but some people outside of Bazel explicitly asked for this. (Buildstream? Not sure.)

Oh, I wasn't aware of that bit of history. I suppose we could at best make it configurable, then? (e.g. an ExecuteRequest.no_cache_exit_nonzero field)? I think this behavior is pretty much required to use Bazel effectively: since we use the exit code to determine test success, caching nonzero exits means flaky tests can poison the cache. (I'm less sure that it matters for non-test actions.)

…an ActionResult. The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java). The current error message makes it sound like a programmer error rather than a user or rules author error. See also bazelbuild/remote-apis#6 for the discussion that prompted this fix. PiperOrigin-RevId: 586625265 Change-Id: I8846614917c82eff87c8495696e55b80c096c02c

…ng from an ActionResult. The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java). The current error message makes it sound like a programmer error rather than a user or rules author error. See also bazelbuild/remote-apis#6 for the discussion that prompted this fix. PiperOrigin-RevId: 586625265 Change-Id: I8846614917c82eff87c8495696e55b80c096c02c

…ng from an ActionResult. (#20380) The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java). The current error message makes it sound like a programmer error rather than a user or rules author error. See also bazelbuild/remote-apis#6 for the discussion that prompted this fix. PiperOrigin-RevId: 586625265 Change-Id: I8846614917c82eff87c8495696e55b80c096c02c

bergsieker added this to the Remote Execution API v2 milestone Mar 12, 2019

EricBurnett mentioned this issue May 22, 2019

not all outputs were created or valid (Remote caching, all platforms) bazelbuild/bazel#8125

Closed

bergsieker modified the milestones: Remote Execution API v2, Remote Execution API v3 Jun 3, 2019

EricBurnett mentioned this issue Apr 20, 2020

V3 idea: No longer allow Digest.size_bytes <= 0 #134

Open

santigl pushed a commit to santigl/remote-apis that referenced this issue Aug 26, 2020

Add a starlark deps function. (bazelbuild#6)

f6c7942

* Add a starlark deps function that can be used by repos depending on this to load the necessary dependencies.

tjgq mentioned this issue Nov 30, 2023

[7.0.0] Emit a clearer error message when a mandatory output is missing from an ActionResult. bazelbuild/bazel#20380

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Protect against partial actions in the cache #6

Protect against partial actions in the cache #6

alercah commented Jun 5, 2018 •

edited

Loading

ola-rozenfeld commented Jun 11, 2018

bergsieker commented Jun 3, 2019

mostynb commented Nov 8, 2019

mostynb commented Nov 12, 2019

sluongng commented Nov 28, 2023

tjgq commented Nov 28, 2023 •

edited

Loading

EdSchouten commented Nov 28, 2023 •

edited

Loading

sluongng commented Nov 28, 2023

EdSchouten commented Nov 28, 2023 •

edited

Loading

tjgq commented Nov 28, 2023 •

edited

Loading

sluongng commented Nov 29, 2023

tjgq commented Nov 29, 2023 •

edited

Loading

sluongng commented Nov 29, 2023

EdSchouten commented Nov 29, 2023 •

edited

Loading

tjgq commented Nov 29, 2023 •

edited

Loading

sluongng commented Nov 29, 2023

EdSchouten commented Nov 29, 2023

sluongng commented Nov 29, 2023

tjgq commented Nov 29, 2023

EdSchouten commented Nov 29, 2023 •

edited

Loading

sluongng commented Nov 29, 2023

tjgq commented Nov 29, 2023 •

edited

Loading

EdSchouten commented Nov 29, 2023

tjgq commented Nov 29, 2023

Protect against partial actions in the cache #6

Protect against partial actions in the cache #6

Comments

alercah commented Jun 5, 2018 • edited Loading

ola-rozenfeld commented Jun 11, 2018

bergsieker commented Jun 3, 2019

mostynb commented Nov 8, 2019

mostynb commented Nov 12, 2019

sluongng commented Nov 28, 2023

tjgq commented Nov 28, 2023 • edited Loading

EdSchouten commented Nov 28, 2023 • edited Loading

sluongng commented Nov 28, 2023

EdSchouten commented Nov 28, 2023 • edited Loading

tjgq commented Nov 28, 2023 • edited Loading

sluongng commented Nov 29, 2023

tjgq commented Nov 29, 2023 • edited Loading

sluongng commented Nov 29, 2023

EdSchouten commented Nov 29, 2023 • edited Loading

tjgq commented Nov 29, 2023 • edited Loading

sluongng commented Nov 29, 2023

EdSchouten commented Nov 29, 2023

sluongng commented Nov 29, 2023

tjgq commented Nov 29, 2023

EdSchouten commented Nov 29, 2023 • edited Loading

sluongng commented Nov 29, 2023

tjgq commented Nov 29, 2023 • edited Loading

EdSchouten commented Nov 29, 2023

tjgq commented Nov 29, 2023

alercah commented Jun 5, 2018 •

edited

Loading

tjgq commented Nov 28, 2023 •

edited

Loading

EdSchouten commented Nov 28, 2023 •

edited

Loading

EdSchouten commented Nov 28, 2023 •

edited

Loading

tjgq commented Nov 28, 2023 •

edited

Loading

tjgq commented Nov 29, 2023 •

edited

Loading

EdSchouten commented Nov 29, 2023 •

edited

Loading

tjgq commented Nov 29, 2023 •

edited

Loading

EdSchouten commented Nov 29, 2023 •

edited

Loading

tjgq commented Nov 29, 2023 •

edited

Loading