Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect against partial actions in the cache #6

Open
alercah opened this issue Jun 5, 2018 · 24 comments
Open

Protect against partial actions in the cache #6

alercah opened this issue Jun 5, 2018 · 24 comments

Comments

@alercah
Copy link
Contributor

alercah commented Jun 5, 2018

Currently, when an execution fails, we allow returning a partially-populated ActionResult message. RBE experienced a failure where these partial results were making it into the cache, and because the exit_code field was unpopulated, it was interpreted as 0. Subsequent builds read the result from the cache and interpreted it as successful, but since it had no output files the build failed later when the requisite file was not present. It's certainly believable that a failure like this could reoccur in RBE or in other implementations, so we would like to put protections in place against it.

As I understand it, Bazel's architecture does easily lend itself to having outputs be mandatory, which is why it can only detect the failure downstream. This is why all outputs are considered optional at the API level; even trying to separate out optional and mandatory outputs on the Bazel side might prove difficult.

One suggestion was to require that all action results have at least one output file or directory to be considered valid; an action that has no meaningful output files could add a dummy output and touch it on the bot side (or even include it as an input) to ensure that the empty ActionResult is not propagated.

@ola-rozenfeld
Copy link
Contributor

To clarify, this was an RBE bug which imo is against the API spec. The API states that if an action was not executed, the server returns a failure status in the ExecuteResponse field. Obviously, whenever this status is not OK is considered an action failure and should not populate the remote cache (unlike what we accidentally did).

But amending the API to additionally consider output-less actions as failures may be a good idea regardless.

@bergsieker
Copy link
Collaborator

Adding to list of issues to consider for v3. Not clear what the right answer is, but we can consider it at that time.

@mostynb
Copy link
Contributor

mostynb commented Nov 8, 2019

I can think of one scenario where an ActionResult with no output apart from the exit code might be useful: sanity checks.

Maybe in v3 it would be sufficient to require that the ActionResult's execution_metadata field has at least one non-default field value? Setting the worker name would be the simplest way to implement this.

@mostynb
Copy link
Contributor

mostynb commented Nov 12, 2019

Actually, maybe we can advise that the backend implement this behaviour in REAPIv2: when receiving an ActionResult that does not have the worker set in the metadata, set it to some non-default value (such as the IP address of the client)? This allows for non-zero filesize checks on the backend, and we can consider making it compulsory for the client to set this field in REAPIv3 (which would then allow us to catch some client and transmission errors).

(I'm trialing such a workaround in bazel-remote at the moment.)

santigl pushed a commit to santigl/remote-apis that referenced this issue Aug 26, 2020
* Add a starlark deps function that can be used by repos depending on this to load the necessary dependencies.
@sluongng
Copy link
Contributor

One of our customers run into this situation recently and it's unclear to me where the problem should be fixed: on the client side (Bazel), on the API spec (this repo), on the Server side, or a combination of the above.

The problematic code in Bazel is over at https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java;l=1266-1288;drc=3e79472690126689304c714711d911395db3a278. In most action implementation in Bazel, all declared outputs in Command would be treated as mandatory. But in some special native actions (CppCompile, JavaCompile, etc..) some output paths would be considered as "optional". This "optional/mandatory" property is definitely not described in the current API spec and thus, is not known to the server side, but the client side does apply a validation over the ActionResult and yell Invalid action cache entry if some of the "mandatory" were missing.

Our customers have some internal setup where the remote action is executed via multiple layers of wrappers. In some infrastructure changes (i.e. server rotation), one of these wrappers handles SIGTERM in a way that it would exit 0, signaling a successful action, without producing the outputs that the client deemed to be mandatory. Because of exit code zero, our server treated this as a valid ActionResult and recorded the partial outputs into AC. However, the users would experience confusing errors from Bazel as they would see it as Invalid action cache entry and question the cache integrity.

Our current solution is to advise the customer to fix their wrapper to produce a non-zero exit code in case of SIGTERM. However, I think we should discuss what is the right solution here moving forward. A couple of options to consider:

  1. Client and Server both agree that all outputs listed in Command are Mandatory when exit code zero is given. This way, the server could perform extra validation over the missing outputs and avoid invalid writing ActionResults into AC that the client won't be able to accept. This means that the client is responsible for ensuring all actions need to create corresponding outputs (i.e. create an empty file if the file is previously "optional").

  2. The client and Server both agree that optional outputs are possible. ActionResult validity is judged solely based on the exit code equal to zero. The client needs to make sure that when exit code zero, the output files are correct. Subsequent downstream action failures for missing inputs (as previous outputs were not given) should be handled on the client side.

IMO (1) will provide a better user experience, as the server could provide verbose errors for the client regarding missing outputs. But it will take a bit more effort to implement and enforce. (2) is quick and easy, as it's closer to the current state. But it delegates more responsibilities toward build rules authors and end-users, causing worse UX.

cc: @EdSchouten for V3 consideration.

@tjgq
Copy link
Contributor

tjgq commented Nov 28, 2023

@sluongng As far as I can tell, there's no distinction between "mandatory" and "optional" outputs in Bazel. The declared outputs of a spawn are exactly the ones we populate Command.output_paths with (see here). My reading of the spec is that the ActionResult.output_* fields are required to be a subset of Command.output_paths. Therefore, the code you linked to would seem to imply that we treat every output as mandatory. I don't expect native actions to invalidate this, because the implementation is the same as far as remote caching/execution is concerned; can you clarify in what way C++/Java actions differ?

I do agree that, if we are to always treat outputs as mandatory, it's problematic for a remote implementation to cache action results where some of the declared outputs are missing. Unless my analysis above is wrong and we do have a use for optional outputs in Bazel, I'd strongly prefer (1).

@EdSchouten
Copy link
Collaborator

EdSchouten commented Nov 28, 2023

I don't really understand what the issue is here.

  • If an infrastructure failure occurred while running the action, Execute() will return an ExecuteResponse that has a non-null status. Even if exit_code is set to zero in that case, clients can distinguish that from normal process termination.
  • In the case of GetActionResult() there is no ExecuteResponse. You get the bare ActionResult back. But why would a remote execution system write anything into the Action Cache in case of an infrastructure failure? It is completely valid to just drop it on the floor. My assumption was that any reasonable implementation would already do that.

The ActionResult you get back simply describes which files were present after the action ran. It is up to the client to determine whether those results pass any client-imposed restrictions.

@sluongng
Copy link
Contributor

@tjgq The default implementation for isMandatoryOutput in Spawn.java could be overridden by classes that implement that interface. For example: SimpleSpawn.java (which is used in CcCompile?) and JavaCompileAction.java (used in JavaCompile?) both override it to set some output paths to be optional.

@EdSchouten

But why would a remote execution system write anything into the Action Cache in case of an infrastructure failure?

I think this is the key question here.

From our server perspective, there was no infra failure. Our workers were gracefully shutting down and the action on said worker exited with code zero after given SIGTERM. We interpreted a zero exit code, generated by a user's provided wrapper, as a successful action. Hence we wrote the Action Result to AC.

If we could agree that a user action exits with code zero, but does not produce all the listed outputs in Command, is a fail action, then we could implement that check (for missing outputs) on the worker side and avoid writing that Action Result to AC. However, this is not how some Bazel native actions are currently interpreting things, so we would need to fix those native actions accordingly before rolling out such a validation scheme on the worker side.

@EdSchouten
Copy link
Collaborator

EdSchouten commented Nov 28, 2023

If workers are shutting down gracefully, why are they sending SIGTERM to the build action? That's not graceful.

If workers send SIGTERM to an action, they should already record some state in their bookkeeping that they did not allow the action to run to completion, and should therefore not write anything to the AC, regardless of the exit code returned by the build action.

@tjgq
Copy link
Contributor

tjgq commented Nov 28, 2023

@sluongng Apologies; you're completely right. We do have a notion of optional outputs (for Java and C++). I don't feel like I have a solid understanding of why they're necessary at this time, but in the absence of one, I'm going to assume they exist for $important_reasons, and we can't (easily) get rid of them.

Still, even if we establish that an action is allowed to produce partial outputs, it's unclear to me that we need to revise the definition of success: an action should deterministically produce the same subset of the outputs when presented with the same inputs. A client might later interpret a missing output as an error, but that shouldn't compromise the ability to reuse the action result (it would just deterministically lead to the same error at the client level).

But, to Ed's point, under the current definition of success, you do need to ensure that infrastructure failures always result in a nonzero exit code or a non-success status (which seems completely orthogonal to the discussion of mandatory vs. optional outputs).

@sluongng
Copy link
Contributor

If workers are shutting down gracefully, why are they sending SIGTERM to the build action? That's not graceful.

This is a good point. I guess we could consider all commands that are interrupted by infrastructure reasons to be retriable failures, regardless of the exit code.

However, this is only one of the possible causes that could yield this "Invalid action cache entry" error.
A user-provided command could fail to produce some requested outputs because of other reasons and then exit with code zero as well. Such "Action Cache Entry" would be considered invalid by some clients (i.e. Bazel).

Still, even if we establish that an action is allowed to produce partial outputs, it's unclear to me that we need to revise the definition of success: an action should deterministically produce the same subset of the outputs when presented with the same inputs.

I think we should codify the definition of success in the proto documentation:

  1. A zero exit code, regardless of actual outputs matching requested outputs in Command.

  2. OR a zero exit code AND all requested outputs in Command are available.

Right now, I would assume that most server implementations are going with (1).
However, Bazel is performing client-side validation for (2) in most cases.

If we were to agree upon (1), which is what I am picking up from @tjgq's message, then we could have a separate discussion in Bazel regarding guidance on user-provided build rules.

@tjgq
Copy link
Contributor

tjgq commented Nov 29, 2023

A user-provided command could fail to produce some requested outputs because of other reasons and then exit with code zero as well. Such "Action Cache Entry" would be considered invalid by some clients (i.e. Bazel).

Again, under the assumption of determinism, I don't necessarily see this as a problem: Bazel would consider the action result invalid, but that's fine because you'd have to change something in the Command to produce a valid result (as opposed to just rerunning it and expecting a different result). It does place a burden on users (which in the Bazel case means not just Bazel itself, but also rule authors) to ensure all error conditions produce a nonzero exit, but that burden already exists irrespective of mandatory outputs.

I think we should codify the definition of success in the proto documentation:

To be completely clear, what "success" means here is whether the action result is allowed to be cached, right? (i.e., a nonzero exit would still result in an ExecutionResponse.status == OK?)

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

@sluongng
Copy link
Contributor

Agree. Option 3 seems like a great solution :)

Is it possible to include it in REv2 or is it disruptive and have to wait for REv3 (as proposed in some earlier comments in this issue a few years ago)?

@EdSchouten
Copy link
Collaborator

EdSchouten commented Nov 29, 2023

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

I personally don't see the point in doing that. Clients already can't necessarily trust the ActionResult they receive. Even if the server does some checking on its end, there is no absolute guarantee that a call to GetActionResult() will yield an ActionResult message that contains entries for all the paths you're interested in. Clients need to do some form of error handling based on that anyway. Otherwise they would crash on a null pointer dereference/KeyError exception/...

@tjgq
Copy link
Contributor

tjgq commented Nov 29, 2023

How about simply codifying that an implementation MUST NOT cache an ActionResult with exit_code != 0? Would that still be too disruptive for v2? (I feel like this is a pretty glaring omission - caching nonzero exits makes it very easy for a flaky test to poison a cache!)

@sluongng
Copy link
Contributor

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

I personally don't see the point in doing that. Clients already can't necessarily trust the ActionResult they receive. Even if the server does some checking on its end, there is no absolute guarantee that a call to GetActionResult() will yield an ActionResult message that contains entries for all the paths you're interested in. Clients need to do some form of error handling based on that anyway.

The key differentiation here is for the server to know how the client would validate the ActionResult.
If server could validate the result in the same way client does, then storing the result in AC would be relatively safe.

If the only validation happens on the client side, then by that time, the AC entry might have already been written. We don't really provide an API in the spec for the user to purge a specific AC entry then, thus poorer UX.

@EdSchouten
Copy link
Collaborator

The issue is that it's virtually impossible for the server to validate the ActionResult. File existence may not be sufficient. For example, I've seen cases where programs terminate with exit zero, even though they were only halfway done. Even worse: I have seen workers with memory corruption, yielding output files containing bit flips. In order for a server to detect those cases based on generated outputs, you need something far stronger than simply checking whether a file exists. Regex matching of file contents? Invocation of a separate validation tool? It all leads to a path with no clear outcome.

If the only validation happens on the client side, then by that time, the AC entry might have already been written. We don't really provide an API in the spec for the user to purge a specific AC entry then, thus poorer UX.

With the use case discussed above, there is no need to actually purge the AC entry. The client can just ignore it and rerun the action. This causes the AC entry to be overwritten. That said, if we really want to offer some kind of special API for removing AC entries, this is worth discussing.

@sluongng
Copy link
Contributor

For the sake of easier demonstration of the issue, I have created an example repo here https://github.com/sluongng/bb-rbe-test/blob/sluongng/invalid-ac/BUILD

genrule(
    name = "temp",
    outs = [
        "1.txt",
        "2.txt",
    ],
    cmd = """
    touch $(execpath 1.txt)
    exit 0
    """,
    exec_properties = {
        # "workload-isolation-type": "firecracker",
    },
)

Building this remotely with our server currently will make Bazel pew out

Invalid action cache entry ec4b980964439d5ca71081dc87a30e7a881e66e7af9bb9d6924cbf012edcf1dc: expected output 2.txt does not exist.

However, this ActionResult is written into our AC for subsequent builds to re-use because the exit code was zero.

With the use case discussed above, there is no need to actually purge the AC entry. The client can just ignore it and rerun the action. This causes the AC entry to be overwritten.

Wouldn't a rerun yield the same (error-nous) cached result from AC? Or are you suggesting that the client (user) should fix their build rules, generate a new Command, and then execute it in a new Action? Perhaps I am missing something here.

@tjgq
Copy link
Contributor

tjgq commented Nov 29, 2023

Or are you suggesting that the client (user) should fix their build rules, generate a new Command, and then execute it in a new Action?

This is the point I'm making above. Fixing the genrule to produce 2.txt would result in a distinct Command and would not hit the existing AC entry.

But Invalid action cache entry is a really bad error message (it makes it sound like a programmer error, but it's actually a rule author error). I think it should say something along the lines of mandatory output 2.txt was not produced by action execution.

@EdSchouten
Copy link
Collaborator

EdSchouten commented Nov 29, 2023

Exactly what @tjgq says. :-)

The AC entry that is created in case of the erroneous action is when taken in isolation valid. You're only touching 1.txt, so you get an ActionResult that only contains 1.txt. The AC entry, though not very useful, will in no way conflict with newer versions of the rule that do touch the correct set of files.

@sluongng
Copy link
Contributor

These are valuable feedbacks with great clarity. 🙏
Much appreciated.

Do you think it's worth keeping this issue open for a V3 discussion? or should we close it?

@tjgq
Copy link
Contributor

tjgq commented Nov 29, 2023

All things considered, I'm comfortable with the status quo of allowing partial outputs and still caching the respective results. But I do think we need to codify that results with exit_code != 0 must not be cached (otherwise, a lot of things would probably break). I'm hoping that's small enough for v2, and added it to the agenda for the next REAPI meeting.

@EdSchouten
Copy link
Collaborator

@tjgq Then we're actually going back and forth on this. In the past it wasn't permitted to write AC entries with exit_code != 0, but some people outside of Bazel explicitly asked for this. (Buildstream? Not sure.)

Buildbarn still uses the historical behaviour of only writing AC entries in the case of exit_code == 0. The reason being that I write those AC entries into the CAS instead, inside a message named HistoricalExecuteResponse. This allows me to obtain a stable link to a historical build failure for people to inspect.

@tjgq
Copy link
Contributor

tjgq commented Nov 29, 2023

@tjgq Then we're actually going back and forth on this. In the past it wasn't permitted to write AC entries with exit_code != 0, but some people outside of Bazel explicitly asked for this. (Buildstream? Not sure.)

Oh, I wasn't aware of that bit of history. I suppose we could at best make it configurable, then? (e.g. an ExecuteRequest.no_cache_exit_nonzero field)? I think this behavior is pretty much required to use Bazel effectively: since we use the exit code to determine test success, caching nonzero exits means flaky tests can poison the cache. (I'm less sure that it matters for non-test actions.)

copybara-service bot pushed a commit to bazelbuild/bazel that referenced this issue Nov 30, 2023
…an ActionResult.

The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java).

The current error message makes it sound like a programmer error rather than a user or rules author error.

See also bazelbuild/remote-apis#6 for the discussion that prompted this fix.

PiperOrigin-RevId: 586625265
Change-Id: I8846614917c82eff87c8495696e55b80c096c02c
tjgq added a commit to tjgq/bazel that referenced this issue Nov 30, 2023
…ng from an ActionResult.

The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java).

The current error message makes it sound like a programmer error rather than a user or rules author error.

See also bazelbuild/remote-apis#6 for the discussion that prompted this fix.

PiperOrigin-RevId: 586625265
Change-Id: I8846614917c82eff87c8495696e55b80c096c02c
tjgq added a commit to tjgq/bazel that referenced this issue Nov 30, 2023
…ng from an ActionResult.

The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java).

The current error message makes it sound like a programmer error rather than a user or rules author error.

See also bazelbuild/remote-apis#6 for the discussion that prompted this fix.

PiperOrigin-RevId: 586625265
Change-Id: I8846614917c82eff87c8495696e55b80c096c02c
tjgq added a commit to tjgq/bazel that referenced this issue Nov 30, 2023
…ng from an ActionResult.

The remote execution spec allows an action to succeed, but produce only a subset of its declared outputs, so Bazel must verify that outputs marked as mandatory have been produced. Outputs are always mandatory except for a few specialized native actions (C++ and Java).

The current error message makes it sound like a programmer error rather than a user or rules author error.

See also bazelbuild/remote-apis#6 for the discussion that prompted this fix.

PiperOrigin-RevId: 586625265
Change-Id: I8846614917c82eff87c8495696e55b80c096c02c
keertk pushed a commit to bazelbuild/bazel that referenced this issue Nov 30, 2023
…ng from an ActionResult. (#20380)

The remote execution spec allows an action to succeed, but produce only
a subset of its declared outputs, so Bazel must verify that outputs
marked as mandatory have been produced. Outputs are always mandatory
except for a few specialized native actions (C++ and Java).

The current error message makes it sound like a programmer error rather
than a user or rules author error.

See also bazelbuild/remote-apis#6 for the
discussion that prompted this fix.

PiperOrigin-RevId: 586625265
Change-Id: I8846614917c82eff87c8495696e55b80c096c02c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants