Dynamic remote execution and two-phase caching #155

erikmav · 2020-07-25T01:19:40Z

For two-phase caching and dynamic action execution.


EdSchouten · 2020-07-27T07:48:53Z

build/bazel/remote/execution/v2/remote_execution.proto

+
+  // Additional statistics for dynamic execution via
+  // [ExecuteWithDynamicRetrieval][build.bazel.remote.execution.v2.Execution.ExecuteWithDynamicRetrieval].
+  DynamicExecutionMetadata dynamic_execution = 11;


Would it make sense to use #154 for this?

What's here are the most-likely-interesting stats, cut down from the ~40 we send back right now. While iterating toward this API I was going to use the DynamicExecutionMetadata.extended_metadata Any field for the remainder. I can move all of this over to the #154 Any field, but then there's no common schema, just a bag of key-value pairs with a convention for the key names that might differ between vendors.

I'll keep this open and merge with 154 if/when it goes in.

EdSchouten · 2020-07-27T07:50:09Z

build/bazel/remote/execution/v2/remote_execution.proto

+    // One or more written strings. There is an implicit operating system
+    // dependent newline after each string. Each string may contain embedded
+    // OS-dependent newline character sequences.
+    repeated string writes = 2;


What's the use of using repeated string here? Couldn't multiple strings just use separate ConsoleOutput messages? Alternatively: concatenate them if they all belong to {stdout,stderr}?

The intent was for batching: One stdout|err boolean that leads to one or more lines of output. The most common case is for there to be a lot of stdout output and no stderr, which would result in a ConsoleOut->isStdErr=false + array-of-lines.

I was avoiding requiring concatenation in the implementation. AnyBuild's implementation does not concatenate (but could). If you feel strongly about it, requiring concatenation does simplify the proto, at the cost of the implementation having to concatenate.
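To make the trade-off concrete, here is a minimal sketch of the two options discussed above: batching consecutive lines from one stream into a single message with repeated writes, versus concatenating them into one string. The `ConsoleOutput` dataclass, its `is_stderr` field, and both helper functions are hypothetical Python stand-ins for illustration, not the proto or AnyBuild's implementation. The newline argument is also an assumption, since the proto comment only says the newline is OS-dependent.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import Iterable, List, Tuple


@dataclass
class ConsoleOutput:
    # Hypothetical stand-in for the proposed message: one stream flag,
    # one or more written strings.
    is_stderr: bool
    writes: List[str] = field(default_factory=list)


def batch_console_lines(lines: Iterable[Tuple[bool, str]]) -> List[ConsoleOutput]:
    """Group consecutive lines from the same stream into one message."""
    batches = []
    for is_stderr, group in groupby(lines, key=lambda item: item[0]):
        batches.append(ConsoleOutput(is_stderr, [text for _, text in group]))
    return batches


def concatenate(batch: ConsoleOutput, newline: str = "\n") -> str:
    """The alternative: a single string with the implicit newlines made explicit."""
    return "".join(text + newline for text in batch.writes)
```

With the repeated field the sender only appends to a list; with a single string the sender pays for the join, and the receiver can no longer tell where individual writes ended.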

(note any further discussion here needs to be reflected or copied into #173)


EdSchouten · 2020-07-27T08:01:01Z

build/bazel/remote/execution/v2/remote_execution.proto

+    // execution with a PRECONDITION_FAILED error, or it MAY return a
+    // file-not-found return code to the executing `Action` file read request
+    // which may cause the action to fail.
+    bool could_not_upload = 2;


Is there really a need to communicate this back? Couldn't the client just cancel the execution request?

There probably should be a way for the client to signal that the client encountered an error reading the file or uploading it to the CAS. I would prefer to only have one option for behaviour here, rather than a choice. I also think I'd prefer it to be an execution failure, since that's what would happen if the client knew that the file was a needed input: it would have to abort upload.

Error handling is probably important, but I think it can be removed from these individual messages (could_not_upload, file_does_not_exist, etc.) and instead pulled out into its own error message: something the client can send down the stream to indicate a fatal termination state, i.e. some way for the client to return a status (FAILED_PRECONDITION, INVALID_ARGUMENT, etc.) plus embedded error information before it closes the stream. Possibly gRPC has this already (client-side errors in bidi streams?), in which case nothing special is required here beyond documenting the stream-terminating states a client may get into (e.g. when the server requests a file that the client doesn't have or didn't tell it about, the client fails the stream with INVALID_ARGUMENT and provides X metadata for logging...)


alercah · 2020-07-28T02:45:24Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // another logically equivalent action if they hash differently.
+  //
+  // Returns a stream of
+  // [google.longrunning.Operation][google.longrunning.Operation] messages


I think it may be worth considering discarding the name field on all but the first Operation. It might be a bit of a stretch, but a chatty protocol that returns the name with every single message is likely just wasting bandwidth (though I'm uncertain how much, as it would be easily compressed).

alercah · 2020-07-28T02:48:47Z

build/bazel/remote/execution/v2/remote_execution.proto

@@ -108,12 +109,126 @@ service Execution {
    option (google.api.http) = { post: "/v2/{instance_name=**}/actions:execute" body: "*" };
  }

+  // Executes an action remotely, allowing dynamic retrieval of directory and


Generally not a fan of duplication of the docs here; I think it should explain the differences between this and Execute rather than letting the developer work those out.

alercah · 2020-07-28T02:49:10Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // Executes an action remotely, allowing dynamic retrieval of directory and
+  // file inputs from the client where the client did not upload the inputs
+  // beforehand. The client must remain connected to the worker during the
+  // duration of the action execution.


I think resumption could be supported with additional effort; a semi-stateful protocol could be more useful there.

alercah · 2020-07-28T02:50:30Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // referring to them. The worker will run the action and eventually return
+  // the result.
+  //
+  // The `Action.input_root_digest` directory tree provides a filesystem


Given that you also refer to partial trees later on in the dynamic upload protocol, I think it may be worth moving discussion of partial trees elsewhere such as to the Directory or node messages.

alercah · 2020-07-28T02:52:42Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // strings other than the root SHOULD result in an `INVALID_ARGUMENT`
+  // error.
+  //
+  // This API is bidirectional both from the client to server and from server


In #157, I proposed using uploads to the CAS as the trigger for the server to consider execution unblocked. This won't quite work for the full dynamic protocol you propose here, but perhaps there could be a way to do that that would avoid needing a separate bidi rpc (and possibly an additional bidi resumption rpc too, if that's desired)? For instance, adding an rpc that allows the client to specify an ongoing operation and "fill in" its input tree.

alercah · 2020-07-28T03:33:17Z

build/bazel/remote/execution/v2/remote_execution.proto

+      FILE_READ = 0;
+
+      // `Path` is a directory that was enumerated by the action using
+      // `DirectorySearchPattern`. If the directory is missing, this pathset


Why does DIRECTORY_ENUMERATION use DirectorySearchPattern, but DIRECTORY_SEARCH not?

DIR_SEARCH is not a regular enumeration, it's a combination of one or more specific filenames (i.e. no wildcards involved). This is typical for C++ compilers, which (in most implementations) call the filesystem enumeration APIs with exactly the filename of the header/include desired. For example, if a .cpp contains:

#include "foo.h"
#include "bar.h"

And the search paths include "dir1;dir2" the filesystem will typically see the following sequence of directory enumerations:

dir1: enum "foo.h" (it's not here)
dir2: enum "foo.h" (it's here)
dir1: enum "bar.h" (it's here so file probing stops)

In the ObservedPathSet this becomes:
Entries:
{ path="dir1", kind=DIR_SEARCH }
{ path="dir2", kind=DIR_SEARCH }
observed_accessed_file_names = [ "foo", "bar" ]

It's important in these cases to ensure that the presence or absence of foo(.h) and bar(.h) in both dir1 and dir2 are added to the observed input digest as that affects correctness of the cache hits in the face of such directory searching/probing.
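As a rough illustration of the probing sequence above, the sketch below simulates an include search over an in-memory directory listing and records which directories were searched and which names were probed. Everything here is hypothetical: the function names, the dict-based entry shape, and the `filesystem` mapping are stand-ins, and the sketch records the full probed name ("foo.h"), which may not match how the real implementation normalizes names.

```python
from typing import Dict, List, Set, Tuple


def search_includes(
    includes: List[str],
    search_dirs: List[str],
    filesystem: Dict[str, Set[str]],
) -> Tuple[List[dict], List[str]]:
    """Probe each search directory in order for each include, stopping at the first hit."""
    entries: List[dict] = []
    names: List[str] = []
    for name in includes:
        if name not in names:
            names.append(name)
        for directory in search_dirs:
            entry = {"path": directory, "kind": "DIR_SEARCH"}
            if entry not in entries:
                entries.append(entry)
            if name in filesystem.get(directory, set()):
                break  # found, so later directories are not probed for this name
    return entries, names


def presence_facts(entries, names, filesystem):
    """Presence or absence of every observed name in every searched directory."""
    return {
        (entry["path"], name): name in filesystem.get(entry["path"], set())
        for entry in entries
        for name in names
    }
```

If a later build finds foo.h in dir1 as well, the compiler would pick up a different header, and the presence facts differ from the recorded ones. That difference is what has to surface in the observed-input digest so the old cache entry is not reused.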

Is this a Windows-specific API? Why wouldn't the accessed file names be "foo" and "bar"? Is some equivalent of stat not used to query if a directory contains a file with a specific name?

There is no exact stat() equivalent on Windows, and apps do it differently. If you have an open file handle there's GetFileInformationByHandle(), but the most common way to query directory metadata is to start a directory enumeration with FindFirstFile() followed by FindNextFile(), passing an exact name string like "foo.h" instead of wildcard characters.

On Linux the sandbox would see a stat() call for a specific name but the resulting ObservedPathSet result is the same: A file search for a specific name, and tracking whether the file was present or not in the directory, to feed the summary hash to use for comparison with what the worker saw from its sandboxed API calls.

@alercah I was avoiding adding such an example into the doc comment text but I don't see a great way to add in more information. I did add a link in the doc comment for ObservedPathSet to the BuildXL discussion markdown for reference; do you think that URL would be usefully repeated on the per-field comments as well to aid understanding?

Ah I see from the comment above that you found that doc impenetrable. Let me create a better one or modify the existing and see if that works better.

alercah · 2020-07-28T03:42:12Z

build/bazel/remote/execution/v2/remote_execution.proto

+      // the hash of the string constant "P" must be added to the
+      // observed-input digest.
+      ABSENT_FILE_OR_DIRECTORY_PROBE = 5;
+    }


This approach to specifying file operations seems underspecified. There are a number of properties, like the contents of FileProperties and is_executable, which could be distinguished under this API but which aren't encoded here. On unix, probably the correct thing to do is encode that a stat() call was made and list all of the queried properties.

I'm unsure about what should be done about properties that could be in FileProperty and would fall outside that, which I imagine could include Linux extended ACLs. I guess this should also accept extension operations, then.

Discussing with the sandboxing devs. My first response would be that file presence and hash are sufficient, but maybe we've missed a nuance here.


alercah · 2020-07-28T03:45:52Z

build/bazel/remote/execution/v2/remote_execution.proto

+// (underspecified) inputs to a set of outputs.
+//
+// When comparing to the filesystem, a corresponding observed-input digest is
+// initialized. Each pathset entry is examined against the filesystem, and


initialized how?


EricBurnett

I'll put higher-level feedback in #156, just responding to the detailed syntax/semantics here.


EricBurnett · 2020-07-28T16:34:44Z

build/bazel/remote/execution/v2/remote_execution.proto

+  //   [DynamicExecutionSupport.NOT_SUPPORTED][build.bazel.remote.execution.v2.DynamicExecutionSupport.NOT_SUPPORTED]
+  //   or [DynamicExecutionSupport.UNKNOWN][build.bazel.remote.execution.v2.DynamicExecutionSupport.UNKNOWN]
+  //   is set in the [ExecutionCapabilities][build.bazel.remote.execution.v2.ExecutionCapabilities].
+  rpc ExecuteWithDynamicRetrieval(stream StreamedDynamicExecutionData) returns (stream google.longrunning.Operation) {


We should probably just get rid of Operation here...the main point of Operation is to have a standard-ish way to give a handle to a running execution, for resumption across RPCs. But above, you've forbidden resumption, in which case Operation is an even poorer fit than it already is for Execute. Probably just better to have our own message?

(Though as Alexis commented above, if you want to open the possibility of resumption APIs down the line, you may need to keep the operation name).

EricBurnett · 2020-07-28T16:37:12Z

build/bazel/remote/execution/v2/remote_execution.proto

+  //   that [TwoPhaseCacheSupport.NOT_SUPPORTED][build.bazel.remote.execution.v2.TwoPhaseCacheSupport.NOT_SUPPORTED]
+  //   or [TwoPhaseCacheSupport.UNKNOWN][build.bazel.remote.execution.v2.TwoPhaseCacheSupport.UNKNOWN]
+  //   is set in the [CacheCapabilities][build.bazel.remote.execution.v2.CacheCapabilities].
+  rpc GetPotentialActionResults(GetPotentialActionResultsRequest) returns (GetPotentialActionResultsResponse) {


It'd be nice if the names for ExecuteWithDynamicRetrieval and GetPotentialActionResults were clearly paired. Rename to GetDynamicActionResults?


EricBurnett · 2020-07-30T14:05:17Z

build/bazel/remote/execution/v2/remote_execution.proto

+// cached outputs.
+message GetPotentialActionResultsResponse {
+  // A set of possible matches to the client filesystem along with a reference to build outputs.
+  repeated ObservedPathSetAndInputDigest potential_action_matches = 1;


Require sorting by pathset (under some scheme), so a pathset with multiple candidates can be checked at once? Or use a scheme like (repeated pathset -> (repeated digest->digest pairs)), since it's likely one pathset will have multiple candidates?
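A small sketch of the second option, assuming each response entry can be flattened to a (pathset, observed-input digest, action-result digest) triple. The `PotentialMatch` tuple, the string `pathset_id`, and the `compute_digest` callback are hypothetical names used only for illustration. The point is that the client evaluates each distinct pathset against its filesystem once and then does a dictionary lookup across all candidates for it.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Optional


class PotentialMatch(NamedTuple):
    # Hypothetical flattened form of one response entry.
    pathset_id: str             # stand-in for the serialized pathset
    observed_input_digest: str
    action_result_digest: str


def group_by_pathset(matches: List[PotentialMatch]) -> Dict[str, Dict[str, str]]:
    """Build pathset -> {observed-input digest -> action-result digest}."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for match in matches:
        grouped[match.pathset_id][match.observed_input_digest] = match.action_result_digest
    return dict(grouped)


def find_hit(
    grouped: Dict[str, Dict[str, str]],
    compute_digest: Callable[[str], str],
) -> Optional[str]:
    """Evaluate each pathset once locally, then check all of its candidates."""
    for pathset_id, candidates in grouped.items():
        local_digest = compute_digest(pathset_id)  # one filesystem pass per pathset
        if local_digest in candidates:
            return candidates[local_digest]
    return None
```

Requiring a sort order on a flat list would let the client do the same grouping on the fly, but it pushes a canonical ordering of pathsets into the spec.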

EricBurnett · 2020-07-30T14:10:57Z

build/bazel/remote/execution/v2/remote_execution.proto

+// [ActionResult][build.bazel.remote.execution.v2.ActionResult] should be
+// retrieved and used.
+//
+// Generating a directory hash: Create a list of the names of directory members


It seems a little odd to me to have a new scheme for generating hashes here. We do so elsewhere in the API by building a merkle tree, and you've already detailed above the generalization required to make a partial merkle tree with unspecified nodes and whatnot. Couldn't we instead say something like "update the partial merkle tree with the hashes/contents of the given files/directories", and then re-take the merkle tree hash as the new digest to match against? That way there's no new scheme to learn/implement, only effectively updating the client-predicted paths with an oracle's suggested paths.

EricBurnett · 2020-07-30T14:12:56Z

build/bazel/remote/execution/v2/remote_execution.proto

+// a.h and b.h the list string is "a.h,b.h". Convert this string to uppercase.
+// Hash the UTF-8 string bytes.
+//
+// Discussion of how to generate pathsets from action execution filesystem


It can be reasonable to link out to extra content, but this API must stand alone without it (e.g. when that link eventually goes stale). Please make sure a reader without access to this wiki page will have all necessary information to figure out the API here.

Alexis had a comment about how the algo works (above), and also indicated this content is not understandable without a lot more context. Do you think adding in examples to the doc comments works?
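For reference, an inline example of the directory-hash rule could look roughly like the sketch below, based only on the visible part of the doc comment (comma-joined member names, uppercased, hashed as UTF-8 bytes). Two things are assumptions and not taken from the proto text: the names are sorted so the result is independent of enumeration order, and SHA-256 is used because the excerpt does not name a digest function. The function name is hypothetical.

```python
import hashlib
from typing import Iterable


def directory_membership_hash(member_names: Iterable[str]) -> str:
    """Hash the uppercased, comma-joined list of directory member names."""
    # Assumption: sort the names so enumeration order does not matter.
    # Uppercasing each name first yields the same string as uppercasing the
    # joined list, and keeps the sort itself case-insensitive.
    joined = ",".join(sorted(name.upper() for name in member_names))
    # Assumption: SHA-256; the excerpt does not say which digest function applies.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

For a directory containing a.h and b.h, the hashed string is "A.H,B.H", which matches the "a.h,b.h" list string from the comment after the uppercase step.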




erikmav · 2021-07-13T14:17:59Z

Closing this PR and related issue: Though interesting for some community members, this proposal is unlikely to be accepted into the RE protocol (even for v3) as it needs a second implementation.

AnyBuild remote execution additions for 2-phase caching and dynamic a…

bcaad59

…ction execution

erikmav requested review from agoulti, bergsieker, buchgr and ola-rozenfeld as code owners July 25, 2020 01:19

googlebot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Jul 25, 2020

This was referenced Jul 25, 2020

Two-phase caching and dynamic remote execution proposal #156

Closed

Two-Phase Cache and AnyBuild 2-way remote exec proposals - draft for MS internal erikmav/remote-apis#1

Closed

V3 idea: No longer allow Digest.size_bytes <= 0 #134

Open

erikmav changed the title ~~AnyBuild remote execution additions~~ Dynamic remote execution and two-phase caching Jul 25, 2020

EdSchouten reviewed Jul 27, 2020


alercah mentioned this pull request Jul 28, 2020

Action pipelinining and automatic queuing #157

Open

alercah reviewed Jul 28, 2020


EricBurnett requested changes Jul 30, 2020


santigl pushed a commit to santigl/remote-apis that referenced this pull request Aug 26, 2020

Upgrading to grpc-go v1.30.0 (bazelbuild#155)

6f04825

erikmav mentioned this pull request Sep 8, 2020

Add optional support for interleaved stdout/err outputs #173

Closed

erikmav added 14 commits September 8, 2020 16:22

Update ConsoleOutputs to align with broken-out PR bazelbuild#173

e177d6b

Remove enums for capabilities in favor of bools with default false

f64af9f

Use enum instead of bool for stderr/out stream

91a423f

Make dynamic directory, file hash, file content requests from worker …

5c49019

…more batchy

Remove action_digest from streamed responses

efff4de

Fix casing for ObservedPathSet fields

3f84e01

Use MUST for empty string usage in dir Merkle tree for VFS

a8d47cb

Remove missing input as an example for FAILED_PRECONDIITON

de1e29c

Use Digest in file content request to client, and clarify response flow

2e78e08

Rename directory metadata -> directory content to reduce confusion

1083251

Reduce strength of wording in relation to 2-way execution and client …

9bfdc96

…disconnection

Change complicated message name to PotentialActionMatch

631b0c7

Clarify file hash uploads in directory contents

1b86e33

Fix field ordinals

34675df

erikmav added 2 commits March 10, 2021 20:34

Merge branch 'master' into dev/erikmav/AnyBuildRE

4d7b3f0

Align directory hash info with updated implementation

02864a2

erikmav closed this Jul 13, 2021
