-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dynamic remote execution and two-phase caching #155
Conversation
|
||
// Additional statistics for dynamic execution via | ||
// [ExecuteWithDynamicRetrieval][build.bazel.remote.execution.v2.Execution.ExecuteWithDynamicRetrieval]. | ||
DynamicExecutionMetadata dynamic_execution = 11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use #154 for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's here were the most-likely-interesting stats cut down from ~40 we send back right now. In our iteration toward this API I was going to use the DynamicExecutionMetadata.extended_metadata Any field for the remainder. I can move all of this over to the #154 Any field, but then there's no common schema, just a bag of key-value pairs with a convention for the key names that might differ between vendors.
I'll keep this open and merge with 154 if/when it goes in.
// One or more written strings. There is an implicit operating system | ||
// dependent newline after each string. Each string may contain embedded | ||
// OS-dependent newline character sequences. | ||
repeated string writes = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use of using repeated string
here? Couldn't multiple strings just use separate ConsoleOutput
messages? Alternatively: concatenate them if they all belong to {stdout,stderr}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent was for batching: One stdout|err boolean that leads to one or more lines of output. The most common case is for there to be a lot of stdout output and no stderr, which would result in a ConsoleOut->isStdErr=false + array-of-lines.
I was avoiding requiring concatenation in the implementation. AnyBuild's implementation does not concatenate (but could). If you feel strongly about requiring it does simplify the proto at the cost of the implementation having to concatenate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(note any further discussion here needs to be reflected or copied into #173)
// execution with a PRECONDITION_FAILED error, or it MAY return a | ||
// file-not-found return code to the executing `Action` file read request | ||
// which may cause the action to fail. | ||
bool could_not_upload = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there really a need to communicate this back? Couldn't the client just cancel the execution request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There probably should be a way for the client to signal that the client encountered an error reading the file or uploading it to the CAS. I would prefer to only have one option for behaviour here, rather than a choice. I also think I'd prefer it to be an execution failure, since that's what would happen if the client knew that the file was a needed input: it would have to abort upload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error handling is probably important, but I think can be removed from these individual message (could_not_upload, file_does_not_exist, etc), and instead pulled out into its own error message? Something the client can send down the stream to indicate a fatal termination state - some way for the client to return a status ("FAILED_PRECONDITION, INVALID_ARGUMENT, etc) plus embedded error information, before they close the stream. Possibly gRPC has this already (client-side errors in bidi streams?), in which case nothing special is requred here, beyond documenting the stream-terminating states a client may get into (e.g. when the server requests a file that the client doesn't have or didn't tell it about, the client fails the stream with INVALID_ARGUMENT and provides X metadata for logging...)
// another logically equivalent action if they hash differently. | ||
// | ||
// Returns a stream of | ||
// [google.longrunning.Operation][google.longrunning.Operation] messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be worth considering discarding the name
field on all but the first Operation
. It might be a bit of a stretch, but a chatty protocol that returns the name with every single message is likely just wasting bandwidth (though I'm uncertain how much, as it would be easily compressed).
@@ -108,12 +109,126 @@ service Execution { | |||
option (google.api.http) = { post: "/v2/{instance_name=**}/actions:execute" body: "*" }; | |||
} | |||
|
|||
// Executes an action remotely, allowing dynamic retrieval of directory and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally not a fan of duplication of the docs here; I think it should explain the differences between this and Execute
rather than letting the developer work those out.
// Executes an action remotely, allowing dynamic retrieval of directory and | ||
// file inputs from the client where the client did not upload the inputs | ||
// beforehand. The client must remain connected to the worker during the | ||
// duration of the action execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think resumption could be supported with additional effort; a semi-stateful protocol could be more useful there.
// referring to them. The worker will run the action and eventually return | ||
// the result. | ||
// | ||
// The `Action.input_root_digest` directory tree provides a filesystem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you also refer to partial trees later on in the dynamic upload protocol, I think it may be worth moving discussion of partial trees elsewhere such as to the Directory or node messages.
// strings other than the root SHOULD result in an `INVALID_ARGUMENT` | ||
// error. | ||
// | ||
// This API is bidirectional both from the client to server and from server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #157, I proposed using uploads to the CAS as the trigger for the server to consider execution unblocked. This won't quite work for the full dynamic protocol you propose here, but perhaps there could be a way to do that that would avoid needing a separate bidi rpc (and possibly an additional bidi resumption rpc too, if that's desired)? For instance, adding an rpc that allows the client to specify an ongoing operation and "fill in" its input tree.
FILE_READ = 0; | ||
|
||
// `Path` is a directory that was enumerated by the action using | ||
// `DirectorySearchPattern`. If the directory is missing, this pathset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does DIRECTORY_ENUMERATION use DirectorySearchPattern, but DIRECTORY_SEARCH not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DIR_SEARCH is not a regular enumeration, it's a combination of one or more specific filenames (i.e. no wildcards involved). This is typical for C++ compilers, which (in most implementations) call the filesystem enumeration APIs with exactly the filename of the header/include desired. For example, if a .cpp contains:
#include "foo.h"
#include "bar.h"
And the search paths include "dir1;dir2" the filesystem will typically see the following sequence of directory enumerations:
dir1: enum "foo.h" (it's not here)
dir2: enum "foo.h" (it's here)
dir1: enum "bar.h" (it's here so file probing stops)
In the ObservedPathSet this becomes:
Entries:
{ path="dir1", kind=DIR_SEARCH }
{ path="dir2", kind=DIR_SEARCH }
observed_accessed_file_names = [ "foo", "bar" ]
It's important in these cases to ensure that the presence or absence of foo(.h) and bar(.h) in both dir1 and dir2 are added to the observed input digest as that affects correctness of the cache hits in the face of such directory searching/probing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a Windows-specific API? Why wouldn't the accessed file names be "foo" and "bar"? Is some equivalent of stat
not used to query if a directory contains a file with a specific name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stat() equivalence on Windows is inexact and apps do it differently. If you have an open file handle there's GetFileInformationByHandle() but the most common case when querying directory metadata is to start a dir enumeration with FindFirstFile() and following FindNextFile(), but using an exact name string like "foo.h" instead of using wildcard characters.
On Linux the sandbox would see a stat() call for a specific name but the resulting ObservedPathSet result is the same: A file search for a specific name, and tracking whether the file was present or not in the directory, to feed the summary hash to use for comparison with what the worker saw from its sandboxed API calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alercah I was avoiding adding such an example into the doc comment text but I don't see a great way to add in more information. I did add a link in the doc comment for ObservedPathSet to the BuildXL discussion markdown for reference; do you think that URL would be usefully repeated on the per-field comments as well to aid understanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see from the comment above that you found that doc impenetrable. Let me create a better one or modify the existing and see if that works better.
// the hash of the string constant "P" must be added to the | ||
// observed-input digest. | ||
ABSENT_FILE_OR_DIRECTORY_PROBE = 5; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach to specifying file operations seems underspecified. There are a number of properties, like the contents of FileProperties
and is_executable
, which could be distinguished under this API but which aren't encoded here. On unix, probably the correct thing to do is encode that a stat()
call was made and list all of the queried properties.
I'm unsure about what should be done about properties that could be in FileProperty
and would fall outside that, which I imagine could include Linux extended ACLs. I guess this should also accept extension operations, then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussing with the sandboxing devs. My first response would be that file presence and hash are sufficient, but maybe we've missed a nuance here.
// (underspecified) inputs to a set of outputs. | ||
// | ||
// When comparing to the filesystem, a corresponding observed-input digest is | ||
// initialized. Each pathset entry is examined against the filesystem, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initialized how?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put higher level feedback in #156 , just responding to the detailed syntax/semantics here.
// [DynamicExecutionSupport.NOT_SUPPORTED][build.bazel.remote.execution.v2.DynamicExecutionSupport.NOT_SUPPORTED] | ||
// or [DynamicExecutionSupport.UNKNOWN][build.bazel.remote.execution.v2.DynamicExecutionSupport.UNKNOWN] | ||
// is set in the [ExecutionCapabilities][build.bazel.remote.execution.v2.ExecutionCapabilities]. | ||
rpc ExecuteWithDynamicRetrieval(stream StreamedDynamicExecutionData) returns (stream google.longrunning.Operation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably just get rid of Operation here...the main point of Operation is to have a standard-ish way to give a handle to a running execution, for resumption across RPCs. But above, you've forbidden resumption, in which case Operation is an even poorer fit than it already is for Execute. Probably just better to have our own message?
(Though as Alexis commented above, if you want to open the possibility of resumption APIs down the line, you may need to keep the operation name).
// that [TwoPhaseCacheSupport.NOT_SUPPORTED][build.bazel.remote.execution.v2.TwoPhaseCacheSupport.NOT_SUPPORTED] | ||
// or [TwoPhaseCacheSupport.UNKNOWN][build.bazel.remote.execution.v2.TwoPhaseCacheSupport.UNKNOWN] | ||
// is set in the [CacheCapabilities][build.bazel.remote.execution.v2.CacheCapabilities]. | ||
rpc GetPotentialActionResults(GetPotentialActionResultsRequest) returns (GetPotentialActionResultsResponse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice if the names for ExecuteWithDynamicRetrieval and GetPotentialActionResults were clearly paired. Rename to GetDynamicActionResults?
// cached outputs. | ||
message GetPotentialActionResultsResponse { | ||
// A set of possible matches to the client filesystem along with a reference to build outputs. | ||
repeated ObservedPathSetAndInputDigest potential_action_matches = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Require sorting by pathset (under some scheme), so a pathset with multiple candidates can be checked at once? Or use a scheme like (repeated pathset -> (repeated digest->digest pairs)), since it's likely one pathset will have multiple candidates?
// [ActionResult][build.bazel.remote.execution.v2.ActionResult] should be | ||
// retrieved and used. | ||
// | ||
// Generating a directory hash: Create a list of the names of directory members |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a little odd to me to have a new scheme for generating hashes here. We do so elsewhere in the API by building a merkle tree, and you've already detailed above the generalization required to make a partial merkle tree with unspecified nodes and whatnot. Couldn't we instead say something like "update the partial merkle tree with the hashes/contents of the given files/directories", and then re-take the merkle tree hash as the new digest to match against? That way there's no new scheme to learn/implement, only effectively updating the client-predicted paths with an oracle's suggested paths.
// a.h and b.h the list string is "a.h,b.h". Convert this string to uppercase. | ||
// Hash the UTF-8 string bytes. | ||
// | ||
// Discussion of how to generate pathsets from action execution filesystem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be reasonable to link out to extra content, but this API must stand alone without it (e.g. when that link eventually goes stale). Please make sure a reader without access to this wiki page will have all necessary information to figure out the API here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alexis had a comment about how the algo works (above), and also indicated this content is not understandable without a lot more context. Do you think adding in examples to the doc comments works?
Closing this PR and related issue: Though interesting for some community members, this proposal is unlikely to be accepted into the RE protocol (even for v3) as it needs a second implementation. |
For two-phase caching and dynamic action execution.