-
-
Notifications
You must be signed in to change notification settings - Fork 646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
internal: Create dep inference request type #19001
internal: Create dep inference request type #19001
Conversation
b4ca5fb
to
a9edc6e
Compare
Could this solve #18961? 🤔 |
If the metadata included a version field, it could fix that issue in a very rudimentary way |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm but bowing out to @thejcannon who is deep into the weeds of dep parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this. A few comments. LMK if you want some help too
* Depending on the implementation, a json-serialized string `metadata` | ||
can be passed. It will be supplied to the native parser, and | ||
the string will be incorporated into the cache key. | ||
|
||
|
||
Example: | ||
result = await Get(NativeParsedPythonDependencies, NativeDependenciesRequest(input_digest, None) | ||
""" | ||
|
||
def __init__(self, digest: Digest, metadata: str | None = None) -> None: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like we'd want the constructor to take in metadata: Any = None
and have the serialization be done for the caller.
You can probably do this 2 ways:
- Move the type to Python and define
__init__
. In it usejson.dumps
. - Keep this kinda as-is but more the serialization Rust-side. Feel free to make assumptions and document them.
I think the first option is easiest.
(Also, feel free to unconditionally serialize. None
-> "null"
, so meh...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling json.dumps from rust side of things is the smallest change, is that ok?
def rules() -> Iterable[QueryRule]: | ||
# Keep in sync with `intrinsics.rs`. | ||
return (QueryRule(NativeParsedPythonDependencies, (NativeDependenciesRequest,)),) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. more of these? 😒
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not a comment directed at you, these are annoying little thorns. I'm sure it got you as well :|)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this comment is stale: there should not be any need to create QueryRule
s aligned with the intrinsics. If you can delete this, then you can probably also delete the corresponding comment in intrinisics.rs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing exploded when I removed and reran with no pantsd, so I removed this. Lets see if CI agrees.
If you want to version any of these, you can bump the protobuf field number of the key, such that you never hit for old keys. |
I'd rather not wipe the cache that clean (especially the URL cache, because that forces re-download). My thoughts include manual bump, or using a hash of the files in the relevant |
That would not invalidate other entry types in the cache: only the relevant type. Only the field number that is actually used in a protobuf is encoded into it. |
That's the idea 😅 |
Oh you're saying this would "do the right thing" in the case you were describing. Hmm. Yeah let's still move discussion to the ticket? #18961 |
src/rust/engine/src/intrinsics.rs
Outdated
let cache_key = CacheKey { | ||
key_type: CacheKeyType::DepInferenceRequest.into(), | ||
digest: Some(digest.into()), | ||
metadata: metadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, fwiw: the way that this cache key was intended to be used was that any data which actually made up the key itself would be stored in some serialized format, and then digested to make up the actual key.
I think that adding metadata
is fine, but it's a bit of a slippery slope. It's likely that it would be better to define another protobuf struct specific to language parsers, digest/store that, and then use that in the cache key.
Relatedly: the metadata should likely always include a language/implementation-specific key, so that if we end up with two implementations which are capable of parsing the same file, you don't have collisions. Or perhaps that is premature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--- a/src/rust/engine/protos/protos/pants/cache.proto
+++ b/src/rust/engine/protos/protos/pants/cache.proto
@@ -15,7 +15,11 @@ message CacheKey {
CacheKeyType key_type = 1;
build.bazel.remote.execution.v2.Digest digest = 2;
- optional string metadata = 3; // Any other metadata to accompany the tag
+}
+
+message DependencyInferenceRequest {
+ build.bazel.remote.execution.v2.Digest digest = 1;
+ optional string metadata = 2; // Any other metadata to accompany the tag
}
Is this what you're suggesting? Essentially a mirror of DownloadedUrl
, except this time we do store the bytes of this struct?
Edit: Removed my earlier comment because it was nonsense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, except you can actually give the digest
in DependencyInferenceRequest
a name: it's the input file digest. You can probably also use a more specific type for metadata, and/or let it actually be typed as multiple fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it would make sense to me to just expose the metadata types for any inference impl via pyo3 instead of:
- json serializing in python,
- to then deserialize to a struct in rust then
- and serialize to bytes.
But that (defining the type in .proto) conflicts with this want/comment a bit:
#18985 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the specific suggestion of json was in slack PM
Then to make it generic, just use a JSON string, each implementation can decide what it's schema is
but still
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since metadata is per impl (I know JS and TS wont share, anyway), is a one of
"union" appropriate? I have no exp with protocol buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, making metadata
a map seems fine, although putting any implementation agnostic fields outside of it would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A map cannot contain maps, the structure Im passing right now is
{
"root": "some/dir",
"patterns": {"a-pat/*: ["replace-me/*", "and-me-*"], ...<more-patterns>}
}
Wouldnt a message per type fit better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize the whole patterns structure can be a message containing a map, but unsure if that is what you had in mind
Please review: a551951 I've made an attempt to follow your suggestions @stuhood, I included the metadata structure I had in mind to pass the javascript dependency inference I have a draft #18985 for. I dont think there's a simpler way to define it in protobuf? @thejcannon The approach in this commit foregoes some of the niceties of staying "untyped" in python. I had a hard time seeing the point of it if we have to maintain a typed protobuf impl of the metadata anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably just define the types in Python for simplicity, no?
""" | ||
|
||
def __init__( | ||
self, digest: Digest, metadata: JavascriptInferenceMetadata | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬 At this point I'd feel more comfortable with per-language request types. Especially since below this is mirrored on the Rust side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree. Only downside is that it is more boilerplate and a lot of Native<language>DependenciesRequest
, probably.
#[derive(Clone, Debug, PartialEq)] | ||
pub struct PyNativeDependenciesRequest { | ||
pub directory_digest: DirectoryDigest, | ||
pub metadata: Option<JavascriptInferenceMetadata>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the other 😬
Do you mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key structure seems fine to me, but I'll let Josh shipit, since I haven't been following the native interface for this.
@@ -17,6 +17,21 @@ message CacheKey { | |||
build.bazel.remote.execution.v2.Digest digest = 2; | |||
} | |||
|
|||
message DependencyInferenceRequest { | |||
build.bazel.remote.execution.v2.Digest input_file_digest = 1; | |||
optional JavascriptInferenceMetadata metadata = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since only one of these fields will be set at a time, this should likely be a "union" using: https://protobuf.dev/programming-guides/proto3/#oneof
Yeah, Im not creative enough to come up with something better at the moment. Thanks for your patience |
I just pushed a version where the metadata type exposes "factory" functions for the per-inference metadata impl. That at least avoids type proliferation on the python side, and the amount of type ids required to be known by the engine. |
Somehow this feels weirder, but I like avoiding proliferation, so I'll take a closer look soon |
732d1ab
to
bd2b868
Compare
I'll resign from this and let @thejcannon drive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience, this is really great stuff!
Co-authored-by: Joshua Cannon <joshdcannon@gmail.com>
Since it will be used for the CacheKey
Hide the enum behind static methods on the python type
bd2b868
to
610e64a
Compare
### Internal * upgrade to Rust v1.70.0 ([#19228](#19228)) * Remove the last mentions of NO_TOOL_LOCKFILE. ([#19229](#19229)) * Upgrade Helm unittest ([#19220](#19220)) * Prepare `2.17.0rc0`. ([#19226](#19226)) * Use a trait for remote action result caching ([#19097](#19097)) * Port `Field` to Rust ([#19143](#19143)) * Upgrade `pyo3` to `0.19`. ([#19223](#19223)) * Prepare `2.16.0rc5`. ([#19221](#19221)) * internal: Create dep inference request type ([#19001](#19001)) * Avoid requiring Python when trying to install Python (using Pyenv) ([#19208](#19208)) * Fix secondary ownership warning semantics ([#19191](#19191)) * Bump os_pipe from 1.1.3 to 1.1.4 in /src/rust/engine ([#19202](#19202)) * Include committer date in local version identifier of unstable builds ([#19179](#19179)) * Ensure we set the AWS region. ([#19178](#19178))
…herry-pick of #19630) (#19640) This adds the file path for the file being dependency-inferred to the cache key for the Rust dep inference. Without this, the same file contents appearing multiple times in different places will give the wrong results for relative imports, because the dep inference process mashes together the file path and the relative import. The circumstances here seem most likely to occur in the real world if a file is moved, with the inference results from before the move reused for the file _after_ the move too. I've tested this manually on the reproducer in #19618 (comment), in addition to comparing the before (fails) and after (passes) behaviour of the new test. To try to make this code more resilient to this sort of mistake, I've removed the top level `PreparedInferenceRequest.path` key, in favour of using the new `input_file_path` on the request protobuf struct, at the cost of an additional `String` -> `PathBuf` conversion when doing the inference. Fixes #19618 This is a cherry pick of #19630, but is essentially a slightly weird rewrite that partially cherry-picks #19001 too, by copying over the whole `DependencyInferenceRequest` protobuf type as is (even with some fields that are unused) because that's the easiest way to generate appropriate bytes for turning into the digest for use in the cache-key. I think it's okay to have this "weird" behaviour in the 2.17 branch, with the real/normal code in main/2.18?
Boilerplate to introduce a request type for native dependency inference suggested here: #18985 (comment)
Adds a "metadata" field to the CacheKey type. Is it naive to store the string as is in the key? Afaict, the
PersistentCache
is addressed by the hash of the bytes of the key, not the key bytes themselves -> Should be fine.