
internal: Create dep inference request type #19001

Merged: 10 commits merged into pantsbuild:main from add/create-inference-request-type on Jun 1, 2023

Conversation

@tobni (Contributor) commented May 13, 2023:

Boilerplate to introduce a request type for native dependency inference suggested here: #18985 (comment)

Adds a "metadata" field to the CacheKey type. Is it naive to store the string as-is in the key? AFAICT, the PersistentCache is addressed by the hash of the key's bytes, not the key bytes themselves, so it should be fine.

@tobni tobni requested review from kaos and thejcannon as code owners May 13, 2023 16:47
@tobni tobni force-pushed the add/create-inference-request-type branch from b4ca5fb to a9edc6e on May 13, 2023 16:50
@tobni (Contributor, Author) commented May 13, 2023:

Could this solve #18961? 🤔

@thejcannon (Member) commented:

If the metadata included a version field, it could fix that issue in a very rudimentary way

@kaos (Member) left a review comment:

lgtm but bowing out to @thejcannon who is deep into the weeds of dep parsing.

@thejcannon (Member) left a review comment:

Thanks for tackling this. A few comments. LMK if you want some help too

Comment on lines 252 to 261
* Depending on the implementation, a json-serialized string `metadata`
can be passed. It will be supplied to the native parser, and
the string will be incorporated into the cache key.


Example:
result = await Get(NativeParsedPythonDependencies, NativeDependenciesRequest(input_digest, None))
"""

def __init__(self, digest: Digest, metadata: str | None = None) -> None: ...
Member:

Feels like we'd want the constructor to take in metadata: Any = None and have the serialization be done for the caller.

You can probably do this in two ways:

  • Move the type to Python and define __init__. In it, use json.dumps.
  • Keep this roughly as-is but move the serialization to the Rust side. Feel free to make assumptions and document them.

I think the first option is easiest.
(Also, feel free to unconditionally serialize. None -> "null", so meh...)
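
A minimal sketch of the first option, with illustrative names only (not necessarily the final shape this PR lands on): the request type lives in Python, its __init__ accepts any JSON-compatible metadata, and json.dumps runs on behalf of the caller.

```python
# Sketch only: the field names mirror the discussion above, but the usual
# Pants plumbing (frozen dataclass, rule registration, etc.) is omitted.
from __future__ import annotations

import json
from typing import Any

from pants.engine.fs import Digest


class NativeDependenciesRequest:
    def __init__(self, digest: Digest, metadata: Any = None) -> None:
        self.digest = digest
        # Unconditionally serialize: `None` becomes the JSON literal "null".
        # sort_keys keeps the bytes stable for equal-but-reordered dicts.
        self.metadata = json.dumps(metadata, sort_keys=True)
```

Sorting keys matters here because the serialized string ultimately feeds the cache key, so semantically identical metadata should always produce identical bytes.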

@tobni (Contributor, Author):

Calling json.dumps from the Rust side of things is the smallest change; is that OK?

Comment on lines 24 to 26
def rules() -> Iterable[QueryRule]:
    # Keep in sync with `intrinsics.rs`.
    return (QueryRule(NativeParsedPythonDependencies, (NativeDependenciesRequest,)),)
Member:

Ugh, more of these? 😒

Member:

(Not a comment directed at you, these are annoying little thorns. I'm sure it got you as well :|)

Member:

I think that this comment is stale: there should not be any need to create QueryRules aligned with the intrinsics. If you can delete this, then you can probably also delete the corresponding comment in intrinsics.rs.

@tobni (Contributor, Author), May 15, 2023:

Nothing exploded when I removed it and reran without pantsd, so I removed this. Let's see if CI agrees.

Resolved (outdated) review thread on src/rust/engine/protos/protos/pants/cache.proto
@stuhood (Member) commented May 15, 2023:

> If the metadata included a version field, it could fix that issue in a very rudimentary way

If you want to version any of these, you can bump the protobuf field number of the key, such that you never hit for old keys.

@thejcannon (Member) commented:

> If you want to version any of these, you can bump the protobuf field number of the key, such that you never hit for old keys.

I'd rather not wipe the cache that clean (especially the URL cache, because that forces re-download). My thoughts include a manual bump, or using a hash of the files in the relevant dep_inference module. Let's discuss on that ticket though.

@stuhood (Member) commented May 15, 2023:

> > If you want to version any of these, you can bump the protobuf field number of the key, such that you never hit for old keys.
>
> I'd rather not wipe the cache that clean (especially the URL cache, because that forces re-download). My thoughts include a manual bump, or using a hash of the files in the relevant dep_inference module. Let's discuss on that ticket though.

That would not invalidate other entry types in the cache: only the relevant type. Only the field number that is actually used in a protobuf is encoded into it.

@thejcannon (Member) commented:

> That would not invalidate other entry types in the cache: only the relevant type.

That's the idea 😅

@thejcannon (Member) commented:

Oh, you're saying this would "do the right thing" in the case you were describing. Hmm. Yeah, let's still move discussion to the ticket? #18961
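
For reference, a rough sketch of the "hash of the files in the relevant dep_inference module" idea floated above. Everything here is hypothetical (paths, function name); it is not what this PR implements.

```python
import hashlib
from pathlib import Path


def dep_inference_version(module_dir: Path) -> str:
    """Derive a version string from the dep-inference sources themselves.

    Feeding this into the metadata would invalidate only dep-inference cache
    entries when the parser code changes, leaving e.g. the URL cache intact.
    """
    digest = hashlib.sha256()
    for path in sorted(module_dir.rglob("*")):
        if path.is_file():
            # Include the relative path so renames also change the version.
            digest.update(path.relative_to(module_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```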

Comment on lines 786 to 789
let cache_key = CacheKey {
  key_type: CacheKeyType::DepInferenceRequest.into(),
  digest: Some(digest.into()),
  metadata: metadata,
Member:

So, fwiw: the way that this cache key was intended to be used was that any data which actually made up the key itself would be stored in some serialized format, and then digested to make up the actual key.

I think that adding metadata is fine, but it's a bit of a slippery slope. It's likely that it would be better to define another protobuf struct specific to language parsers, digest/store that, and then use that in the cache key.


Relatedly: the metadata should likely always include a language/implementation-specific key, so that if we end up with two implementations which are capable of parsing the same file, you don't have collisions. Or perhaps that is premature.
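
A conceptual sketch of the pattern being described, using JSON and hashlib in place of the engine's protobuf/Rust machinery (all names are illustrative): everything that identifies the request, including an implementation-specific key, gets serialized, and only the digest of those bytes is used as the cache key.

```python
import hashlib
import json


def dep_inference_cache_key(input_file_digest: str, impl: str, metadata: dict) -> bytes:
    # Serialize the full request, then hash it; the persistent cache is
    # addressed by this fixed-size digest rather than the raw, variable-length
    # request bytes.
    serialized = json.dumps(
        {
            "impl": impl,  # e.g. "javascript"; avoids collisions between parsers
            "input_file_digest": input_file_digest,
            "metadata": metadata,
        },
        sort_keys=True,
    ).encode()
    return hashlib.sha256(serialized).digest()
```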

@tobni (Contributor, Author), May 15, 2023:

--- a/src/rust/engine/protos/protos/pants/cache.proto
+++ b/src/rust/engine/protos/protos/pants/cache.proto
@@ -15,7 +15,11 @@ message CacheKey {
   CacheKeyType key_type = 1;
 
   build.bazel.remote.execution.v2.Digest digest = 2;
-  optional string metadata = 3;  // Any other metadata to accompany the tag
+}
+
+message DependencyInferenceRequest {
+  build.bazel.remote.execution.v2.Digest digest = 1;
+  optional string metadata = 2;  // Any other metadata to accompany the tag
 }

Is this what you're suggesting? Essentially a mirror of DownloadedUrl, except this time we do store the bytes of this struct?

Edit: Removed my earlier comment because it was nonsense.

Member:

Right, except you can actually give the digest in DependencyInferenceRequest a name: it's the input file digest. You can probably also use a more specific type for metadata, and/or let it actually be typed as multiple fields.

@tobni (Contributor, Author), May 15, 2023:

Then it would make sense to me to just expose the metadata types for any inference impl via pyo3, instead of:

  1. JSON-serializing in Python,
  2. deserializing to a struct in Rust, and then
  3. serializing to bytes.

But that (defining the type in .proto) conflicts with this want/comment a bit:
#18985 (comment)

@tobni (Contributor, Author):

Sorry, the specific suggestion of JSON was in a Slack PM:

> Then to make it generic, just use a JSON string; each implementation can decide what its schema is

but still

@tobni (Contributor, Author):

Since metadata is per impl (I know JS and TS won't share, anyway), is a oneof "union" appropriate? I have no experience with protocol buffers.

Member:

In that case, making metadata a map seems fine, although putting any implementation-agnostic fields outside of it would be good.

@tobni (Contributor, Author), May 15, 2023:

A map cannot contain maps; the structure I'm passing right now is

{
  "root": "some/dir",
  "patterns": {"a-pat/*": ["replace-me/*", "and-me-*"], ...<more-patterns>}
}

Wouldn't a message per type fit better?

@tobni (Contributor, Author), May 15, 2023:

I realize the whole patterns structure can be a message containing a map, but I'm unsure if that is what you had in mind.

@tobni (Contributor, Author) commented May 16, 2023:

Please review: a551951

I've made an attempt to follow your suggestions, @stuhood: I included the metadata structure I had in mind to pass to the JavaScript dependency inference I have a draft (#18985) for. I don't think there's a simpler way to define it in protobuf?

@thejcannon The approach in this commit forgoes some of the niceties of staying "untyped" in Python. I had a hard time seeing the point of that if we have to maintain a typed protobuf impl of the metadata anyway.

@tobni tobni requested review from stuhood and thejcannon May 18, 2023 23:27
@thejcannon (Member) left a review comment:

We could probably just define the types in Python for simplicity, no?

"""

def __init__(
    self, digest: Digest, metadata: JavascriptInferenceMetadata | None = None
Member:

😬 At this point I'd feel more comfortable with per-language request types. Especially since below this is mirrored on the Rust side.

@tobni (Contributor, Author):

I think I agree. The only downside is that it's more boilerplate and, probably, a lot of Native<language>DependenciesRequest types.

#[derive(Clone, Debug, PartialEq)]
pub struct PyNativeDependenciesRequest {
  pub directory_digest: DirectoryDigest,
  pub metadata: Option<JavascriptInferenceMetadata>,
Member:

Here's the other 😬

@tobni (Contributor, Author) commented May 19, 2023:

> We could probably just define the types in Python for simplicity, no?

Do you mean JavascriptInferenceMetadata? The contained data still needs to be converted into the cache key types at some point, so I figured a constructor defined close to the "schema" was as simple as could be. What do you envision being simpler?

@stuhood (Member) left a review comment:

The key structure seems fine to me, but I'll let Josh ship it, since I haven't been following the native interface for this.

@@ -17,6 +17,21 @@ message CacheKey {
  build.bazel.remote.execution.v2.Digest digest = 2;
}

message DependencyInferenceRequest {
  build.bazel.remote.execution.v2.Digest input_file_digest = 1;
  optional JavascriptInferenceMetadata metadata = 2;
Member:

Since only one of these fields will be set at a time, this should likely be a "union" using: https://protobuf.dev/programming-guides/proto3/#oneof

@thejcannon (Member) commented:

Yeah, I'm not creative enough to come up with something better at the moment.
I'll review closely Monday/Tuesday.

Thanks for your patience

@tobni (Contributor, Author) commented May 20, 2023:

> Yeah, I'm not creative enough to come up with something better at the moment. I'll review closely Monday/Tuesday.
>
> Thanks for your patience

I just pushed a version where the metadata type exposes "factory" functions for the per-inference metadata impl. That at least avoids type proliferation on the Python side, and reduces the number of type IDs the engine needs to know about.
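
Roughly the shape being described, sketched at the Python level (the real type is backed by the Rust engine, and these field names are illustrative): a single metadata type, with one factory classmethod per inference implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceMetadata:
    language: str
    package_root: str
    import_patterns: dict[str, tuple[str, ...]]

    @classmethod
    def javascript(
        cls, package_root: str, import_patterns: dict[str, tuple[str, ...]]
    ) -> InferenceMetadata:
        # A factory per language keeps a single metadata type on the Python
        # side; the engine only needs to know about InferenceMetadata itself.
        return cls("javascript", package_root, import_patterns)
```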

@thejcannon (Member) commented:

Somehow this feels weirder, but I like avoiding proliferation, so :shipit:

I'll take a closer look soon

@tobni tobni force-pushed the add/create-inference-request-type branch from 732d1ab to bd2b868 on May 28, 2023 12:58
@stuhood (Member) commented May 30, 2023:

I'll resign from this and let @thejcannon drive.

@stuhood stuhood requested review from stuhood and removed request for stuhood May 30, 2023 22:07
@thejcannon (Member) left a review comment:

Thanks for your patience, this is really great stuff!

@tobni tobni added the category:internal label (CI, fixes for not-yet-released features, etc.) Jun 1, 2023
@tobni tobni force-pushed the add/create-inference-request-type branch from bd2b868 to 610e64a on June 1, 2023 12:35
@tobni tobni enabled auto-merge (squash) June 1, 2023 12:41
@tobni tobni merged commit 5becc13 into pantsbuild:main Jun 1, 2023
@wisechengyi wisechengyi mentioned this pull request Jun 3, 2023
wisechengyi added a commit that referenced this pull request Jun 4, 2023
### Internal

* upgrade to Rust v1.70.0
([#19228](#19228))

* Remove the last mentions of NO_TOOL_LOCKFILE.
([#19229](#19229))

* Upgrade Helm unittest
([#19220](#19220))

* Prepare `2.17.0rc0`.
([#19226](#19226))

* Use a trait for remote action result caching
([#19097](#19097))

* Port `Field` to Rust
([#19143](#19143))

* Upgrade `pyo3` to `0.19`.
([#19223](#19223))

* Prepare `2.16.0rc5`.
([#19221](#19221))

* internal: Create dep inference request type
([#19001](#19001))

* Avoid requiring Python when trying to install Python (using Pyenv)
([#19208](#19208))

* Fix secondary ownership warning semantics
([#19191](#19191))

* Bump os_pipe from 1.1.3 to 1.1.4 in /src/rust/engine
([#19202](#19202))

* Include committer date in local version identifier of unstable builds
([#19179](#19179))

* Ensure we set the AWS region.
([#19178](#19178))
thejcannon pushed a commit that referenced this pull request Aug 22, 2023
…herry-pick of #19630) (#19640)

This adds the file path for the file being dependency-inferred to the
cache key for the Rust dep inference. Without this, the same file
contents appearing multiple times in different places will give the
wrong results for relative imports, because the dep inference process
mashes together the file path and the relative import.

The circumstances here seem most likely to occur in the real world if a
file is moved, with the inference results from before the move reused
for the file _after_ the move too.

I've tested this manually on the reproducer in
#19618 (comment),
in addition to comparing the before (fails) and after (passes) behaviour
of the new test.

To try to make this code more resilient to this sort of mistake, I've
removed the top level `PreparedInferenceRequest.path` key, in favour of
using the new `input_file_path` on the request protobuf struct, at the
cost of an additional `String` -> `PathBuf` conversion when doing the
inference.

Fixes #19618 

This is a cherry pick of #19630, but is essentially a slightly weird
rewrite that partially cherry-picks #19001 too, by copying over the
whole `DependencyInferenceRequest` protobuf type as is (even with some
fields that are unused) because that's the easiest way to generate
appropriate bytes for turning into the digest for use in the cache-key.
I think it's okay to have this "weird" behaviour in the 2.17 branch,
with the real/normal code in main/2.18?