
internal: Create dep inference request type #19001

Merged: 10 commits merged into pantsbuild:main from add/create-inference-request-type on Jun 1, 2023

Conversation

@tobni (Contributor) commented May 13, 2023:

Boilerplate to introduce a request type for native dependency inference suggested here: #18985 (comment)

Adds a "metadata" field to the CacheKey type. Is it naive to store the string as-is in the key? AFAICT, the PersistentCache is addressed by the hash of the key's bytes, not the key bytes themselves, so it should be fine.

@tobni tobni requested review from kaos and thejcannon as code owners May 13, 2023 16:47
@tobni tobni force-pushed the add/create-inference-request-type branch from b4ca5fb to a9edc6e on May 13, 2023 16:50
@tobni (Contributor, Author) commented May 13, 2023:

Could this solve #18961? 🤔

@thejcannon (Member) commented:

If the metadata included a version field, it could fix that issue in a very rudimentary way

@kaos (Member) left a review comment:

lgtm but bowing out to @thejcannon who is deep into the weeds of dep parsing.

@thejcannon (Member) left a review comment:

Thanks for tackling this. A few comments. LMK if you want some help too

Comment on lines 252 to 261
* Depending on the implementation, a json-serialized string `metadata`
can be passed. It will be supplied to the native parser, and
the string will be incorporated into the cache key.


Example:
result = await Get(NativeParsedPythonDependencies, NativeDependenciesRequest(input_digest, None))
"""

def __init__(self, digest: Digest, metadata: str | None = None) -> None: ...
Member:

Feels like we'd want the constructor to take in metadata: Any = None and have the serialization be done for the caller.

You can probably do this in two ways:

  • Move the type to Python and define __init__. In it, use json.dumps.
  • Keep this roughly as-is but move the serialization to the Rust side. Feel free to make assumptions and document them.

I think the first option is easiest.
(Also, feel free to unconditionally serialize. None -> "null", so meh...)
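
A minimal sketch of the first option, with illustrative names only (not necessarily the final shape this PR lands on): the request type lives in Python, its __init__ accepts any JSON-compatible metadata, and json.dumps runs on behalf of the caller.

```python
# Sketch only: the field names mirror the discussion above, but the usual
# Pants plumbing (frozen dataclass, rule registration, etc.) is omitted.
from __future__ import annotations

import json
from typing import Any

from pants.engine.fs import Digest


class NativeDependenciesRequest:
    def __init__(self, digest: Digest, metadata: Any = None) -> None:
        self.digest = digest
        # Unconditionally serialize: `None` becomes the JSON literal "null".
        # sort_keys keeps the bytes stable for equal-but-reordered dicts.
        self.metadata = json.dumps(metadata, sort_keys=True)
```

Sorting keys matters here because the serialized string ultimately feeds the cache key, so semantically identical metadata should always produce identical bytes.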

@tobni (Contributor, Author):

Calling json.dumps from the Rust side of things is the smallest change; is that OK?

Comment on lines 24 to 26
def rules() -> Iterable[QueryRule]:
    # Keep in sync with `intrinsics.rs`.
    return (QueryRule(NativeParsedPythonDependencies, (NativeDependenciesRequest,)),)
Member:

Ugh, more of these? 😒

Member:

(Not a comment directed at you, these are annoying little thorns. I'm sure it got you as well :|)

Member:

I think that this comment is stale: there should not be any need to create QueryRules aligned with the intrinsics. If you can delete this, then you can probably also delete the corresponding comment in intrinsics.rs.

@tobni (Contributor, Author), May 15, 2023:

Nothing exploded when I removed it and reran without pantsd, so I removed this. Let's see if CI agrees.

Resolved (outdated) review thread on src/rust/engine/protos/protos/pants/cache.proto
@stuhood (Member) commented May 15, 2023:

> If the metadata included a version field, it could fix that issue in a very rudimentary way

If you want to version any of these, you can bump the protobuf field number of the key, such that you never hit for old keys.

@thejcannon (Member) commented:

> If you want to version any of these, you can bump the protobuf field number of the key, such that you never hit for old keys.

I'd rather not wipe the cache that clean (especially the URL cache, because that forces re-download). My thoughts include a manual bump, or using a hash of the files in the relevant dep_inference module. Let's discuss on that ticket though.

@stuhood (Member) commented May 15, 2023:

> > If you want to version any of these, you can bump the protobuf field number of the key, such that you never hit for old keys.
>
> I'd rather not wipe the cache that clean (especially the URL cache, because that forces re-download). My thoughts include a manual bump, or using a hash of the files in the relevant dep_inference module. Let's discuss on that ticket though.

That would not invalidate other entry types in the cache: only the relevant type. Only the field number that is actually used in a protobuf is encoded into it.

@thejcannon (Member) commented:

> That would not invalidate other entry types in the cache: only the relevant type.

That's the idea 😅

@thejcannon (Member) commented:

Oh, you're saying this would "do the right thing" in the case you were describing. Hmm. Yeah, let's still move discussion to the ticket? #18961
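
For reference, a rough sketch of the "hash of the files in the relevant dep_inference module" idea floated above. Everything here is hypothetical (paths, function name); it is not what this PR implements.

```python
import hashlib
from pathlib import Path


def dep_inference_version(module_dir: Path) -> str:
    """Derive a version string from the dep-inference sources themselves.

    Feeding this into the metadata would invalidate only dep-inference cache
    entries when the parser code changes, leaving e.g. the URL cache intact.
    """
    digest = hashlib.sha256()
    for path in sorted(module_dir.rglob("*")):
        if path.is_file():
            # Include the relative path so renames also change the version.
            digest.update(path.relative_to(module_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```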

Comment on lines 786 to 789
let cache_key = CacheKey {
  key_type: CacheKeyType::DepInferenceRequest.into(),
  digest: Some(digest.into()),
  metadata: metadata,
Member:

So, fwiw: the way that this cache key was intended to be used was that any data which actually made up the key itself would be stored in some serialized format, and then digested to make up the actual key.

I think that adding metadata is fine, but it's a bit of a slippery slope. It's likely that it would be better to define another protobuf struct specific to language parsers, digest/store that, and then use that in the cache key.


Relatedly: the metadata should likely always include a language/implementation-specific key, so that if we end up with two implementations which are capable of parsing the same file, you don't have collisions. Or perhaps that is premature.
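
A conceptual sketch of the pattern being described, using JSON and hashlib in place of the engine's protobuf/Rust machinery (all names are illustrative): everything that identifies the request, including an implementation-specific key, gets serialized, and only the digest of those bytes is used as the cache key.

```python
import hashlib
import json


def dep_inference_cache_key(input_file_digest: str, impl: str, metadata: dict) -> bytes:
    # Serialize the full request, then hash it; the persistent cache is
    # addressed by this fixed-size digest rather than the raw, variable-length
    # request bytes.
    serialized = json.dumps(
        {
            "impl": impl,  # e.g. "javascript"; avoids collisions between parsers
            "input_file_digest": input_file_digest,
            "metadata": metadata,
        },
        sort_keys=True,
    ).encode()
    return hashlib.sha256(serialized).digest()
```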

@tobni (Contributor, Author), May 15, 2023:

--- a/src/rust/engine/protos/protos/pants/cache.proto
+++ b/src/rust/engine/protos/protos/pants/cache.proto
@@ -15,7 +15,11 @@ message CacheKey {
   CacheKeyType key_type = 1;
 
   build.bazel.remote.execution.v2.Digest digest = 2;
-  optional string metadata = 3;  // Any other metadata to accompany the tag
+}
+
+message DependencyInferenceRequest {
+  build.bazel.remote.execution.v2.Digest digest = 1;
+  optional string metadata = 2;  // Any other metadata to accompany the tag
 }

Is this what you're suggesting? Essentially a mirror of DownloadedUrl, except this time we do store the bytes of this struct?

Edit: Removed my earlier comment because it was nonsense.

Member:

Right, except you can actually give the digest in DependencyInferenceRequest a name: it's the input file digest. You can probably also use a more specific type for metadata, and/or let it actually be typed as multiple fields.

@tobni (Contributor, Author), May 15, 2023:

Then it would make sense to me to just expose the metadata types for any inference impl via pyo3, instead of:

  1. JSON-serializing in Python,
  2. deserializing to a struct in Rust, and then
  3. serializing to bytes.

But that (defining the type in .proto) conflicts with this want/comment a bit:
#18985 (comment)

@tobni (Contributor, Author):

Sorry, the specific suggestion of JSON was in a Slack PM:

> Then to make it generic, just use a JSON string; each implementation can decide what its schema is

but still

@tobni (Contributor, Author):

Since metadata is per impl (I know JS and TS won't share, anyway), is a oneof "union" appropriate? I have no experience with protocol buffers.

Member:

In that case, making metadata a map seems fine, although putting any implementation-agnostic fields outside of it would be good.

@tobni (Contributor, Author), May 15, 2023:

A map cannot contain maps; the structure I'm passing right now is

{
  "root": "some/dir",
  "patterns": {"a-pat/*": ["replace-me/*", "and-me-*"], ...<more-patterns>}
}

Wouldn't a message per type fit better?

@tobni (Contributor, Author), May 15, 2023:

I realize the whole patterns structure can be a message containing a map, but I'm unsure if that is what you had in mind.

@tobni (Contributor, Author) commented May 16, 2023:

Please review: a551951

I've made an attempt to follow your suggestions, @stuhood: I included the metadata structure I had in mind to pass to the JavaScript dependency inference I have a draft (#18985) for. I don't think there's a simpler way to define it in protobuf?

@thejcannon The approach in this commit forgoes some of the niceties of staying "untyped" in Python. I had a hard time seeing the point of that if we have to maintain a typed protobuf impl of the metadata anyway.

@tobni tobni requested review from stuhood and thejcannon May 18, 2023 23:27
@thejcannon (Member) left a review comment:

We could probably just define the types in Python for simplicity, no?

"""

def __init__(
    self, digest: Digest, metadata: JavascriptInferenceMetadata | None = None
Member:

😬 At this point I'd feel more comfortable with per-language request types. Especially since below this is mirrored on the Rust side.

@tobni (Contributor, Author):

I think I agree. The only downside is that it's more boilerplate and, probably, a lot of Native<language>DependenciesRequest types.

#[derive(Clone, Debug, PartialEq)]
pub struct PyNativeDependenciesRequest {
  pub directory_digest: DirectoryDigest,
  pub metadata: Option<JavascriptInferenceMetadata>,
Member:

Here's the other 😬

@tobni (Contributor, Author) commented May 19, 2023:

> We could probably just define the types in Python for simplicity, no?

Do you mean JavascriptInferenceMetadata? The contained data still needs to be converted into the cache key types at some point, so I figured a constructor defined close to the "schema" was as simple as could be. What do you envision being simpler?

@stuhood (Member) left a review comment:

The key structure seems fine to me, but I'll let Josh ship it, since I haven't been following the native interface for this.

@@ -17,6 +17,21 @@ message CacheKey {
  build.bazel.remote.execution.v2.Digest digest = 2;
}

message DependencyInferenceRequest {
  build.bazel.remote.execution.v2.Digest input_file_digest = 1;
  optional JavascriptInferenceMetadata metadata = 2;
Member:

Since only one of these fields will be set at a time, this should likely be a "union" using: https://protobuf.dev/programming-guides/proto3/#oneof

@thejcannon (Member) commented:

Yeah, I'm not creative enough to come up with something better at the moment.
I'll review closely Monday/Tuesday.

Thanks for your patience

@tobni (Contributor, Author) commented May 20, 2023:

> Yeah, I'm not creative enough to come up with something better at the moment. I'll review closely Monday/Tuesday.
>
> Thanks for your patience

I just pushed a version where the metadata type exposes "factory" functions for the per-inference metadata impl. That at least avoids type proliferation on the Python side, and reduces the number of type IDs the engine needs to know about.
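
Roughly the shape being described, sketched at the Python level (the real type is backed by the Rust engine, and these field names are illustrative): a single metadata type, with one factory classmethod per inference implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceMetadata:
    language: str
    package_root: str
    import_patterns: dict[str, tuple[str, ...]]

    @classmethod
    def javascript(
        cls, package_root: str, import_patterns: dict[str, tuple[str, ...]]
    ) -> InferenceMetadata:
        # A factory per language keeps a single metadata type on the Python
        # side; the engine only needs to know about InferenceMetadata itself.
        return cls("javascript", package_root, import_patterns)
```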

@thejcannon (Member) commented:

Somehow this feels weirder, but I like avoiding proliferation, so :shipit:

I'll take a closer look soon

@tobni tobni force-pushed the add/create-inference-request-type branch from 732d1ab to bd2b868 on May 28, 2023 12:58
@stuhood (Member) commented May 30, 2023:

I'll resign from this and let @thejcannon drive.

@stuhood stuhood requested review from stuhood and removed request for stuhood May 30, 2023 22:07
@thejcannon (Member) left a review comment:

Thanks for your patience, this is really great stuff!

@tobni tobni added the category:internal label (CI, fixes for not-yet-released features, etc.) Jun 1, 2023
@tobni tobni force-pushed the add/create-inference-request-type branch from bd2b868 to 610e64a on June 1, 2023 12:35
@tobni tobni enabled auto-merge (squash) June 1, 2023 12:41
@tobni tobni merged commit 5becc13 into pantsbuild:main Jun 1, 2023
@wisechengyi wisechengyi mentioned this pull request Jun 3, 2023
wisechengyi added a commit that referenced this pull request Jun 4, 2023
### Internal

* upgrade to Rust v1.70.0
([#19228](#19228))

* Remove the last mentions of NO_TOOL_LOCKFILE.
([#19229](#19229))

* Upgrade Helm unittest
([#19220](#19220))

* Prepare `2.17.0rc0`.
([#19226](#19226))

* Use a trait for remote action result caching
([#19097](#19097))

* Port `Field` to Rust
([#19143](#19143))

* Upgrade `pyo3` to `0.19`.
([#19223](#19223))

* Prepare `2.16.0rc5`.
([#19221](#19221))

* internal: Create dep inference request type
([#19001](#19001))

* Avoid requiring Python when trying to install Python (using Pyenv)
([#19208](#19208))

* Fix secondary ownership warning semantics
([#19191](#19191))

* Bump os_pipe from 1.1.3 to 1.1.4 in /src/rust/engine
([#19202](#19202))

* Include committer date in local version identifier of unstable builds
([#19179](#19179))

* Ensure we set the AWS region.
([#19178](#19178))
thejcannon pushed a commit that referenced this pull request Aug 22, 2023
…herry-pick of #19630) (#19640)

This adds the file path for the file being dependency-inferred to the
cache key for the Rust dep inference. Without this, the same file
contents appearing multiple times in different places will give the
wrong results for relative imports, because the dep inference process
mashes together the file path and the relative import.

The circumstances here seem most likely to occur in the real world if a
file is moved, with the inference results from before the move reused
for the file _after_ the move too.

I've tested this manually on the reproducer in
#19618 (comment),
in addition to comparing the before (fails) and after (passes) behaviour
of the new test.

To try to make this code more resilient to this sort of mistake, I've
removed the top level `PreparedInferenceRequest.path` key, in favour of
using the new `input_file_path` on the request protobuf struct, at the
cost of an additional `String` -> `PathBuf` conversion when doing the
inference.

Fixes #19618 

This is a cherry pick of #19630, but is essentially a slightly weird
rewrite that partially cherry-picks #19001 too, by copying over the
whole `DependencyInferenceRequest` protobuf type as is (even with some
fields that are unused) because that's the easiest way to generate
appropriate bytes for turning into the digest for use in the cache-key.
I think it's okay to have this "weird" behaviour in the 2.17 branch,
with the real/normal code in main/2.18?