Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kick off Red-knot #10849

Merged
merged 59 commits into from
Apr 27, 2024
Merged

Kick off Red-knot #10849

merged 59 commits into from
Apr 27, 2024

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Apr 9, 2024

The beginning of multifile analysis. We'll eventually merge this with ruff but are using a dedicated crate to flesh out the basic infrastructure first.

Copy link

codspeed-hq bot commented Apr 9, 2024

CodSpeed Performance Report

Merging #10849 will not alter performance

Comparing red-knot (dd4748b) with main (845ba7c)

Summary

✅ 30 untouched benchmarks

crates/red_knot/src/check.rs Outdated Show resolved Hide resolved
crates/red_knot/src/check.rs Outdated Show resolved Hide resolved
@MichaReiser MichaReiser force-pushed the red-knot branch 2 times, most recently from 49dbe6b to 34fa6fd Compare April 10, 2024 14:01
crates/red_knot/Cargo.toml Outdated Show resolved Hide resolved
#[derive(Debug, Eq, PartialEq, Hash)]
pub struct FileAstId<N: HasAstId> {
ast_id: AstId,
_marker: PhantomData<fn() -> N>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this marker?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust requires for structs that each type parameter is used by at least one field. Now, AstId isn't generic over N.

We can work around this by using PhantomData. PhantomData has no runtime cost (it compiles down to a zero size type) and it's only purpose is to capture the type N.

You can think of the pattern applied here as compile-time only generics similar to Java where the generic arguments are erased at runtime but we want them at compile time to catch typing errors (e.g. we want to prevent that you use a AstId for an IfStmt to load a FunctionDef).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing I find a bit odd here is the fn() -> N; why is it not just PhantomData<N>? It seems like putting the typevar in return position like that might be intended to make TypedAstId contravariant in N rather than variant? But since Rust doesn't have struct subtyping, I'm not really sure what that would even mean; I thought Rust only really has variance for lifetimes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is (and I was surprised by it) that Phantom<Data> made the type non-copyable, NonEq etc.

I guess the alternative would have been to implement all these types manually https://users.rust-lang.org/t/how-to-copy-phantomdata-of-un-clone-able-types/82229

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh weird! So basically it's just that Rust knows how to implement Eq for fn() -> N but not necessarily for N?

crates/red_knot/src/ast_ids.rs Outdated Show resolved Hide resolved
crates/red_knot/src/hir.rs Show resolved Hide resolved
crates/red_knot/src/ast_ids.rs Outdated Show resolved Hide resolved
crates/red_knot/src/db.rs Outdated Show resolved Hide resolved
crates/red_knot/src/symbols.rs Outdated Show resolved Hide resolved
crates/red_knot/src/symbols.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module stuff looks great! Sorry for leaving so many comments; many of them can probably just turn into TODOs for now.

crates/red_knot/src/module.rs Show resolved Hide resolved
Self(smol_str::SmolStr::new(name))
}

pub fn from_relative_path(path: &Path) -> Option<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps somewhere in here we should validate that the extensions is .py or .pyi?

This actually makes me think that perhaps from_relative_path should not exist at the ModuleName layer but at a layer where it returns several pieces of information: a ModuleName, a ModuleKind (e.g. ::Python or ::Stub), and also maybe is_package boolean (e.g. true for foo/__init__.py, false for foo.py). We will need all of those at some point; as implemented currently this from_relative_path is a lossy operation, since it returns only one of those.

Perhaps those other two fields, and this method, belong on Module/ModuleData?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ModuleKind and whether it is a package belongs on ModuleData. ModuleName is just the full qualified name of a package. The only reason I see for adding them to ModuleName is if we want to support explicitly querying modules by their kind, but I don't think this is something we want. That's why I think from_relative_path is fine. All it should do is to create the full qualified name from a relative import (it shouldn't perform any IO).

Perhaps somewhere in here we should validate that the extensions is .py or .pyi?

Yeah, early returning when it is not a py or pyi file makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I agree this data belongs on ModuleData, not on ModuleName, I wasn't suggesting to put it on ModuleName.

What I was suggesting was that from_relative_path should not be implemented like this as a ModuleName constructor, because then we throw away information from the path that we will need. I think from_relative_path should instead be implemented at a higher level where it returns all of the data that we can glean from the path, including the ModuleName.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see us throwing away any information, considering that this method doesn't do any Io. The idea here is that we get a full qualified name that we can then throw into the resolve function (that retrieves all information)

Copy link
Contributor

@carljm carljm Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, I was looking at this method in isolation; I see the only place it's actually used is in resolve_path, where the very next step is to resolve that module name back to a path (and make sure it actually resolves to the right path). So I agree, we will get all that information from resolving, and that's where we should get it from.

I would still prefer for this to be a private method of ModuleResolver rather than a public constructor of ModuleName, because I think it's kind of important to ensure that it only be used in the context of resolve_path, as just described. We don't want some code in the future constructing ModuleName using ModuleName::from_relative_path and assuming that means it has a correct module name to path correspondence. But this is more a future thing for robustness, not an important prototype consideration.

Copy link
Member Author

@MichaReiser MichaReiser Apr 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't mind making it private and agree, the only two methods that need to be public are relative and new (or absolute).

Self(smol_str::SmolStr::new(name))
}

pub fn relative(_dots: u32, name: &str, _to: &Path) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The algorithm here will be simpler if instead of taking a full Path for _to parameter, we just take a ModuleName and is_package boolean (or some structure that encompasses both, might just be Module).

We need is_package because in foo/bar/__init__.py, from . import baz means foo.bar.baz, but from foo/bar.py, from . import baz means foo.baz.

If we take a full Path here, then we effectively have to re-implement (or call) from_relative_path again as part of this method, but in the places we will likely call this from (resolving a relative import we find in the AST) we will already have all the data of the current module handy, so it will be wasteful to recalculate that from Path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, it also avoids the need to check if the file has a .py extension haha

The only challenge with this is that we need to analyze files that aren't modules (and may not have a module name):

  • Jupyter notebooks
  • Files that don't have a py or pyi extension (Ruff supports configuring additional extensions that should be handled as python files)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about Jupyter notebooks, but I would assume they just can't have relative imports?

For files that don't have a py or pyi extension, I would assume we still treat them as modules as if they did? I'm not sure what the use cases for this is.

crates/red_knot/src/module.rs Show resolved Hide resolved
crates/red_knot/src/module.rs Outdated Show resolved Hide resolved
crates/red_knot/src/module.rs Outdated Show resolved Hide resolved
crates/red_knot/src/module.rs Outdated Show resolved Hide resolved
crates/red_knot/src/module.rs Outdated Show resolved Hide resolved
crates/red_knot/src/module.rs Outdated Show resolved Hide resolved
crates/red_knot/src/module.rs Show resolved Hide resolved
@MichaReiser MichaReiser changed the title Exploration of a salsa like compilation model (does not compile) Red Knot Apr 19, 2024
// src
// parent
// child
// one.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
// one.py
// __init__.py
// one.py

@JonathanPlasse
Copy link
Contributor

https://excalidraw.com/#json=-Thvh6hnezji3DT3SfFYs,Hjt_fOpRTgpgNKy9Hfb9-Q

The link does not seem to be public.

@MichaReiser
Copy link
Member Author

@JonathanPlasse I just opened it in a private session without any problems. Do you get an error that the link is invalid? Also, the link is a bit outdated.

@trag1c
Copy link
Contributor

trag1c commented Apr 22, 2024

I can open it just fine 👍

@JonathanPlasse
Copy link
Contributor

It works for me too in private session. Sorry for the noise.

/// Arena holding all known types
#[derive(Default)]
pub(crate) struct TypeEnvironment {
types_by_id: IndexVec<TypeId, Type>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be interesting if we can come up with a more stable id for types than relying on inference order.

Copy link
Contributor

@carljm carljm Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a good point, I thought about this. I don't think TypeId (as in, the index into this IndexVec) can be stable; that's not really compatible with lazy type evaluation. But we may want to have another, more stable identifier (based on fully qualified name?) that we use most places instead of TypeId.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also abandon using the IndexVec arena in this case? But we will have a lot of types...

Copy link
Contributor

github-actions bot commented Apr 23, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

MichaReiser and others added 22 commits April 27, 2024 10:11
This is needed because of the AST changes from `OnceCell` to `OnceLock`.

If we decide we don't need AST to be Sync/Send, we can revert this along
with those changes; just want green CI for now.

Also adjust a URL in doc comments so `cargo docs` doesn't complain about
it.
Just a reorganization per feedback from Micha that I agreed with. Puts
all `impl` right after the corresponding `struct`, and moves iterators
further down in the file, since they are implementation detail.
## Summary

Indexing definitions of symbols is necessary for fast lazy type
evaluation, so that when we see a reference to a name we can figure out
the type of that symbol from its definition(s).

We only want to do this indexing of definitions once per module, and
then it needs to remain available in the cache thereafter for use by
lazy type evaluation.

Rust lifetimes won't let us use direct references to AST nodes owned by
the cache.

We could use unsafe code to strip the lifetimes from these references,
with safety ensured by our cache invalidation: if we evict the AST from
the cache, we must evict the symbol definitions also. But in order to
serialize such a cache to disk, we would need (at least) an AST
numbering scheme. This may still be something to look into in the
future, for improved performance.

For now, use `NodeKey`: indirect references to an AST node consisting of
a `NodeKind` and a `TextRange`, which we can find again reasonably
quickly in the AST. These are easy to serialize, have no lifetime
problems, and don't require unsafe code.

## Test Plan

Updated tests.
## Summary

Now that symbol definitions don't hold direct references to AST nodes,
and thus don't have a lifetime bound, there's no longer any reason to
separate the "core SymbolTable" from the definitions. All of the symbol
table, including definitions, can be cached together, and it all needs
to be invalidated together if the module AST changes. So let's simplify
this and have fewer structs.

## Test Plan

Existing tests.
Fill out the representation of types to a few more cases (especially
unions and intersections) before going too far with type evaluation.
I got this all working and solved the API lifetime issues without Arc,
by means of a new set of `TypeRef` structs.

The remaining potential performance issue is that anytime you hold on to
any of the new `TypeRef` structs, you lock a shard of the
`TypeStore::modules` dashmap to writes (because you are holding a
reference into it). So it will be important to minimize the use and
scope of these type-refs. I think we can do this to some degree by
caching type judgments using just type IDs. I also think for CLI use
when we want to be highly parallel, we can be smart about ordering
(check all module bodies first, then check function bodies when module
level types are all populated) to minimize write contention. Also, if
needed we can break up `ModuleTypeStore`, or use inner mutability and
internal locking to have finer-grained locking within it.

I went with this version instead of rewriting to have the type arenas
hold Arc to the types, because I am not totally convinced the Arc
version will be better. With Arc every "read" turns into a write to the
atomic reference count, which introduces overhead (which is really
useless overhead for us, since ultimately we rely on the arenas for
garbage collection). And so we will introduce contention on the atomic
reference count even for reads of highly-used types. So for both
versions we will have to be careful with our use of references. I think
the Arc-free version is lower overhead and sets us up better for future
optimization of the locking strategy, once we have more working code to
optimize against.

Even if I turn out to be wrong about the above and eventually we decide
to use Arc, I'd rather go with this for now and move on to type
evaluation, and make the Arc change later when we can evaluate the
effects better.
This PR demonstrates resolving an import from one module to a class type
from another module!
@MichaReiser MichaReiser added the internal An internal refactor or improvement label Apr 27, 2024
@MichaReiser MichaReiser marked this pull request as ready for review April 27, 2024 08:27
@MichaReiser MichaReiser changed the title Red Knot Kick off Red-knot Apr 27, 2024
@MichaReiser MichaReiser enabled auto-merge (squash) April 27, 2024 08:29
@MichaReiser MichaReiser merged commit 7cd065e into main Apr 27, 2024
18 checks passed
@MichaReiser MichaReiser deleted the red-knot branch April 27, 2024 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants