Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify DeclaredDependency and fix typing{-,_}extensions #193

Merged
merged 10 commits into from
Mar 3, 2023

Conversation

jherland
Copy link
Member

This started out as an attempt to fix #173, but also picked up some of the simplifications and fixing of unnecessary duplication that we discussed on Slack recently: https://tweag.slack.com/archives/C04AAQP8C85/p1677243757048469

This series does fix the typing-extensions vs typing_extensions issue in the middle, but there is quite a bit of refactoring happening before, and some very nice cleanup happening afterwards. In sum I think the refactoring is worthwhile, but if you'd rather have the fix committed in its own PR, I can have a look at pulling it out of this context.

  • types: Rename DependenciesMapping.DEPENDENCY_TO_IMPORT -> .LOCAL_ENV
  • check: Reorganize mapping of dependencies to import names
  • check.Package.normalize_name(): Replace hyphens with underscores
  • types: Remove mapping details from DeclaredDependency
  • types: Reduce duplication by changing .references into List[Location]

@jherland jherland added this to the First public release milestone Feb 28, 2023
@jherland jherland self-assigned this Feb 28, 2023
fawltydeps/check.py Outdated Show resolved Hide resolved
fawltydeps/check.py Outdated Show resolved Hide resolved
fawltydeps/check.py Outdated Show resolved Hide resolved
fawltydeps/check.py Outdated Show resolved Hide resolved
@jherland
Copy link
Member Author

After talking about this in today's status meeting, I will make a couple of commits that adds the mapping information back into the Analysis object, so that we can debug based on the JSON output.

@jherland jherland marked this pull request as draft March 1, 2023 10:00
@jherland jherland force-pushed the jherland/package-mapping-work branch 2 times, most recently from 98d196e to d97b6ad Compare March 2, 2023 10:41
@jherland
Copy link
Member Author

jherland commented Mar 2, 2023

Ok, this is ready for a new round of reviews now.

The only substantive change (in addition to rebasing) is adding two commits at the end that expose the dep-to-import mappings in the Analysis object (and therefore in our JSON output): There is a new Analysis.resolved_deps field that show the dep-to-Package mappings that were involved in calculating the result.

Here is an abbreviated look at what that looks like in out JSON output (running fawltydeps --json in the FD repo, and showing both mappings that were resolved through the local env as well as identity mappings):

{
  "settings": { ... },
  "imports": [ ... ],
  "declared_deps": [ ... ],
  "resolved_deps": {
    "importlib_metadata": {
        "package_name": "importlib-metadata",
        "import_names": ["importlib_metadata"],
        "mappings": ["local_env"]
    },
    ...
    "typing-extensions": {
        "package_name": "typing_extensions",
        "import_names": ["typing_extensions"],
        "mappings": ["local_env"]
    },
    "pytest": {
        "package_name": "pytest",
        "import_names": ["_pytest", "py", "pytest"],
        "mappings": ["local_env"]
    },
    ...
    "scipy": {
        "package_name": "scipy",
        "import_names": ["scipy"],
        "mappings": ["identity"]
    },
    "numpy": {
        "package_name": "numpy",
        "import_names": ["numpy"],
        "mappings": ["identity"]
    }
  },
  "undeclared_deps": [ ... ],
  "unused_deps": [ ... ],
  "version": "0.4.1"
}

@jherland jherland marked this pull request as ready for review March 2, 2023 10:53
@jherland jherland requested a review from mknorps March 2, 2023 10:53
fawltydeps/types.py Outdated Show resolved Hide resolved
jherland added a commit that referenced this pull request Mar 3, 2023
As Maria suggests in [1], knowing which import name came from which
mapping can be very useful when debugging. We achieve this by combining
.mappings and .import_names into a dict that associates each mapping
with the import names it contributes.

However, although now redundant, we still want to hang on to the
flattened .import_names set for checking if each declared dep is used.

Therefore we assign the new dict to .mappings and keep the redundant
.import_names, except that we hide it from our JSON output.

[1]: #193 (comment)
Copy link
Collaborator

@mknorps mknorps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very clear refactor and a good idea to create a Package class to govern mappings and to simplify the UnusedDependency and UndeclaredDepencency.

What I would love to see more of (and my main concern here) are tests for Package class methods. They are tested indirectly in other places, but it would fit to include them in test_types.py.

Apart from that, more of a nit-pick, is that LocalPackageLookup is now more related to Packages than to dependencies and I would see it separate from check.py and gathered in one module with Package.

tests/test_compare_imports_to_dependencies.py Show resolved Hide resolved
@jherland
Copy link
Member Author

jherland commented Mar 3, 2023

What I would love to see more of (and my main concern here) are tests for Package class methods. They are tested indirectly in other places, but it would fit to include them in test_types.py.

Agreed. Will add some Package tests.

jherland added 8 commits March 3, 2023 11:50
LOCAL_ENV is more descriptive for describing the source of this mapping
from package name to import names: we query the current/local Python
environment with importlib_metadata.packages_distributions() to get this
mapping.
This moves around a lot of code, but the ultimate behavior is unchanged
(no tests are changed in a meaningful way):

- We introduce a new Package class whose sole purpose is to encapsulate
  the mapping from package (i.e. dependency) names to an associated set
  of import names. This mapping will replace the similar mapping that
  lived inside the DeclaredDependency class.
  - This class also holds the logic to _normalize_ package names. For
    now, this consists only of lowercasing them, but this will change
    soon.
  - Note that the .package_name member is NOT meant to be normalized,
    but should reflect the spelling of the source from which it was
    created (packages_distributions() in the case of LOCAL_ENV, or the
    DeclaredDependency object in case of IDENTITY). Instead the
    normalization will happen in the surrounding code when lookup is
    performed (see e.g. LocalPackageLookup.lookup_package()).
  - Also, the .is_used() helps simplify the matching of imports from the
    code (aka. imported_names) against the import names exposed by the
    package.
  - As a temporary measure, we also include a .modify() method which
    will modify/replace a DeclaredDependency instance in the same way as
    the old dependency_to_imports_mapping() function did. Although this
    modification/replacement is no longer necessary, it allows us to
    delay major test changes until later commit.
- The new Package class allows the LocalPackageLookup to become somewhat
  simpler: Its constructor still calls packages_distributions(), but now
  builds a dict of Packages objects to cache the extracted information,
  and the .lookup_package() method becomes a simple lookup in this dict.
- Finally, the new resolve_dependencies() function takes over what the
  dependency_to_imports_mapping() and map_dependencies_to_imports()
  functions previously did: Create a local package lookup, and map all
  declared dependencies into a Package object - either one found in the
  local environment, or a fallback pseudo-Package encoding the identity
  mapping.

Instead of the dependency-to-import-names mapping living _inside_ each
DeclaredDependency object, the mapping is now explicitly returned from
resolved_dependencies() as a Dict[str, Package] mapping. The matching/
calculation performed in compare_imports_to_dependencies proceeds
largely as before, albeit with some name changes that IMHO improve
readability.

Althought the tests are left unchanged as much as possible, there are
some necessary changes in test_map_dep_name_to_import_names due to the
API changes described above.
Some dependencies (notably "typing-extensions") have a package name that
contain a hyphen ('-'), but provide an import name where that hyphen has
been replaced with an underscore ('_'). Adding this normalization to our
Package class fixes real-world cases, and brings us closer to the kind
of normalization already performed by similar implementations in this
space, e.g. the various safe_* functions in setuptools:pkg_resources,
or the Pants build system[1]

[1]: e.g. look at the map_third_party_modules_to_addresses() method of
ThirdPartyPythonModuleMapping in release_2.15.0:
src/python/pants/backend/python/dependency_inference/module_mapper.py:319
There is no longer any need to store the import names and mapping enum
details _inside_ the DeclaredDependency class. If we do want these
details in the (e.g. JSON) output of FawltyDeps, we can instead add the
corresponding Package objects back into the UnusedDependency objects.

For now, though, simply remove the .import_names and .mapping members
from the DeclaredDependency class, along with associated code and test
cases.

This helps cut down on the duplication inside the internal Analysis data
structure, which is also reflected in the JSON output.
Until now we had:

    class UndeclaredDependency:
        name: str
	references: List[ParsedImport]

    class UnusedDependency:
        name: str
	references: List[DeclaredDependency]

but both ParsedImport and DeclaredDependency contained their own .name
which was identical to the corresponding .name above. This duplication
is unnecessary, and we can avoid it by storing the .source member of the
latter objects in the .references lists of the respective former
objects. In other words:

    class UndeclaredDependency:
        name: str
	references: List[Location]

    class UnusedDependency:
        name: str
	references: List[Location]

The simpler structure is nicely reflected in some shorter test cases.
This is in preparation for including Package objects in the overall
Analysis data structure.
When we removed .import_names and .mapping from DeclaredDependency, we
also remove any information in our JSON output that could help debug the
dependency-to-import-names mapping that FD performs.

This commits adds that information back in a more useful format. Instead
of duplicating it across many DelaredDependency objects, we now add a
dedicated field to the top-level Analysis object called .resolved_deps.

Analysis.resolved_deps is a mapping from dependency names (as they occur
in the --deps files we have parsed) to the corresponding Package object,
which contains information on what import names that FawltyDeps found,
and which mapping it used to find those import names (currently either
via the local environment, or by falling back to the identity mapping).
As Maria suggests in [1], knowing which import name came from which
mapping can be very useful when debugging. We achieve this by combining
.mappings and .import_names into a dict that associates each mapping
with the import names it contributes.

However, although now redundant, we still want to hang on to the
flattened .import_names set for checking if each declared dep is used.

Therefore we assign the new dict to .mappings and keep the redundant
.import_names, except that we hide it from our JSON output.

[1]: #193 (comment)
@jherland jherland force-pushed the jherland/package-mapping-work branch from 25ec61c to 5392788 Compare March 3, 2023 10:50
@jherland jherland mentioned this pull request Mar 3, 2023
5 tasks
Suggested-by: Maria Knorps <maria.knorps@tweag.io>
@jherland jherland requested a review from mknorps March 3, 2023 14:33
Copy link
Collaborator

@mknorps mknorps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the tests! they are very helpful and descriptive 🎉

🚀

We have one discussion left, but this is very corner-casy stuff. The PR is good to merge :)

This verifies that deps declared with only case differences ("Pip" vs
"pip", which is interpreted as the _same_ package by "pip install")
still end up being reported as two different unused dependencies.
@jherland jherland merged commit 1ecaab4 into main Mar 3, 2023
@jherland jherland deleted the jherland/package-mapping-work branch March 3, 2023 23:56
jherland added a commit that referenced this pull request Mar 3, 2023
As Maria suggests in [1], knowing which import name came from which
mapping can be very useful when debugging. We achieve this by combining
.mappings and .import_names into a dict that associates each mapping
with the import names it contributes.

However, although now redundant, we still want to hang on to the
flattened .import_names set for checking if each declared dep is used.

Therefore we assign the new dict to .mappings and keep the redundant
.import_names, except that we hide it from our JSON output.

[1]: #193 (comment)
jherland added a commit that referenced this pull request Mar 6, 2023
We will soon remove compare_imports_to_dependencies(), but we have a lot
of parametrized tests for this function that we want to retain.

Turn each test vector into a FDTestVector instance to make the
definition of each vector more readable, and then use the same test data
to test resolve_dependencies(), calculate_undeclared(), and
calculate_unused() in addition to compare_imports_to_dependencies().

Based on Maria's great idea in [1].

[1]: #193 (comment)
jherland added a commit that referenced this pull request Mar 6, 2023
We will soon remove compare_imports_to_dependencies(), but we have a lot
of parametrized tests for this function that we want to retain.

Turn each test vector into a FDTestVector instance to make the
definition of each vector more readable, and then use the same test data
to test resolve_dependencies(), calculate_undeclared(), and
calculate_unused() in addition to compare_imports_to_dependencies().

Based on Maria's great idea in [1].

[1]: #193 (comment)
jherland added a commit that referenced this pull request Mar 8, 2023
We will soon remove compare_imports_to_dependencies(), but we have a lot
of parametrized tests for this function that we want to retain.

Turn each test vector into a FDTestVector instance to make the
definition of each vector more readable, and then use the same test data
to test resolve_dependencies(), calculate_undeclared(), and
calculate_unused() in addition to compare_imports_to_dependencies().

Based on Maria's great idea in [1].

[1]: #193 (comment)
jherland added a commit that referenced this pull request Mar 9, 2023
We will soon remove compare_imports_to_dependencies(), but we have a lot
of parametrized tests for this function that we want to retain.

Turn each test vector into a FDTestVector instance to make the
definition of each vector more readable, and then use the same test data
to test resolve_dependencies(), calculate_undeclared(), and
calculate_unused() in addition to compare_imports_to_dependencies().

Based on Maria's great idea in [1].

[1]: #193 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

typing-extensions vs. typing_extensions
2 participants