Expose the METADATA file of wheels in the simple API #8254

Closed
dstufft opened this issue Jul 13, 2020 · 124 comments · Fixed by #15526

Comments

@dstufft
Member

dstufft commented Jul 13, 2020

Currently a number of projects are trying to work around the fact that, to resolve dependencies in Python, you have to download the entire wheel just to read its metadata. I am aware of two current strategies for working around this. One is to use the PyPI JSON API, which isn't a good solution because it's non-standard, the data model is wrong, and it's not going to be secured by TUF. The other is to use range requests to fetch only the METADATA file from the wheel before downloading the entire wheel, which isn't a good solution because TUF can currently only verify entire files, and it depends on the server supporting range requests, which not every mirror is going to support.

It seems to me like we could sidestep this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself. Within TUF we can ensure that these files have not been tampered with, by simply storing each one as another TUF-secured target. Resolvers could then download just the metadata file for a wheel they're considering as a candidate, instead of having to download the entire wheel.
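As a rough illustration of the extraction step proposed here, the sketch below pulls METADATA out of a wheel with Python's standard `zipfile` module. It is not Warehouse code: the function names `extract_metadata` and `build_demo_wheel` are hypothetical, and it assumes the wheel contains exactly one top-level `.dist-info` directory.

```python
import io
import zipfile


def extract_metadata(wheel_bytes: bytes) -> bytes:
    """Return the raw METADATA file from a wheel's single .dist-info directory."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        # Collect the top-level directories whose names end in ".dist-info".
        dist_infos = {
            name.split("/", 1)[0]
            for name in zf.namelist()
            if "/" in name and name.split("/", 1)[0].endswith(".dist-info")
        }
        if len(dist_infos) != 1:
            raise ValueError(
                f"expected one .dist-info directory, found {len(dist_infos)}"
            )
        (dist_info,) = dist_infos
        # Raises KeyError if the directory has no METADATA file.
        return zf.read(f"{dist_info}/METADATA")


def build_demo_wheel() -> bytes:
    """Build a tiny in-memory wheel-like zip, purely for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("demo/__init__.py", "")
        zf.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n",
        )
    return buf.getvalue()


metadata = extract_metadata(build_demo_wheel())
print(metadata.decode())
```

The extracted bytes are what would be stored next to the wheel, so a resolver could fetch them without touching the archive.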

This is a pretty small delta over what already exists, so it's more likely we're going to get it done than any of the broader proposals, such as designing an entire, brand new repository API or ALSO retrofitting the JSON API inside of TUF.

The main problems with it are that the METADATA file might be larger than needed, since it contains the entire long description of the wheel, and that it still leaves sdists unsolved (but they're not currently really solvable). I don't think either problem is too drastic though.

What do folks think? This would probably require a PEP and I probably don't have the spare cycles to do that right now, but I wanted to get the idea written down in case someone else felt like picking it up.

@pypa/pip-committers @pypa/pipenv-committers @sdispater (not sure who else works on poetry, feel free to CC more folks in).

dstufft added the labels "feature request", "help needed", and "needs discussion" on Jul 13, 2020
@pfmoore
Contributor

pfmoore commented Jul 13, 2020

Sounds like a good idea. It would probably need to be an optional feature of the API, as we need to keep the spec backward-compatible, and really basic "serve a directory over HTTP" indexes might not be able to support the API.

But that is a minor point. Basically +1 from me.

One pip-specific note on timing, though. It looks like pip will get range request based metadata extraction before this API gets formalised/implemented. That's fine, but I think that when this does become available, pip should drop the range approach and switch to just using this. That would be a performance regression for indexes that support range requests but not the new API, but IMO that's more acceptable than carrying the support cost for having both approaches.

@uranusjr
Contributor

I agree, it seems like a reasonable solution. If we design how the metadata is listed carefully, it’d likely also be reasonable for the files-in-a-directory use case to optionally implement.

@di
Member

di commented Jul 13, 2020

What would the filename of this file be? Something like pip-20.1.1-py2.py3-none-any.whl.METADATA?

Trying to think of alternatives: since METADATA is already RFC 822 compliant, we could include the metadata as headers on the response to requests for .whl files. Clients that only want the metadata could call HEAD on the URL, clients that want both metadata and the .whl file itself would call GET and get both in a single request. This would be a bit more challenging for PyPI to implement, though.
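Because METADATA uses RFC 822 style headers, Python's standard `email` package can read it. The snippet below is a minimal sketch with made-up metadata for a hypothetical project called `demo`; it only shows how a resolver might pull out the dependency fields, not how any particular installer does it.

```python
from email.parser import HeaderParser

# Hypothetical METADATA content, for illustration only.
METADATA_TEXT = """\
Metadata-Version: 2.1
Name: demo
Version: 1.0
Requires-Python: >=3.8
Requires-Dist: requests>=2.0
Requires-Dist: attrs; python_version < "3.10"

Long description goes in the message body.
"""

msg = HeaderParser().parsestr(METADATA_TEXT)
name = msg["Name"]
# Requires-Dist may appear several times, so collect every occurrence.
requires = msg.get_all("Requires-Dist", [])
print(name, msg["Version"], requires)
```

The long description ends up in the message body, which is the part a resolver does not need.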

@dstufft
Member Author

dstufft commented Jul 13, 2020

It would also be more challenging for mirrors like bandersnatch to implement, since they don't have any runtime components where they could add those headers, but the bigger thing is that headers can't be protected by TUF, and we definitely want this to be TUF protected.

The other option would be to embed this inside the TUF metadata itself, which is a JSON doc and has an area for arbitrary metadata to be added. However, I think that's worse for us since it's a much larger change in that sense, and sort of violates a bit of the separation of concerns we currently have with TUF.

As far as file name, I don't really have a strong opinion on it. Something like pip-20.1.1-py2.py3-none-any.whl.METADATA works fine for me: there's a very clear marker for what file the metadata belongs to, and in the "serve a directory over HTTP" index, they could easily add that file too.
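A small sketch of how a client could derive the sidecar URL from a wheel URL under this naming idea. The helper `metadata_url` and the `example.invalid` host are hypothetical. The suffix is a parameter because the thread floats `.METADATA` here, while the commit messages further down refer to the PEP 658 `.metadata` file.

```python
from urllib.parse import urlsplit, urlunsplit


def metadata_url(wheel_url: str, suffix: str = ".METADATA") -> str:
    """Append `suffix` to the path of `wheel_url`, dropping any #fragment."""
    parts = urlsplit(wheel_url)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path + suffix, parts.query, "")
    )


# Index pages often carry a hash in the URL fragment; it belongs to the wheel,
# not to the metadata file, so it is dropped.
url = "https://example.invalid/packages/pip-20.1.1-py2.py3-none-any.whl#sha256=abc"
print(metadata_url(url))
```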

@di
Member

di commented Jul 13, 2020

Got it, I wasn't thinking that TUF couldn't protect headers but that makes sense in retrospect.

I don't see any significant issues with the proposal aside from the fact that PyPI will finally need to get into the business of unzipping/extracting/introspecting uploaded files. Do we think that should happen during the request (thus guaranteeing that the METADATA file is available immediately after upload, but potentially slowing down the request) or can it happen outside the request (by kicking off a background task)?

@dstufft
Member Author

dstufft commented Jul 13, 2020

Within the legacy upload API we will probably want to do it inline? I don't know, that's a great question for whoever writes the actual PEP to figure out the implications of either choice 😄 . #7730 is probably the right long term solution to that particular problem.

@dholth

dholth commented Jul 13, 2020

Alternatively it might be nice to provide the entire *.dist-info directory as a separable part. Or, going the other direction, METADATA without long-description. Of course it can be different per each individual wheel.

@dstufft
Member Author

dstufft commented Jul 13, 2020

I thought about the entire .dist-info directory. If we did that we would probably want to re-zip it into a single artifact. It just didn't feel super worthwhile to me, as I couldn't think of a use case for accessing files other than METADATA as part of the resolution/install process, which is all this idea really cares about. Maybe there's something I'm not thinking about though?

@pfmoore
Contributor

pfmoore commented Jul 13, 2020

Agreed, anything other than METADATA feels like YAGNI. After all, the only standardised files in .dist-info are METADATA, RECORD and WHEEL. RECORD is not much use without the full wheel, and there's not enough in WHEEL to be worth exposing separately.

So unless there's a specific use case, like there is for METADATA, I'd say let's not bother.

@dholth

dholth commented Jul 13, 2020

Off the top of my head the entry points are the most interesting metadata not in 'METADATA'

@ofek
Contributor

ofek commented Jul 14, 2020

Are we expecting to backfill metadata for a few versions of popular projects, particularly those that aren't released often?

@pradyunsg
Contributor

What do folks think?

I quite like it. :)

pip-20.1.1-py2.py3-none-any.whl.METADATA

👍 I really like that this makes it possible for static mirrors to provide this information! :)

not sure who else work on poetry, feel free to CC more folks in

@abn @finswimmer @stephsamson


My main concern is the same as @ofek -- how does this work with existing uploads? Would it make sense for PyPI to have a "backfill when requested" approach for existing uploads?

@di
Member

di commented Jul 14, 2020

I think we'd just backfill this for every .whl distribution that has a METADATA file in it?

@dstufft
Member Author

dstufft commented Jul 14, 2020 via email

@chrahunt

and there's not enough in WHEEL to be worth exposing separately

In pip at least we extract and parse WHEEL first, to see if we can even understand the format of the wheel. In a future where we actually want to exercise that versioning mechanism, if we make WHEEL available from the start then we can avoid considering new wheels we wouldn't be able to use. If we don't take that approach then projects may hesitate to release new wheels because it would cause users' pips to fully resolve then backtrack (or error out) when encountering a new wheel once downloaded.
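To make the versioning check concrete, here is a hedged sketch of reading `Wheel-Version` from a WHEEL file and deciding whether an installer can use the wheel. It is not pip's implementation: `SUPPORTED`, `wheel_version`, and `is_usable` are hypothetical names, and the rule applied is the one from the wheel specification, where a greater major version means the installer must refuse the wheel.

```python
from email.parser import HeaderParser

# Assumed: the highest Wheel-Version this hypothetical installer understands.
SUPPORTED = (1, 0)


def wheel_version(wheel_file_text: str) -> tuple:
    """Parse the Wheel-Version field of a WHEEL file into a tuple of ints."""
    msg = HeaderParser().parsestr(wheel_file_text)
    raw = msg["Wheel-Version"]
    if raw is None:
        raise ValueError("WHEEL file has no Wheel-Version field")
    return tuple(int(part) for part in raw.strip().split("."))


def is_usable(version: tuple, supported: tuple = SUPPORTED) -> bool:
    """A greater major version is incompatible; a greater minor one is not."""
    return version[0] <= supported[0]


WHEEL_TEXT = "Wheel-Version: 1.0\nGenerator: demo\nRoot-Is-Purelib: true\nTag: py3-none-any\n"
version = wheel_version(WHEEL_TEXT)
print(version, is_usable(version))
```

If WHEEL were served next to METADATA, a resolver could run a check like this before ever selecting the wheel as a candidate.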

@trishankatdatadog
Contributor

It seems to me like we could side step this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself.

Great idea! In fact, we should be able to list this METADATA as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678

@woodruffw
Member

Great idea! In fact, we should be able to list this METADATA as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678

Yep! This should be doable, as long as it's part of (or relationally connected to) the Release or File models.

@dstufft
Member Author

dstufft commented Jul 15, 2020

What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the METADATA file?

@woodruffw
Member

What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the METADATA file?

Yep, exactly. We wouldn't need the METADATA filename itself stored, assuming that it can be inferred (i.e. that it's always {release_file}.METADATA).
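A minimal sketch of that idea: keep only a digest of the METADATA bytes and infer the filename from the release file. The function `metadata_record` and the dictionary layout are hypothetical, not Warehouse's actual schema, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def metadata_record(release_file: str, metadata_bytes: bytes) -> dict:
    """Describe the sidecar file for `release_file` (illustrative layout only)."""
    return {
        # Inferred, so it would not need to be stored in the database.
        "filename": f"{release_file}.METADATA",
        # The digest is the one piece of information that has to be stored.
        "sha256": hashlib.sha256(metadata_bytes).hexdigest(),
    }


record = metadata_record("pip-20.1.1-py2.py3-none-any.whl", b"Name: pip\n")
print(record["filename"], record["sha256"])
```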

@konstin
Contributor

konstin commented Feb 29, 2024

I've read both PEPs, but unfortunately I still don't understand how this mechanism would allow a client to differentiate between the case where a wheel has invalid metadata (so we can skip without extra network requests) and the case where the metadata is not provided by the server.

@pfmoore
Contributor

pfmoore commented Feb 29, 2024

to differentiate between the case where a wheel has invalid metadata (so we can skip without extra network requests) and the case where the metadata is not provided by the server

If the wheel has genuinely invalid metadata, it's not installable, and should be deleted (or less drastically, yanked). How PyPI manages the process of yanking would need to be agreed, of course.

From what I can tell, the problem here is that there's some argument over whether the metadata is invalid, or whether it's just that the extraction code being used by PyPI isn't sufficiently robust (or assumes that older uploads conform to the rules being applied now). If the concern is that the wheels without metadata can be installed, then marking them as "invalid metadata" would be wrong.

So ultimately, I think that PyPI should classify wheels as follows:

  1. Metadata has been extracted - no issues, all done.
  2. Metadata could not be extracted, wheel is unarguably corrupt and not installable - wheel should be yanked, and a process needs to be sorted out for that to happen.
  3. Metadata could not be extracted, wheel takes a different interpretation of the spec than the extraction code - fix the extraction code and re-run it on those wheels. In the meantime, consumers will take a (small) performance hit because the metadata extraction is incomplete.

The problem with (2) comes if PyPI wants to make it a feature (rather than a bug) that it hosts data that is invalid (in the sense that it's not installable, and doesn't follow PyPI's own rules for what is valid on upload). In that case, I guess exposing a custom field is the only real option, but you can't expect consumers like pip to respect that field (because that's just in effect making pip responsible for ignoring bad wheels rather than PyPI doing so - it's the same difficult choice that yanking would involve, but dumped on the client).

Personally, as I said above, I'm fine with simply skipping metadata that can't be extracted. Practicality beats purity.

@pfmoore
Contributor

pfmoore commented Feb 29, 2024

a wheel has invalid metadata

A further comment - the point here is that a (spec compliant) wheel cannot have invalid metadata. "Having valid metadata" is part of the definition of a valid wheel.

@di
Member

di commented Feb 29, 2024

I think this is going to happen infrequently enough that it doesn't make sense to put the additional effort in to handle this edge case. For context, we've only seen 164 wheels with "invalid" metadata out of the 3.2M files we've processed so far.

Here's the breakdown by year, in case anyone was curious:

warehouse=> SELECT EXTRACT(YEAR FROM release_year) AS release_year, COUNT(*)
FROM (
    SELECT DATE_TRUNC('year', upload_time) AS release_year
    FROM release_files
    WHERE metadata_file_unbackfillable IS TRUE
) AS subquery
GROUP BY release_year;
 release_year | count
--------------+-------
         2019 |    13
         2020 |    44
         2021 |    52
         2022 |    48
         2023 |    11
(5 rows)

Reminder that we are processing files in reverse chronological order, so while I expect this number to continue to go up, I expect those releases to be less and less widely used.

I don't think it would be completely unreasonable for PyPI to yank these non-spec-compliant wheels, but it would be unprecedented.

@pfmoore
Contributor

pfmoore commented Feb 29, 2024

Actually, if someone cared enough (I don't 😉) then I think it would be possible to record this distinction:

  1. core-metadata field missing => backfill not completed yet
  2. core-metadata is False => backfill completed, metadata could not be extracted
  3. core-metadata is True or a hash dictionary => backfill completed, metadata is present

see here.
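The three-way distinction listed above could be read by a client roughly as follows. This is a sketch of the suggestion only, not confirmed PyPI behaviour: `BackfillState` and `backfill_state` are hypothetical names, and `file_entry` stands for one file's dictionary from a JSON index response.

```python
from enum import Enum


class BackfillState(Enum):
    PENDING = "backfill not completed yet"
    UNAVAILABLE = "backfill completed, metadata could not be extracted"
    PRESENT = "backfill completed, metadata is present"


def backfill_state(file_entry: dict) -> BackfillState:
    """Classify a file entry by its `core-metadata` field."""
    if "core-metadata" not in file_entry:
        return BackfillState.PENDING
    value = file_entry["core-metadata"]
    if value is False:
        return BackfillState.UNAVAILABLE
    if value is True or isinstance(value, dict):
        # True, or a dictionary mapping hash names to digests.
        return BackfillState.PRESENT
    raise ValueError(f"unexpected core-metadata value: {value!r}")


print(backfill_state({"filename": "demo-1.0-py3-none-any.whl"}))
```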

@groodt

groodt commented Mar 3, 2024

I think I know the answer, but does PyPI block uploads of future wheels with invalid metadata?

Once the backfill is completed, we can review the "corrupt" distributions and also look at their download stats to determine if any further action will be disruptive. It might be that the distributions already have no downloads, so can either be yanked or ignored.

@di
Member

di commented Mar 3, 2024

I think I know the answer, but does PyPI block uploads of future wheels with invalid metadata?

Yes.

@di
Member

di commented Mar 3, 2024

The backfill is complete, we have 303 projects that we've determined to be unbackfillable, I've created a gist with their details here: https://gist.github.com/di/dadce0a0359526323d09efc9309f1d22.

@pfmoore
Contributor

pfmoore commented Mar 4, 2024

I've created a gist with their details here

I tried checking that list and some of them are not wheels (i.e., don't have a .whl extension). The rest are giving 404 errors.

I spot checked one (compy-1.0.2-py3-none-any.whl) and it's not visible in the PyPI web UI. So I don't think these problem files matter.

Edit: Correction, all the files in that list have URLs of the form https://files.pythonhosted.org/packages1f/a8/.... Note the missing slash after the packages component of the path. But I found a different package (/multiple_docking-0.5-py3-none-any.whl) which is visible and downloadable in the web UI, and which has the correct URL in the simple index and the JSON API, suggesting that something else went wrong.

The non-wheels do seem to simply be files marked (in the JSON API, so presumably in the upload UI somehow) as packagetype bdist_wheel when that's not actually the case. So those are simply bad data, and can be ignored for the purposes of this exercise.

Thanks for doing this @di!

@groodt

groodt commented Mar 4, 2024

I realised the URLs in the gist are missing a /. I'm not sure if that's from the creation of the gist or from some other mechanism. When I fixed the URLs, I was able to download the binaries.

I've updated and pushed the fixed paths here: https://gist.github.com/groodt/345aacb3795db63fe94735839824de87 (I'm not sure how / if to fork or merge against gists, or I would have simply pushed them to @di repo)
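For anyone repeating this, the repair can be sketched as a one-line substitution. The helper `fix_packages_url` is hypothetical, and the path after `packages` in the sample URL is invented for illustration rather than taken from the gist.

```python
import re


def fix_packages_url(url: str) -> str:
    """Insert the slash missing after the `packages` path component."""
    # Only matches when "/packages" is directly followed by a non-slash
    # character, so URLs that are already correct are left unchanged.
    return re.sub(r"/packages(?=[^/])", "/packages/", url, count=1)


# Invented example path, shaped like the broken URLs described above.
broken = "https://files.pythonhosted.org/packages1f/a8/0123abcd/example-1.0-py3-none-any.whl"
print(fix_packages_url(broken))
```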

I also noticed that many of the artifacts have the wrong extension (.egg, .zip and .tar.gz) but the ones that I looked at were displaying in the UI.

Screenshot 2024-03-04 at 10 04 46 pm

@pfmoore
Contributor

pfmoore commented Mar 4, 2024

Thanks - I edited my comment as I noticed this after the fact, too. The wrong extensions are due to bad filetypes in the underlying data, but installers should be checking the file extension so will likely never care. I'll need to do a bit more digging on the actual .whl files.

edmorley added a commit to heroku/heroku-buildpack-python that referenced this issue Mar 4, 2024
Pip prints slightly different log output depending on whether the
package being installed has the new PEP 658 `.metadata` file available
on PyPI.

Until now, only packages uploaded to PyPI since the feature was
implemented had this metadata generated by PyPI, however, the metadata
file has now been backfilled for older packages too:
pypi/warehouse#8254 (comment)

As a result, the Pip support log output assertion needs updating for
the new output, to fix CI on `main`:
https://github.com/heroku/heroku-buildpack-python/actions/runs/8138313649/job/22238825835?pr=1545#step:5:479
@pfmoore
Contributor

pfmoore commented Mar 4, 2024

Looking at the bad files, I see:

  • Multiple .dist-info directories in the wheel (168 cases). Not actually an error, as you should pick the one with the right project name and version, but naive code doesn't always do that. This might be a bug in the backfill code.
  • The project name in the .dist-info directory is not normalised correctly (44 cases). Technically an error, but recoverable if you normalize the name before matching and otherwise don't care 😉
  • No .dist-info matching the project name and version (51 cases). This is a genuine error, and the wheel isn't installable.
  • There is a .dist-info directory, but it doesn't contain a METADATA file (3 cases). This is a genuine error, and the wheel isn't installable [1].
  • File is not a .whl file (28 cases).
  • URL can't be fetched (7 cases).
  • There's one file with a stupidly long version number: uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl. This may well break things - Windows won't even store it under its full name.
  • One file, stochatreat-0.0.1-py3-none-any.whl looks valid, except that it has no content except the .dist-info directory...

So 213 of the 303 problem files could have metadata extracted (probably). I doubt it's critical, even though there's a bunch of big (100-200MB) files in there (almost all torch or intel_tensorflow).
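The totals can be checked with a quick tally. The counts come from the list above; which categories count as recoverable is an assumption made here, chosen as one reading that reproduces the 213 figure (the multiple-.dist-info cases, the unnormalised names, and the wheel that contains only a .dist-info directory).

```python
counts = {
    "multiple .dist-info directories": 168,
    "project name not normalised": 44,
    "no matching .dist-info": 51,
    ".dist-info without METADATA": 3,
    "not a .whl file": 28,
    "URL can't be fetched": 7,
    "very long version number": 1,
    "only a .dist-info directory": 1,
}

# Assumption: these are the categories where metadata could still be extracted.
recoverable_keys = [
    "multiple .dist-info directories",
    "project name not normalised",
    "only a .dist-info directory",
]

total = sum(counts.values())
recoverable = sum(counts[key] for key in recoverable_keys)
print(total, recoverable)  # 303 213
```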

I can probably write a script to extract the metadata from the files where it's possible to infer the correct file to extract. But I've no idea how we could take those extracted files and (safely, assuming "a bunch of stuff I sent to someone" isn't exactly a secure distribution method 😉) upload them to PyPI. So it's probably not worth the effort - certainly I've only been doing this out of intellectual curiosity, I don't have any need for this data.
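Such a script might start from something like the sketch below, which picks the `.dist-info` directory matching a project name and version even when several are present or the name is not normalised. The functions `normalize` and `find_dist_info` are hypothetical, and the sketch assumes the version contains no hyphen, so the directory stem can be split on its last `-`.

```python
import re
from typing import List, Optional


def normalize(name: str) -> str:
    """Normalise a project name: runs of -, _ and . become one -, lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_dist_info(names: List[str], project: str, version: str) -> Optional[str]:
    """Return the .dist-info directory matching project and version, or None.

    `names` is a list of archive member names, as from ZipFile.namelist().
    """
    wanted = normalize(project)
    top_levels = sorted({n.split("/", 1)[0] for n in names if "/" in n})
    for top in top_levels:
        if not top.endswith(".dist-info"):
            continue
        stem = top[: -len(".dist-info")]
        # Assumed layout: "<name>-<version>.dist-info".
        dir_name, sep, dir_version = stem.rpartition("-")
        if sep and normalize(dir_name) == wanted and dir_version == version:
            return top
    return None


members = [
    "pkg/__init__.py",
    "Other_Thing-2.0.dist-info/METADATA",
    "My.Pkg-1.0.dist-info/METADATA",
    "My.Pkg-1.0.dist-info/RECORD",
]
print(find_dist_info(members, "my-pkg", "1.0"))
```

Returning `None` instead of guessing keeps the genuinely broken wheels (no matching directory) separate from the recoverable ones.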

Footnotes

  1. One of these cases, os_sys-1.9.3 is actually even worse. Every file in the wheel has a leading slash in the name, including the real .dist-info directory. However, the RECORD file is written with the correct .dist-info name (no leading slash), making it look like there's a .dist-info with no metadata file. I don't know how people manage to create such weirdly corrupt wheels. And all of the project links are broken, so this looks very much like abandonware at best, or something deliberately malicious at worst...

@ewjoachim
Contributor

uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl. This may well break things - Windows won't even store it under its full name.

Only 215 digits ? Real programmers use math constants exact values as version numbers :D

In order for this message to stay constructive: On a project with old dependencies, I've witnessed the resolution time of poetry add to be much reduced in the last weeks, and I wanted to thank everyone involved.

@pradyunsg
Contributor

Not actually an error, as you should pick the one with the right project name and version, but naive code doesn't always do that. This might be a bug in the backfill code.

The backfill code is using logic that pip uses, and pip also errors out in wheels with multiple dist-info files. :)

@di
Member

di commented Mar 4, 2024

I realised the urls in the gist are missing a /

Apologies, that was my typo.

I also noticed that many of the artifacts have the wrong extension (.egg, .zip and .tar.gz) but the ones that I looked at were displaying in the UI.

Yes, PyPI thinks these are wheels, likely due to incorrect metadata being provided at upload time outside the wheel, at a time when we didn't do as much validation.

Looking at the bad files, I see:

Thanks for all the additional analysis, but again, I think we'll probably just leave these as-is. I think the bar we want to set here is "if pip won't install it, don't bother with it".

In order for this message to stay constructive: On a project with old dependencies, I've witnessed the resolution time of poetry add to be much reduced in the last weeks, and I wanted to thank everyone involved.

Glad to hear it! I'd be interested to see some hard numbers on how much faster it is, if you're able to provide them.

@groodt

groodt commented Mar 4, 2024

I hope there's a blog post or something about all of this. I'd be super curious to hear from @ewdurbin or anyone else who may know if there has been anything noticeable from a bandwidth or hosting perspective now that fewer chonky wheels are downloaded.

@di
Member

di commented Mar 4, 2024

Here's bandwidth since the start of the year, looks like it was actually up last week, to the highest point ever:

[image: bandwidth graph]

Requests are up as well:

[image: requests graph]

The only thing that really seems to be down is 3XX redirects:

[image: 3XX redirects graph]

@pfmoore
Contributor

pfmoore commented Mar 4, 2024

The backfill code is using logic that pip uses, and pip also errors out in wheels with multiple dist-info files.

lol, looks like I've been analyzing PyPI for so long now I'm super paranoid, far more than the tools I maintain are 🙂

Thanks for all the additional analysis, but again, I think we'll probably just leave these as-is. I think the bar we want to set here is "if pip won't install it, don't bother with it".

Absolutely! As I said, this was mostly for my own curiosity, and I posted the results in case others were interested, but I don't think there's any reason to worry beyond that.

@groodt

groodt commented Mar 4, 2024

Here's bandwidth since the start of the year, looks like it was actually up last week, to the highest point ever:

That's so surprising! And that's at the CDN side right? Maybe all the clients cache effectively already, but the new metadata helps them resolve quicker and have smaller caches if they start from cold...

It definitely feels noticeably quicker... I wonder what the best way to measure it might be.

@ewjoachim
Contributor

ewjoachim commented Mar 4, 2024

Glad to hear it! I'd be interested to see some hard numbers on how much faster it is, if you're able to provide them.

I don't think I can, sorry: my comment was based on an impression, and while I could measure the time now, it's probably not worth rolling back the whole thing so I can make the compared measurement for "before". If the part that depends on this was done in a CI step, I could compare, but this only happens on dependency resolution, which is a manual-only operation on this repo.

@di
Member

di commented Mar 4, 2024

And that's at the CDN side right?

Yes. My guess is that it's three things:

  • most resolvers actually land on a compatible version fairly quickly
  • this won't make a difference when installing from a lockfile
  • people are probably still using installers that can't take advantage of this, including old versions of pip

...so any gains here are largely dwarfed by our massive overall bandwidth and weekly growth.

@ddelange

ddelange commented Mar 5, 2024

Regarding bandwidth and performance, there are some awesome pip PRs open by @cosmicexplorer that will also take advantage of the backfill: pypa/pip#12186, pypa/pip#12256, pypa/pip#12257, pypa/pip#12258 - e.g. quoting the second one:

This change produces a 6.5x speedup against the example tested below, reducing the runtime of this pip install --report command from over 36 seconds down to just 5.7 seconds:

# this branch is #12186, which this PR is based off of
> git checkout metadata_only_resolve_no_whl
> python3.8 -m pip install --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m36.410s
user    0m15.706s
sys     0m13.377s
# switch to this PR
> git checkout link-metadata-cache
# enable --use-feature=metadata-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.671s
user    0m4.429s
sys     0m0.123s
