-
Notifications
You must be signed in to change notification settings - Fork 985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose the METADATA file of wheels in the simple API #8254
Comments
Sounds like a good idea. It would probably need to be an optional feature of the API, as we need to keep the spec backward-compatible, and really basic "serve a directory over HTTP" indexes might not be able to support the API. But that is a minor point. Basically +1 from me. One pip-specific note on timing, though. It looks like pip will get range request based metadata extraction before this API gets formalised/implemented. That's fine, but I think that when this does become available, pip should drop the range approach and switch to just using this. That would be a performance regression for indexes that support range requests but not the new API, but IMO that's more acceptable than carrying the support cost for having both approaches. |
I agree, it seems like a reasonable solution. If we design how the metadata is listed carefully, it’d likely also be reasonable for the files-in-a-directory use case to optionally implement. |
What would the filename of this file be? Something like Trying to think of alternatives: since METADATA is already RFC 822 compliant, we could include the metadata as headers on the response to requests for |
It would also be more challenging for mirrors like bandersnatch to implement, since they don't have any runtime components where they could add those headers, but the bigger thing is header's can't be protected by TUF, and we definitely want this to be TUF protected. The other option would be to embed this inside the TUF metadata itself, which is a JSON doc and has an area for arbitrary metadata to be added.. however I think that's worse for us since it's a much larger change in that sense, and sort of violates a bit of the separation of concerns we currently have with TUF. As far as file name, I don't really have a strong opinion on it. something like |
Got it, I wasn't thinking that TUF couldn't protect headers but that makes sense in retrospect. I don't see any significant issues with the proposal aside from the fact that PyPI will finally need to get into the business of unzipping/extracting/introspecting uploaded files. Do we think that should happen during the request (thus guaranteeing that the |
Within the legacy upload API we will probably want to do it inline? I don't know, that's a great question for whoever writes the actual PEP to figure out the implications of either choice 😄 . #7730 is probably the right long term solution to that particular problem. |
Alternatively it might be nice to provide the entire *.dist-info directory as a separable part. Or, going the other direction, METADATA without long-description. Of course it can be different per each individual wheel. |
I thought about the entire |
Agreed, anything other than So unless there's a specific use case, like there is for |
Off the top of my head the entry points are the most interesting metadata not in 'METADATA' |
Are we expecting to backfill metadata for a few versions of popular projects, particularly those that aren't released often? |
I quite like it. :)
👍 I really like that this makes it possible for static mirrors to provide this information! :)
My main concern is the same as @ofek -- how does this work with existing uploads? Would it make sense for PyPI to have a "backfill when requested" approach for existing uploads? |
I think we'd just backfill this for every |
Yea. It would take some time but we would presumably just do a backfill operation.
…Sent from my iPhone
On Jul 14, 2020, at 6:01 PM, Dustin Ingram ***@***.***> wrote:
I think we'd just backfill this for every .whl distribution that has a METADATA file in it?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
In pip at least we extract and parse |
Great idea! In fact, we should be able to list this |
Yep! This should be doable, as long as it's part of (or relationally connected to) the |
What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the |
Yep, exactly. We wouldn't need the |
I've read both PEPs, unfortunately still i don't understand how this mechanism would allow a client to differentiate between the case where a wheel has invalid metadata (so we can skip without extra network requests) and the case where the metadata is not provided by the server. |
If the wheel has genuinely invalid metadata, it's not installable, and should be deleted (or less drastically, yanked). How PyPI manages the process of yanking would need to be agreed, of course. From what I can tell, the problem here is that there's some argument over whether the metadata is invalid, or whether it's just that the extraction code being used by PyPI isn't sufficiently robust (or assumes that older uploads conform to the rules being applied now). If the concern is that the wheels without metadata can be installed, then marking them as "invalid metadata" would be wrong. So ultimately, I think that PyPI should classify wheels as follows:
The problem with (2) comes if PyPI wants to make it a feature (rather than a bug) that it hosts data that is invalid (in the sense that it's not installable, and doesn't follow PyPI's own rules for what is valid on upload). In that case, I guess exposing a custom field is the only real option, but you can't expect consumers like pip to respect that field (because that's just in effect making pip responsible for ignoring bad wheels rather than PyPI doing so - it's the same difficult choice that yanking would involve, but dumped on the client). Personally, as I said above, I'm fine with simply skipping metadata that can't be extracted. Practicality beats purity. |
A further comment - the point here is that a (spec compliant) wheel cannot have invalid metadata. "Having valid metadata" is part of the definition of a valid wheel. |
I think this going to happen infrequently enough that it doesn't make sense to put the additional effort in to handle this edge case. For context, we've only seen 164 wheels with "invalid" metadata out of the 3.2M files we've processed so far. Here's the breakdown by year, in case anyone was curious:
Reminder that we are processing files in reverse chronological order, so while I expect this number to continue to go up, I expect those releases to be less and less widely used. I don't think it would be completely unreasonable for PyPI to yank these non- spec-compliant wheels, but it would be unprecedented. |
Actually, if someone cared enough (I don't 😉) then I think it would be possible to record this distinction:
see here. |
I think I know the answer, but does PyPI block uploads of future wheels with invalid metadata? Once the backfill is completed, we can review the "corrupt" distributions and also look at their download stats to determine if any further action will be disruptive. It might be that the distributions already have no downloads, so can either be yanked or ignored. |
Yes. |
The backfill is complete, we have 303 projects that we've determined to be unbackfillable, I've created a gist with their details here: https://gist.github.com/di/dadce0a0359526323d09efc9309f1d22. |
I tried checking that list and some of them are not wheels (i.e., don't have a I spot checked one (compy-1.0.2-py3-none-any.whl) and it's not visible in the PyPI web UI. So I don't think these problem files matter. Edit: Correction, all the files in that list have URLs of the form The non-wheels do seem to simply be files marked (in the JSON API, so presumably in the upload UI somehow) as packagetype Thanks for doing this @di! |
I realised the urls in the gist are missing a I've updated and pushed the fixed paths here: https://gist.github.com/groodt/345aacb3795db63fe94735839824de87 (I'm not sure how / if to fork or merge against gists, or I would have simply pushed them to @di repo) I also noticed that many of the artifacts have the wrong extension (.egg, .zip and .tar.gz) but the ones that I looked at were displaying in the UI. |
Thanks - I edited my comment as I noticed this after the fact, too. The wrong extensions are due to bad filetypes in the underlying data, but installers should be checking the file extension so will likely never care. I'll need to do a bit more digging on the actual |
Pip prints slightly different log output depending on whether the package being installed has the new PEP 658 `.metadata` file available on PyPI. Until now, only packages uploaded to PyPI since the feature was implemented had this metadata generated by PyPI, however, the metadata file has now been backfilled for older packages too: pypi/warehouse#8254 (comment) As a result, the Pip support log output assertion needs updating for the new output, to fix CI on `main`: https://github.com/heroku/heroku-buildpack-python/actions/runs/8138313649/job/22238825835?pr=1545#step:5:479
Pip prints slightly different log output depending on whether the package being installed has the new PEP 658 `.metadata` file available on PyPI. Until now, only packages uploaded to PyPI since the feature was implemented had this metadata generated by PyPI, however, the metadata file has now been backfilled for older packages too: pypi/warehouse#8254 (comment) As a result, the Pip support log output assertion needs updating for the new output, to fix CI on `main`: https://github.com/heroku/heroku-buildpack-python/actions/runs/8138313649/job/22238825835?pr=1545#step:5:479 GUS-W-15172805.
Looking at the bad files, I see:
So 213 of the 303 problem files could have metadata extracted (probably). I doubt it's critical, even though there's a bunch of big (100-200MB) files in there (almost all torch or intel_tensorflow). I can probably write a script to extract the metadata from the files where it's possible to infer the correct file to extract. But I've no idea how we could take those extracted files and (safely, assuming "a bunch of stuff I sent to someone" isn't exactly a secure distribution method 😉) upload them to PyPI. So it's probably not worth the effort - certainly I've only been doing this out of intellectual curiosity, I don't have any need for this data. Footnotes
|
Only 215 digits ? Real programmers use math constants exact values as version numbers :D In order for this message to stay constructive: On a project with old dependencies, I've witnessed the resolution time of |
) Pip prints slightly different log output depending on whether the package being installed has the new PEP 658 `.metadata` file available on PyPI. Until now, only packages uploaded to PyPI since the feature was implemented had this metadata generated by PyPI, however, the metadata file has now been backfilled for older packages too: pypi/warehouse#8254 (comment) As a result, the Pip support log output assertion needs updating for the new output, to fix CI on `main`: https://github.com/heroku/heroku-buildpack-python/actions/runs/8138313649/job/22238825835?pr=1545#step:5:479 GUS-W-15172805.
The backfill code is using logic that pip uses, and pip also errors out in wheels with multiple dist-info files. :) |
Apologies, that was my typo.
Yes, PyPI thinks these are wheels, likely due to incorrect metadata being provided at upload time outside the wheel, at a time when we didn't do as much validation.
Thanks for all the additional analysis, but again, I think we'll probably just leave these as-is. I think the bar we want to set here is "if pip won't install it, don't bother with it".
Glad to hear it! I'd be interested to see some hard numbers on how much faster it is, if you're able to provide them. |
I hope there's a blog post or something about all of this. I'd be super curious to hear from @ewdurbin or anyone else who may know if there has been anything noticeable from a bandwidth or hosting perspective now that fewer chonky wheels are downloaded. |
lol, looks like I've been analyzing PyPI for so long now I'm super paranoid, far more than the tools I maintain are 🙂
Absolutely! As I said, this was mostly for my own curiosity, and I posted the results in case others were interested, but I don't think there's any reason to worry beyond that. |
That's so surprising! And that's at the CDN side right? Maybe all the clients cache effectively already, but the new metadata helps them resolve quicker and have smaller caches if they start from cold... It definitely feels noticeably quicker... I wonder where/how the best way to measure it might be. |
I don't think I can, sorry: my comment was based on an impression, and while I could measure the time now, it's probably not worth rolling back the whole thing so I can make the compared measurement for "before". If the part that depends on this was done in a CI step, I could compare, but this only happens on dependency resolution, which is a manual-only operation on this repo. |
Yes. My guess is that it's three things:
...so any gains here are largely dwarfed by our massive overall bandwidth and weekly growth. |
Regarding bandwidth and performance, there are some awesome pip PRs open by @cosmicexplorer that will also take advantage of the backfill: pypa/pip#12186, pypa/pip#12256, pypa/pip#12257, pypa/pip#12258 - e.g. quoting the second one:
# this branch is #12186, which this PR is based off of
> git checkout metadata_only_resolve_no_whl
> python3.8 -m pip install --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real 0m36.410s
user 0m15.706s
sys 0m13.377s
# switch to this PR
> git checkout link-metadata-cache
# enable --use-feature=metadata-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real 0m5.671s
user 0m4.429s
sys 0m0.123s |
Currently a number of projects are trying to work around the fact that in order to resolve dependencies in Python you have to download the entire wheel in order to read the metadata. I am aware of two current strategies for working around this, one is the attempt to use the PyPI JSON API (which isn't a good solution because it's non standard, the data model is wrong, and it's not going to be secured by TUF) and the other is attempting to use range requests to fetch only the
METADATA
file from the wheel before downloading the entire wheel (which isn't a good solution because TUF can currently only verify entire files, and it depends on the server supporting range requests, which not every mirror is going to support).It seems to me like we could side step this issue by simply having PyPI extract the
METADATA
file of a wheel as part of the upload process, and storing that alongside the wheel itself. Within TUF we can ensure that these files have not been tampered with, by simply storing it as another TUF secured target. Resolvers could then download just the metadata file for a wheel they're considering as a candidate, instead of having to download the entire wheel.This is a pretty small delta over what already exists, so it's more likely we're going to get it done than any of the broader proposals of trying to design an entire, brand new repository API or by ALSO retrofitting the JSON API inside of TUF.
The main problems with it is that the
METADATA
file might also be larger than needed since it contains the entire long description of the wheel and that it still leaves sdists unsolved (but they're not currently really solvable). I don't think either problem is too drastic though.What do folks thinks? This would probably require a PEP and I probably don't have the spare cycles to do that right now, but I wanted to get the idea written down incase someone else felt like picking it up.
@pypa/pip-committers @pypa/pipenv-committers @sdispater (not sure who else work on poetry, feel free to CC more folks in).
The text was updated successfully, but these errors were encountered: