-
Notifications
You must be signed in to change notification settings - Fork 253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a metadata API #383
Comments
|
I'll cross-link the #332 draft PR here. It contains no actual validation logic, and is only focused on defining an API that tries to align with what was being discussed in https://discuss.python.org/t/package-metadata-description-field/4508/ -- basically your first approach. It seems like regardless of the approach, using |
@uranusjr It basically returns a dictionary. And I don't think there needs to be a direct discussion on that as the target audiences are different: we are looking at tool support for packaging front-end and back-ends while @di that PR is what instigated all of this. 😄 With PEP 621 thrown into the mix since that PR was opened I tried to think of an API where the three-way data/no-data/dynamic state could be cleanly represented, and dicts came the closest for me. As for |
Yes, with inheritance. Does that still work w/ what you're thinking about? |
I was worrying about |
It should (and the PEP says inheritance is supported). The one question would be what to do with One is to just leave it as a Two, update the potential literal values at each bump and only support returning the latest one (but that assumes types don't ever change for existing keys). This would mean we return whatever the newest Three, we do whatever inheritance tricks we need to for superclasses and have an explicit literal for each metadata version and have all functions return the union of all possible metadata types. That's the most explicit and the most accurate going forward. And to to be clear about that third option, I'm thinking: class RequiredMetadata(typing.TypeDict):
name: str
version: Optional[version.Version]
...
class CommonMetadata(RequiredMetadata, total=False):
...
class MetaData2_2(CommonMetadata):
metadata_version: typing.Literal["2.2"]
class Metadata2_1(CommonMetadata):
metadata_version: typing.Literal["2.1"]
... There's also the question of naming (for either dicts or objects). Would we use the capitalization approach of core metadata spec or lower-case like how it's written in PEP 621 in TOML? And what about dashes? If we use the alternative syntax for |
@uranusjr one trick with no file access is PEP 621 and the |
Hi there.
I prototyped & used both forms, and, as this is of particular interest to me, let me share my perspective on the pros and cons of each. For the dict form, I used Paul Moore's gist. This approach makes for the most concise implementation, and puts the spec front and center. But it is an unimaginable pain to have to work with the metadata keys in the regime where an arbitrary string is accepted. Without looking at the spec / code, could you tell which of these three triplets contain only correct keys?
That's right, none of them 🙃. The issue is alleviated somewhat by naively converting to the json form on the fly ( I guess this is mostly fixed by The dataclass style, on the other hand, provides many opportunities for improving usability, depending on how far away would you like to get carried with it. So far my experience with it is that the further you go, the less the spec is visible in the API - maybe that's a good thing? My implementation goes quite far in that direction, mostly because I need to teach this stuff to an audience that isn't very familiar with the subject, and I don't want to waste time on debugging silly mistakes. Disclaimer: I had to make this far longer than it should be, cause the usecase for the v0.0.1 forbids me from using I tried the following:
Code completion makes using this a breeze, and improves discoverability. While this wins on the usability front - the poka-yoke of it - the big con of all of that is simply: the implementation feels like taking out a bazooka to shoot down a fly. Converting between names of attributes and the keys from the spec is counter-intuitive, and the fact that the spec keys doesn't have consistent (i.e. algorithmic) casing doesn't help. Inheritance between classes that implement different version of this spec is probably not possible nor advisable - DRY would probably have to be violated in places. I would be eternally grateful if any of you would provide any thoughts on this. This should by no means be read as me trying to advertise my implementation for inclusion in |
The second is actually valid, if I read it correctly. Since |
Maybe you meant the first one - then yes, that's correct, if you assume case insensitivity.
Is it correct to assume that all tools that are made up-to-spec use metadata keys in a case-insensitive manner? That's a big assumption that isn't explicitly in the spec. |
All tools? Of course not 😉 But it is my understanding that common tools (basically |
Being able to take mis-cased keys is a must if you wish to go down the dict route. But based on what you've said, I don't think outputting them that way is a good idea. Especially in case of wheels, where these mistakes become persistent. I feel like you could easily end up in a situation where tools based on different implementations of metadata parsers cannot read each others' wheels. |
It’s also notable that no common tools (AFAIK) can read Core Metadata into a data structure and then output that data structure into Core Metadata. Existing tools I know all produce Core Metadata from a foreign format, so the point you raised here (whether to preserve casing) is untrotted territory. I do agree with your though; it should be preferrable if the tool canonicalise cases though. |
@MrMino Thanks for the thorough look at this and the feedback!
Basically, if you run a type checker (e.g. mypy), it will tell you when you get something wrong. E.g., in VS Code with the Pylance extension (disclaimer: I work at MS and work closely with the Pylance team): For your "could you tell which of these three triplets contain only correct keys?" example, a type checker would tell you when you were at least using an incorrect key.
Depends on the tool, but hopefully they can eventually provide it (and I'm going to open an issue on Pylance to support this exact feature right now 😁 ). |
That is a key goal of this for me. I want a way to read and write core metadata, so this is part of defining a common data format to go to/from while being easy to work with from a tool author perspective. |
Yea, that's what I was expecting, but since I haven't used it I wouldn't know if the tools support it (and how).
Me too. I'm worried though, that projects like jedi might not have taken this into account architecturally. Providing autocompletion in string literals would be the last thing I would think of if I were to create an autocompletion engine. This seems like something ~atypical to implement. Wonder if I can PR this there, it would be super handy to have that. |
Wow, actually, scratch that. Jedi already supports TypedDicts on a branch (see jedi/#1478). Things seem to be moving much faster than I expected. |
FYI Pylance landed key completions for indexing on a TypedDict. What I think I might do is write out the data representation using both an object and TypedDict and see how they look (and post them here). Then based on that we can make a decision as to which looks like it would be the best going forward. |
@brettcannon feel free to copy paste some of mine for the attributes. Writing this by hand gets tedious fast 😉. |
OK, as a way to do some independent validation of ideas, I purposefully didn't read what @MrMino while writing out the following examples. I covered:
from __future__ import annotations
import pathlib
import typing
from typing import Any, FrozenSet, Iterable, Literal, Mapping, Type
from . import requirements
from . import specifiers
from . import version as pkg_version
# https://packaging.python.org/specifications/core-metadata/
class RequiredMetadata(typing.TypedDict):
metadata_version: Literal["1.0"] | Literal["1.1"] | Literal["1.2"] | Literal[
"2.1"
] | Literal["2.2"]
name: str
version: pkg_version.Version
class MetadataDict(RequiredMetadata, partial=True):
# `dynamic` is left out as it's calculated based on attributes being set to `None`.
# dynamic: 2.2
platforms: Iterable[str] | None
supported_platforms: Iterable[str] | None # 1.1
summary: str | None
description: tuple[str, str] | None # content-type added in 2.1.
keywords: Iterable[str] | None
home_page: str | None
download_url: str | None # 1.1
authors: Iterable[
tuple[str, str]
] | None # Is there a better way to represent name and email?
maintainers: Iterable[tuple[str, str]] | None # 1.2
license: str | None
classifiers: str | None # Should be typed based on trove-classifiers package. 1.1
requires_dists: Iterable[requirements.Requirement] | None # 1.2
requires_python: specifiers.SpecifierSet | None # 1.2
requires_externals: Iterable[str] | None # 1.2
project_urls: Mapping[str, str] | None # 1.2
provides_extras: Iterable[str] | None # 2.1
provides_dists: Iterable[requirements.Requirement] | None # Or str? 1.2
obsoletes_dists: Iterable[requirements.Requirement] | None # Or str? 1.2
class BaseMetadata(typing.TypedDict):
name: str
version: pkg_version.Version
class Metadata_1_0_Base(RequiredMetadata, partial=True):
platforms: Iterable[str] | None
summary: str | None
description: str | None
keywords: Iterable[str] | None
home_page: str | None
authors: Iterable[
tuple[str, str]
] | None # Is there a better way to represent name and email?
license: str | None
class Metadata_1_0(Metadata_1_0_Base):
version: Literal["1.0"]
class Metadata_1_1_Base(Metadata_1_0_Base, partial=True):
supported_platforms: Iterable[str] | None
download_url: str | None
classifiers: str | None
class Metadata_1_1(Metadata_1_1_Base):
version: Literal["1.1"]
class Metadata_1_2_Base(Metadata_1_1_Base, partial=True):
maintainers: Iterable[tuple[str, str]] | None
requires_dists: Iterable[requirements.Requirement] | None
requires_python: specifiers.SpecifierSet | None
requires_externals: Iterable[str] | None
project_urls: Mapping[str, str] | None
provides_dists: Iterable[requirements.Requirement] | None
obsoletes_dists: Iterable[requirements.Requirement] | None
class Metadata_1_2(Metadata_1_2_Base):
version: Literal["1.2"]
class Metadata_2_1_Base(Metadata_1_2_Base, partial=True):
# Using a two-item tuple would require having a 'description' mixin for
# previous versions as you can't redefine keys in a TypedDict.
description_content_type: str | None
provides_extras: Iterable[str] | None
def dynamic(metadata: MetadataDict) -> FrozenSet[str]:
"""Calculate the value of `dynamic` based on which keys have a `None` value."""
# TODO: Technically should return an Iterable of literals that match core
# metadata field names.
raise NotImplementedError
def from_project(
project: Mapping[str, Any], *, path: pathlib.Path | None = None
) -> MetadataDict:
"""Extract metadata from the `[project]` table of a `pyproject.toml` file.
The `path` argument is the path to the `pyproject.toml` file. This is
required to resolve and read files specified in the data relative to the
`pyproject.toml` file itself.
"""
raise NotImplementedError
def from_pkg_info(pkg_info: str, /) -> MetadataDict:
"""Parse PKG-INFO data for metadata."""
raise NotImplementedError
def from_metadata(metadata_source: str, /) -> MetadataDict:
"""Parse METADATA data for metadata."""
raise NotImplementedError
T = typing.TypeVar("T", bound="MetadataClass")
_UNSPECIFIED = object()
class MetadataClass:
# Any field that is set to `None` is considered dynamic, while an otherwise false value means purposefully unspecified.
name: str
version: pkg_version.Version
platforms: Iterable[str] | None
supported_platforms: Iterable[str] | None # 1.1
summary: str | None
description: tuple[str, str] | None # content-type added in 2.1.
keywords: Iterable[str] | None
home_page: str | None
download_url: str | None # 1.1
authors: Iterable[
tuple[str, str]
] | None # Is there a better way to represent name and email?
maintainers: Iterable[tuple[str, str]] | None # 1.2
license: str | None
classifiers: str | None # Should be typed based on trove-classifiers package. 1.1
requires_dists: Iterable[requirements.Requirement] | None # 1.2
requires_python: specifiers.SpecifierSet | None # 1.2
requires_externals: Iterable[str] | None # 1.2
project_urls: Mapping[str, str] | None # 1.2
provides_extras: Iterable[str] | None # 2.1
provides_dists: Iterable[requirements.Requirement] | None # Or str? 1.2
obsoletes_dists: Iterable[requirements.Requirement] | None # Or str? 1.2
def __init__(
self,
name,
version,
*,
platforms=(),
supported_platforms=(),
summary="",
description=("", ""),
keywords=(),
home_page="",
download_url="",
authors=(),
maintainers=(),
license="",
classifiers=(),
requires_dists=(),
requires_python=specifiers.SpecifierSet(),
requires_externals=(),
# A dict default value may get accidentally mutated.
project_urls=_UNSPECIFIED,
provides_extras=(),
provides_dists=(),
obsoletes_dists=(),
):
self.name = name
self.version = version
self.platforms = platforms
self.supported_platforms = supported_platforms
self.summary = summary
self.description = description
self.keywords = keywords
self.home_page = home_page
self.download_url = download_url
self.authors = authors
self.maintainers = maintainers
self.license = license
self.classifiers = classifiers
self.requires_dists = requires_dists
self.requires_python = requires_python
self.requires_externals = requires_externals
self.project_urls = project_urls if project_urls is not _UNSPECIFIED else {}
self.provides_extras = provides_extras
self.provides_dists = provides_dists
self.obsoletes_dists = obsoletes_dists
def dynamic(self) -> FrozenSet[str]:
"""Calculate the `dynamic` field based on which fields are set to `None`."""
raise NotImplementedError
@classmethod
def from_project(
cls: Type[T], project: Mapping[str, Any], *, path: pathlib.Path | None = None
) -> T:
"""Extract metadata from the `[project]` table of a `pyproject.toml` file.
The `path` argument is the path to the `pyproject.toml` file. This is
required to resolve and read files specified in the data relative to the
`pyproject.toml` file itself.
"""
raise NotImplementedError
@classmethod
def from_pkg_info(cls: Type[T], pkg_info: str, /) -> T:
"""Parse PKG-INFO data for metadata."""
raise NotImplementedError
@classmethod
def from_metadata(cls: Type[T], metadata_source: str, /) -> T:
"""Parse METADATA data for metadata."""
raise NotImplementedError
def to_pkg_info(self) -> str:
"""Return the representation of the metadata for PKG-INFO."""
raise NotImplementedError
def to_metadata(self) -> str:
"""Retrun the representation of the metadata for METADATA."""
raise NotImplementedError I think I prefer the class approach, but I'm flexible. I'm also curious what people think about the proposed types and naming for all the fields (I did take the plurality suggestion from @MrMino ). If we can get some agreement on this we can implement this gradually (i.e. just a class to hold the data, from |
Same. |
2 questions about the class approach:
|
I can answer both questions with the same answer: And we can't use |
Understood, thanks |
One thing that may inform attribute names is https://www.python.org/dev/peps/pep-0566/#json-compatible-metadata. |
FTR we had a Metadata class in distutils2, it was more that a typed data class/dict and did version detection, mutations and validation: https://hg.python.org/distutils2/file/tip/distutils2/metadata.py#l197 distlib updated it for versions 2.1 and 2.0 (withdrawn): https://bitbucket.org/pypa/distlib/src/master/distlib/metadata.py |
FYI, I have written a parser for PEP 621 metadata for my backend. It could be used as a starting point if it makes sense. https://github.com/FFY00/trampolim/blob/main/trampolim/_metadata.py |
I still hope to get to this some day, but not right now. In spite of that, my brain is still thinking about the API. 😄 As such, I'm trying to think through the single dict approach or the dataclass-like approach as the preferred representation. I think I'm leaning towards the dict so that in the future any metadata changes of adding keys will naturally flow into an API where keys may be missing compared to a class approach where the lack of an attribute could throw people if there's a mixing of old a new code/metadata versions. |
When you come up with a final API, I may be able to take some time to write the implementation. Let me know. |
Are there other issues to track the remaining aspects of the API? I can see an existing issue for the validation of metadata, but none related to the "ingestion" (via |
I have https://pypi.org/project/pkg-metadata/ but I don't know if I want to offer it to packaging or keep it standalone. It's still at a relatively early stage right now. |
@abravalheri GitHub got too aggressive with the auto-closing of this issue. 😅 |
Thank you very much @brettcannon 🎉 @pfmoore, I think it makes sense for |
@abravalheri I'm not against packaging having such APIs, I'm just not sure if the goals of If we can agree on an approach, I'm happy for |
Thank you very much for the clarification @pfmoore. I understand now, this makes a lot of sense (I also suggested a lenient conversion from/to email-header in #498). Maybe there is difference of interests in terms of who is the audience of the APIs Probably for PyPI a more strict approach would be suitable, while for a backend like setuptools ingesting not-particularly-well-formed email-headers would be the way to go as long as the output is well-formed. |
My personal use case is data analysis of projects already on PyPI. To put it politely, there are a lot of "deviations from the standards" on there 🙂 (some understandable, they are projects that pre-date the relevant standards, others less so, they simply make no sense to me at all!) I want a library that can parse and read all of that data, so that I can analyze the weirdnesses with useful data structures. The I don't know how best to handle the discrepancy in use cases here, which is why I say that for now I don't know what will happen. |
My wheel API project has exactly this problem. To maximize the utility it has to cater to both usage scenarios, and this means having separate "strict" and "lenient" modes (or at least it will, whenever I get to implementing that ;)). It would be nice if |
As of right now I would say what @pfmoore has for lenient parsing serves that group well (which are probably installers). But for builders, you want the strict solution to prevent users from propagating mistakes, and that doesn't seem to exist in a reusable package. And there are also those who want to tell users that their metadata is malformed (e.g. PyPI rejecting packages that messed something up). So the current plan is to support strict to start. But that isn't to say the strict code couldn't be tweaked to emit a warning instead of raising an exception is some if strict:
raise InvalidMetadata(msg)
else:
warnings.warn(msg, MetadataWarning) ... or whatever the equivalent would be. But what I'm not personally up for is maintaining two copies of metadata parsing to support both use-cases. |
IMO, it's important for a project to have a clear understanding of its own goals, and resist the temptation to try to be "all things to all men" as the saying goes. If Footnotes
|
From my experience, the amount of packages that fail to meet the standard, or use PEPs that aren't accepted yet (or will never be), is far larger than what one would expect. My personal goal for I imagined it to be a viable strategy for |
That is only a concern if (a) we choose to reject core metadata fields that shouldn't be there, and (b) you want access to those fields somehow. BTW, it's way easier to go from strict to lenient than the other way around. I would much rather be as strict as possible, do an analysis of where things fail from packages on PyPI due to misaligned expectations/assumptions, and then loosen accordingly to find a good middle ground. |
If The use case is building from old sdists:
|
I'm thinking Flit, Hatch, Setuptools, etc. that have a |
Since version caps for the build dependencies are not the most prevalent practice in the Python packaging ecosystem, I think any build-backend like Flit, Hatch, Setuptools would have to deal with old sdists. For example, let's say Setuptools adopts (Just to clarify, I think it is completely OK for |
Yep, setuptools specifically is in a tough spot with this.
And I do appreciate it! Here are my rough plans for
Footnotes |
@brettcannon Let's file a tracking issue, with that list as a checklist? That way, others can chime in and help get this over the line and we have a single place to look at how this is progressing. :) |
#570 has the TODO list. |
This is related to #147, but not quite since I want to go beyond validation and have read/write support for things like TOML,
METADATA
, etc.I see two potential API forms. One is a dataclass-like object that stores the data with some methods to help read/write to various formats. The other form is a dict with various functions to manage the reading/writing.
I actually think the dict approach might be the best and easiest to work with for two reasons. One, I don't see anything inherent in the potential design that will require OOP (i.e. method overriding to influence results). I have not thought of an API that would call other methods where overriding would be necessary.
Two, it makes representing the explicit lack of data versus what goes into
dynamic
ala PEP 621 more obvious. If the lack of a key means "information is explicitly not provided", then that can be expressed in a dict. If we make the existence of a key with aNone
value represent something that would go intodynamic
, then we don't have to maintaindynamic
details separately. Using a dataclass doesn't buy us anything over this as it makes representing no data a little trickier as we would have to add some way to delineate that (e.g. some other false value that isn'tNone
?). Best benefit to a dataclass I can think of is a property to dynamically calculatedynamic
(😁 ) which could also be done with a function.There's also no loss in typing as a partial
TypedDict
would work in this instance.So for me, the simplicity of using a dict for storing data in the appropriate formats and using the lack of key to delineate purposefully missing data is the way to go.
The text was updated successfully, but these errors were encountered: