-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix requires_python metadata + add repair metadata command #779
Conversation
e320362
to
08eeaba
Compare
"""Common list parsing for a string of hrefs/prns.""" | ||
h = rf"(?:{settings.API_ROOT}(?:[-_a-zA-Z0-9]+/)?api/v3/repositories/python/python/[-a-f0-9]+/)" | ||
p = r"(?:prn:python\.pythonrepository:[-a-f0-9]+)" | ||
r = rf"{h}|{p}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "r" is technically needed here. But hey...
Have you heard about verbose regex though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I haven't, is there a better way to rewrite this? I'm so bad at regex...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h = rf"(?:{settings.API_ROOT}(?:[-_a-zA-Z0-9]+/)?api/v3/repositories/python/python/[-a-f0-9]+/)" | ||
p = r"(?:prn:python\.pythonrepository:[-a-f0-9]+)" | ||
r = rf"{h}|{p}" | ||
return re.findall(r, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will just ignore all the parts of the string that don't match either right?
I think we should whitespace split and match individually to be able to identify bad input.
Specifically since gibberish is currently handled as if not specified at all and that would be surprising to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it currently ignores all bad input. I'll change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also think about re.compile
when calling more often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed it. Let me know if I need to change it some more
assert content.version == "0.1" | ||
assert content.packagetype == "sdist" | ||
assert content.requires_python == "" # technically null | ||
assert content.author == "Austin Macdonald" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋 @asmacdo
:param content: The PythonPackageContent queryset. | ||
Return: number of content units that were repaired | ||
""" | ||
# TODO: Add on_demand content repair? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This data repair is not meant to unblock a migration / update, right?
In that case I'd say a best effort approach is just fine. We never really own the on-demand content anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it's not meant to unblock, but I have a feeling that a lot of packages in users' repositories contain bad metadata. Like when we do a sync (which are default on-demand) currently all the non-latest versions of a package potentially get incorrect metadata because the majority of the available metadata (the info
section) is from the latest package. There is a way to get the correct metadata for each version, but it adds an API call for each version, so I've never implemented it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How bad is it? Is it documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not too hard, it's this api: https://docs.pypi.org/api/json/. You've seen it around, I've been using it in some of my scripts (oci-images check updates script for example)
08eeaba
to
72f3cd3
Compare
for v in value.split(" ,"): | ||
if v: | ||
if match := r.match(v): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's parse even more user friendly:
for v in value.split(" ,"): | |
if v: | |
if match := r.match(v): | |
for v in value.split(","): | |
if v: | |
if match := r.match(v.strip()): |
72f3cd3
to
ac33f4b
Compare
fixes: #773
This also centralizes the parsing of metadata for PythonPackageContent into two functions:
parse_metadata
: when the input data is from the PyPI json apiartifact_to_python_content_data
: when the input data is directly from the artifact itselfI'll probably rename
parse_metadata
in a future PR since its real purpose is to prepare the python content data for saving (really only used in the sync pipeline right now). All the other ways python content is made, pulp upload, pypi (twine) upload, pull-through caching, now useartifact_to_python_content_data
which should make it easier to add/remove metadata from the model.