Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kickstart the schemas revamp (WIP) #26

Draft
wants to merge 19 commits into
base: main
Choose a base branch
from
Draft

Conversation

jaimergp
Copy link

@jaimergp jaimergp commented May 8, 2023

Description

xref conda/ceps#52

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

Tasks

Item Model Tests Docs
channeldata.json X
conda_build_config.yaml
conda.lock
condarc X
constructor.yml
environment.yml X
conda-meta/history X
MatchSpec X
menuinst.json
meta.yaml X
info/*
PackageRecord X
conda-meta/record.json
patch_instructions.json X
repodata.state
repodata.json X
requirements.txt X
Version
types X
exceptions

@conda-bot conda-bot added the cla-signed [bot] added once the contributor has signed the CLA label May 8, 2023
jaimergp and others added 11 commits May 9, 2023 12:16
class RepodataPatchInstructions(ExtrasForbiddenModel):
packages: dict[TarBz2PackageFileNameStr, AllOptionalRepodataRecord]
"The .tar.bz2 packages in the repodata that will be patched"
packages_conda: dict[CondaPackageFileNameStr, AllOptionalRepodataRecord] = Field(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it doesnt make sense to patch all fields from a RepodataRecord. Changing the name of a package doesnt make sense and changing the hash might break cache implementations. In Rattler we explicity limit the fields that can be patched for that reason. We also didnt encounter fields not covered by our implementation. Might be worth considering!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that it should be prevented and it's A Bad Idea (tm).

So while the file format seems to accept everything, there are some guidelines place for "reasonable" use and it looks like this is mostly addressed in code reviews, but sometimes we have exceptions. 🤷

I've been thinking whether we want to get descriptive or prescriptive in this PR, and I have mixed feelings about it. For scalars, I've been trying to be more prescriptive, but with structs... just reflecting what (I think) is in use these days. I am not saying I won't change my mind about it, just writing down what the approach have been so far :D

Thanks for the feedback!

Also, are we sure we have a packages.conda field? Maybe the readme is outdated, and the seed dict is too?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking whether we want to get descriptive or prescriptive in this PR, and I have mixed feelings about it. For scalars, I've been trying to be more prescriptive, but with structs... just reflecting what (I think) is in use these days. I am not saying I won't change my mind about it, just writing down what the approach have been so far :D

I understand where you're coming from. With rattler we definitely choose the prescriptive approach because I feel like there is way too much "legacy" code and supporting all those exceptions is simply harder. But the same logic does not necessarily hold for this project.

The packages.conda field is definitely used by the conda-forge-repodata-patches package. The readme does indeed seem to be outdated. Here is a small excerpt from the latest linux-64 patches:

 "packages.conda": {
    "4ti2-1.6.9-hbc9de56_1.conda": {
      "license_family": "GPL"
    },
    "acctaudem-1.0.1-h69042ef_3.conda": {
      "license_family": "GPL"
    },
    "actions-runner-2.299.1-he0ac6c6_0.conda": {
      "license_family": "MIT"
    },
    "actions-runner-2.300.0-h0cdce71_0.conda": {
      "license_family": "MIT"
    },

Copy link

@morremeyer morremeyer Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the current state, I thin we should be descriptive in the initial implementation: The first step in this effort is to capture what is used currently so that people and projects can refer to the schema to ensure that their software is implemented correctly.

In further steps, we could get more and more prescriptive.

This brings me to a second thing that we'll need to have: versioning. As with everything, at some point, the schema will change.

To enable implementations to reference "the schema" they're using, we'll need to version the schemas. If we don't, we effectively can't change them without breaking something (be that software or user trust).

Edit: Is the repodata_version the version of the schema?

subdir: SubdirStr
"The subdirectory of the channel this package is in"
timestamp: Optional[PositiveInt] = None
"The date this entry was created"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be specified at seconds since epoch UTC? Also, is the intent to track the date the package was built, the date the record was created, or something else?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already a unix timestamp (from the unix epoch which is UTC) but in milliseconds

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some older packages specify a timestamp in seconds. Would definitely be good to document this!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird system. If the seconds would be in the far future, divide by 1000. But that's how it works. I prefer seconds.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure is!

This is what we do in rattler.

While building http://prefix.dev we ran into more of these slight inconsistencies. Everything we encountered we added to the rattler library.

"The md5 hash of the package archive"
name: PackageNameStr
"The name of the package"
noarch: NoarchStr
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Optional? Many current package records on .ORG do not include this field.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jep!

Copy link

@barabo barabo Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the present state of package records in anaconda.org, there are several commonly used fields included in package records that are not accounted for here. To handle the pragmatics of removing them without breaking anyone downstream, I'm wondering if there should be a new deprecated field which would be an object containing the deprecated values.

Consider this example package current listed in .ORG:

    "cuda-11.3.0-h3b286be_0.tar.bz2": {
      "arch": "x86_64",
      "binstar": {
        "channel": "main",
        "owner_id": "5be2119d5c4b9a1104ccdc8f",
        "package_id": "60ef382c6bd0f2f84f7c23e1"
      },
      "build": "h3b286be_0",
      "build_number": 0,
      "depends": [
        "cuda-runtime >=11.3.0",
        "cuda-toolkit >=11.3.0"
      ],
      "has_prefix": false,
      "machine": "x86_64",
      "md5": "2685f7931e7576b43ce597a87ae78091",
      "name": "cuda",
      "operatingsystem": "linux",
      "platform": "linux",
      "requires": [],
      "size": 2030,
      "subdir": "linux-64",
      "target-triplet": "x86_64-any-linux",
      "timestamp": 1627403602120,
      "version": "11.3.0"
    }

With a deprecated section it might look like this:

    "cuda-11.3.0-h3b286be_0.tar.bz2": {
      "arch": "x86_64",
      "build": "h3b286be_0",
      "build_number": 0,
      "depends": [
        "cuda-runtime >=11.3.0",
        "cuda-toolkit >=11.3.0"
      ],
      "deprecated": {
        "binstar": {
          "removal_date": "2024-09-20",
          "value": {
            "channel": "main",
            "owner_id": "5be2119d5c4b9a1104ccdc8f",
            "package_id": "60ef382c6bd0f2f84f7c23e1"
          }
        },
        "has_prefix": {
          "removal_date": "2024-01-01",
          "value": false
        },
        "machine": {
          "removal_date": "2024-01-01",
          value: "x86_64"
        },
        "operatingsystem": {
          "removal_date": "2024-09-20",
          "value": "linux"
        },
        "requires": {
          "note": "consider using the 'constrains' attribute.",
          "removal_date": "2024-01-01",
          "value": []
        },
        "target-triplet": {
          "removal_date": "2024-01-01",
          "value": "x86_64-any-linux"
        }
      },
      "md5": "2685f7931e7576b43ce597a87ae78091",
      "name": "cuda",
      "platform": "linux",
      "size": 2030,
      "subdir": "linux-64",
      "timestamp": 1627403602120,
      "version": "11.3.0"
    }

Copy link

@barabo barabo Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, WRT the RepodataRecord resource, is the intent that it's a write-once resource? I assume this is true because of the existence of hotfixing, but it might be worth clarifying that somewhere.

If there is strong intent that the resource IS an immutable record, it might also be reasonable to define a canonicalized record hash (or signature), which could be used to verify the record is complete.

While I don't love stuffing a verifier into an object that must first be removed for it to be verified, I do like that it would be 'close' to the record, and not in a separate section (like the signatures section used in conda-content-trust).

Anyway, if there's no appetite for such a feature, I'd be happy with a comment somewhere clarifying whether the resource is or isn't intended to be immutable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not include deprecated fields an a deprecated field, but rather include them as they are used today. That will make the schema more usable for e.g. autocompletion.

However, we do need a mechanism to deprecate and track deprecations.

@chenghlee
Copy link
Contributor

IMO, @barabo's comments bring up an excellent point: what we should consider "canonical", as things stand today? I've been operating under the [pseudo-]assumption that "canonical" means whatever conda/conda-build/conda-index/etc. do, but those clearly don't quite match the ecosystem's package hosting provider (anaconda.org) does...except in the cases on the "major" channels like Anaconda, conda-forge, bioconda, etc. do. 🙃

@baszalmstra
Copy link

We’ve had a similar discussion above: #26 (comment)

@jaimergp
Copy link
Author

jaimergp commented Nov 8, 2023

FTR, @morremeyer and I talked yesterday about this. One idea that came up was to "canonicalize" the classic implementation in conda/conda-build/conda-info, but leave the schema open for extra fields (which seems to be something Anaconda.org uses). Those extra fields can be further documented in derivative schemas (à la Python subclass) for those non-canonical channels.

Once we have schemas and versions for those, then we can start deprecating and migrating implementations over to a unified standard.

We also discussed how the fields in depends and constrains are basically free strings, and are only validated by chance during the test section in conda-build (but for example, an unused run_constrained item with a bad version will not ring any alarms until a user tries to install something that brings it to the solver!). Ideally the syntax in these strings is also standardized with a grammar or at least a decent regex:

package_name [version [build_string]]

, where package_name has its rules, and so do version and build_string.

@baszalmstra
Copy link

All of that sounds perfect! 👌

Copy link

github-actions bot commented Nov 8, 2024

Hi there, thank you for your contribution!

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed automatically if no further activity occurs.

If you would like this pull request to remain open please:

  1. Rebase and verify the changes still work
  2. Leave a comment with the current status

NOTE: If this pull request was closed prematurely, please leave a comment.

Thanks!

@github-actions github-actions bot added the stale [bot] marked as stale due to inactivity label Nov 8, 2024
@jaimergp
Copy link
Author

jaimergp commented Nov 8, 2024

Not stale

@github-actions github-actions bot added stale::recovered [bot] recovered after being marked as stale and removed stale [bot] marked as stale due to inactivity labels Nov 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed [bot] added once the contributor has signed the CLA stale::recovered [bot] recovered after being marked as stale
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

7 participants