[reproducibility] Versioning the repodata #3

Open
dhirschfeld opened this issue Jul 15, 2020 · 3 comments

Comments

@dhirschfeld

The solution conda/mamba arrives at depends on the universe of packages in the repodata. This means that solving the same specs at different times can produce different environments.

The proper solution to this is to export the explicit specs at the time you create the environment so that it can be recreated exactly in future.

Versioning the repodata can help in cases where the analyst didn't export the explicit specs for their environment. In that case, to reproduce their results, it would help to be able to specify an `as_of_timestamp` and solve for the environment against the state of the repodata as of that time.

@dhirschfeld
Author

Prior Art:

To achieve similar reproducibility goals, MRO (Microsoft R Open) snapshots the entire CRAN universe every day. Where you control the repository, you can instead simply associate an inserted/uploaded timestamp with the package data and filter on that to present the state of the repository at any given time.

ref: https://mran.microsoft.com/documents/rro/reproducibility

@wolfv
Member

wolfv commented Jul 22, 2020

Hi @dhirschfeld, thanks for the comment!
Indeed, this is already possible with what we have today, because each package has a timestamp in the repodata.json. We could simply filter out all packages with a timestamp > 123 (keeping only those with a timestamp <= 123) to get the repodata state as of 123.
It will require some work in mamba to make it happen, though.
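
For illustration, the filtering idea above could be sketched roughly as follows. This is not mamba's implementation; `filter_repodata` and `keep_untimestamped` are hypothetical names, and the sketch assumes the usual repodata.json layout with `packages` and `packages.conda` sections whose records may carry an integer `timestamp` in one consistent unit (worth verifying for your channel). Records without a timestamp are dropped by default, which is a policy choice rather than a given.

```python
def filter_repodata(repodata, as_of, keep_untimestamped=False):
    """Return a shallow copy of repodata keeping only records with timestamp <= as_of.

    Hypothetical helper: `as_of` must use the same unit as the record timestamps.
    """
    def keep(record):
        ts = record.get("timestamp")
        if ts is None:
            # No timestamp to compare against; fall back to the chosen policy.
            return keep_untimestamped
        return ts <= as_of

    filtered = dict(repodata)
    for section in ("packages", "packages.conda"):
        if section in repodata:
            filtered[section] = {
                filename: record
                for filename, record in repodata[section].items()
                if keep(record)
            }
    return filtered


# Toy repodata with made-up timestamps, matching the "as of 123" example.
repodata = {
    "info": {"subdir": "linux-64"},
    "packages": {
        "foo-1.0-0.tar.bz2": {"name": "foo", "version": "1.0", "timestamp": 100},
        "foo-2.0-0.tar.bz2": {"name": "foo", "version": "2.0", "timestamp": 200},
        "bar-0.1-0.tar.bz2": {"name": "bar", "version": "0.1"},
    },
}

snapshot = filter_repodata(repodata, as_of=123)
print(sorted(snapshot["packages"]))  # ['foo-1.0-0.tar.bz2']
```

The solver would then run against `snapshot` instead of the live repodata. The original dictionary is left untouched, so several snapshots can be derived from one download.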

@dhirschfeld
Author

IIUC that's the build timestamp of the package? If you assume that all packages are built by CI and uploaded immediately afterwards, then it could stand in as a proxy for when the package became available for a client (mamba/conda) to query/download. That assumption doesn't always hold, though.
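
A small sketch of where the two timestamps can disagree. The `Record` type, its `uploaded_at` field, and the numbers are hypothetical; repodata as discussed here only carries the build timestamp, so an upload time would have to be recorded by the repository itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    filename: str
    built_at: int     # build timestamp stored with the package
    uploaded_at: int  # hypothetical: when the repository received the file


def visible_as_of(records, as_of, key):
    """List filenames whose chosen timestamp field is <= as_of."""
    return [r.filename for r in records if getattr(r, key) <= as_of]


records = [
    # Built by CI and uploaded right away: both timestamps agree closely.
    Record("foo-1.0-0.tar.bz2", built_at=100, uploaded_at=105),
    # Built early but uploaded much later, e.g. after a manual build.
    Record("foo-1.1-0.tar.bz2", built_at=110, uploaded_at=300),
]

by_build = visible_as_of(records, 200, "built_at")
by_upload = visible_as_of(records, 200, "uploaded_at")
print(by_build)   # ['foo-1.0-0.tar.bz2', 'foo-1.1-0.tar.bz2']
print(by_upload)  # ['foo-1.0-0.tar.bz2']
```

A client solving at time 200 could not have seen `foo-1.1`, yet filtering on the build timestamp includes it, so the reconstructed environment could differ from the one originally installed.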
