-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PRC feedback - strict tool versioning for reproducibility #93
Conversation
openapi/ga4gh-tool-discovery.yaml
Outdated
@@ -643,6 +650,11 @@ definitions: | |||
updated: | |||
type: string | |||
description: Last time the container was updated. | |||
hashcode: | |||
type: string | |||
description: Helpful for immutable tool versions, but can be independent. This exposes the hashcode for specific image versions to verify that the container version pulled is actually the version that was indexed by the registry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that the hashcode can be used for things other than immutable tool versions? The phrase can be independent does
sounds a little off to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it could potentially be useful even if the version is not immutable (i.e. if I want to check whether the version I got is what I expected to get, even if the version might be expected to change in the future).
Discussion:
Further discussion offline |
discussion with @denis-yuen :
I can't contribute about the biocontainers implementation, but if anyone from that team (@ypriverol @osallou) is at the BOSC CoFest on Friday-Saturday, please reach out to @Ldcabansay. |
Just some thoughts to share here. Tool (or software) versioning is a general topic, there are good practices exist, like Semantic Versioning. The idea of versioning is to create a snapshot of the tool/code so that running the same version at any later time users would expect same behaviour. Git allows us to pin the source code down to every single character, Docker empowers us to pin the entire compute environment down to every singe byte. Both are done and guaranteed via content driven hashcode. DRS should take advantage of this great feature, basically once hashcodes for source code and container image are recorded for a particular tool release version, it should be all good in theory. Of course we still rely on git repo (if we care about source code, I think we do) and container image being kept around. One slight downside is that hashcode is not human-friendly, the common practice is to use human-friendly tags for release versioning. Semantic versioning is a good standard for that practice, it's both human- and machine-friendly. How should hashcode and version tag fit together in TRS to enable better reproducibility? When a new tool version is needed, a version tag (eg, |
There are a couple of items going on here that I want to separate.
I think the above meets @junjun-zhang 's requirements, and would defend against most simple attacks. Thoughts? |
@delagoya, I like the idea that TRS implementation checks and verifies image digest (hashcode) against Docker registry. Also like TRS records be immutable, as such there is a guarantee when a user uses a tool version |
PRC comments
|
Two thoughts: I agree that ideally, TRS implementations should get to a point where they can pull user's Docker images and record their SHA256 codes. However, the issue I have with it is making it mandatory for all records in TRS as opposed to a highly recommended optional that denotes which records are immutable. Although it (or something like it) is in our roadmap, we can't commit to having this capability in a timely manner which might lead to a weird situation where we have the standard but no/little available content for it for a good period of time. Specific comments:
Did this have something in mind going beyond the description in https://github.com/ga4gh/tool-registry-service-schemas/pull/93/files#diff-967acd66e0600b3c12d631991cecfb6dR95
That seems reasonable to me. In the PR, this would currently be represented as a record with immutable set to false and a hashcode that the user could use to detect Docker image changes.
Not entirely sure how to represent this, currently version_ids in TRS are human-centric tags. i.e. users navigate the endpoints using 1.0.20190618201008 as opposed to 4c905b8 given Perhaps a parallel set of endpoints accessed using digests as opposed to version_ids?
Pending discussion, I've added a hashcode to the FileWrapper as well |
@denis-yuen
I see there are two competing use cases:
I feel that it would be good if we have a flag to explicitly indicate whether a version is intended for production use or not. With that in mind, maybe simply change the proposed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@denis-yuen Could we sync the hashcode
field with how DRS implements checksums.
See Checksum
object in DRS https://github.com/ga4gh/data-repository-service-schemas/blob/144526746f75e6ff3079aa2189f46762d01599fb/openapi/data_repository_service.swagger.yaml#L168
Also see the DRS Checksum PR: ga4gh/data-repository-service-schemas#282
checksum object added and tweaked to match ga4gh/data-repository-service-schemas#282
@susheel updated to match discussion as it currently stands |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Changes from 2.0.0 + CI fix: * Explicitly define type in definitions * Change properties to snakes case and endpoints to camelcase * Move enum to its own definition * Incrementing version due to breaking changes * New endpoint and object definition to describe available files * For issues #21 and #22 * Renamed File definition to ToolFile due to swagger editor * Increment version in comment * Add more file types, better descriptions * Tool checker proposal (#26) From https://docs.google.com/document/d/1PTge27WBOKCiR2MkVSOvWKUwnFk0kx-bjq39aKRQxcY/edit?usp=sharing * Generalize the TRS to account for Singularity and Nextflow * Change toolversion's boolean indicator * Add terms of service placeholder * Install swagger2openapi * Adding operationIds (#33) See dockstore/dockstore#1210 * Moved checker to a property in Tool, also added a boolean * Specify unencoded paths are also allowed * Feature/search (#48) * Add option for search for checker workflows * For #37 , allow meta_version to be optional ga4gh/workflow-execution-service-schemas#6 ga4gh/workflow-execution-service-schemas@337cd42 * Add alias suggestion from #38 * Standardize file fields * Simplify standard (#49) * Feature/link check (#55) * Create CONTRIBUTING.md (#65) * Automatically generate docs for all branches and tags * TRS documentation update (#64) * Update openAPI3.sh * Update BasePath (#72) Updated base path to `basePath: `/ga4gh/trs/v2` * Update ga4gh-tool-discovery.yaml (#74) * Updated LICENCE Copyright (#76) * Include mention to Nextflow (#77) Mentioning nextflow as long the API is supporting workflows based on Nextflow language (NFL) https://github.com/pditommaso/tool-registry-service-schemas/blob/develop/src/main/resources/swagger/openapi.yaml#L741-L744 * Allow for more than one container image (#58) * fix type, update OpenAPI version * fix array examples * Update openapi.yaml * Rename metadata to match GA4GH PAP (#78) * Adding zenodo reference as recommended * Adding contact URL inspired by WES for PAP From security questionaire * More convention sync with cloud workstream APIs (#84) * Working on #82 also see ga4gh/workflow-execution-service-schemas#134 * Working on harmonizing with GA4GH Discovery https://github.com/ga4gh-discovery/service-info * Fix description for #85 * WS-2019-0032 WS-2019-0063 * Attempt to fix tooling (#89) * Update README.md for gitflow (#91) * Simple PRC feedback PR (#92) * Feedback from BOSC * Typo in example (#99) * Remove service-info to avoid breaking change (#101) * Move around a few fields (#97) * test gradle fix for site deployment (#105) * Optional auth (#102) * PRC feedback - strict tool versioning for reproducibility (#93) * Update to match discussion in DRS checksum object added and tweaked to match ga4gh/data-repository-service-schemas#282 * Corrects link to GA4GH . checksum hash algorithm registry (#106) * Correcting hash algorithm registry link * Correcting hash alg registry link * Fixing unrelated link * Glitchy badge (#112) * Update registry.json (#115) * Fix CI (#123) Co-authored-by: Denis Yuen <denis.yuen@oicr.on.ca> Co-authored-by: Yasset Perez-Riverol <ypriverol@gmail.com> Co-authored-by: Chris Llanwarne <cjllanwarne@users.noreply.github.com> Co-authored-by: Susheel Varma <susheel@users.noreply.github.com> Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com> Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
This is the compromise that I'm proposing for the PRC feedback for tool versioning.
This should open the possibility of implementations providing/indexing immutable tool versions only.
However, it doesn't restrict us only to implementations that provide immutable versions.
FYI @DCGenomics @junjun-zhang @rishidev