Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PRC feedback - strict tool versioning for reproducibility #93

Merged
merged 13 commits into from
Aug 14, 2019

Conversation

denis-yuen
Copy link
Member

This is the compromise that I'm proposing for the PRC feedback for tool versioning.

This should open the possibility of implementations providing/indexing immutable tool versions only.
However, it doesn't restrict us only to implementations that provide immutable versions.

FYI @DCGenomics @junjun-zhang @rishidev

@denis-yuen denis-yuen requested review from ypriverol, susheel and a team July 15, 2019 21:15
@denis-yuen denis-yuen self-assigned this Jul 15, 2019
@@ -643,6 +650,11 @@ definitions:
updated:
type: string
description: Last time the container was updated.
hashcode:
type: string
description: Helpful for immutable tool versions, but can be independent. This exposes the hashcode for specific image versions to verify that the container version pulled is actually the version that was indexed by the registry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying that the hashcode can be used for things other than immutable tool versions? The phrase can be independent does sounds a little off to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it could potentially be useful even if the version is not immutable (i.e. if I want to check whether the version I got is what I expected to get, even if the version might be expected to change in the future).

@denis-yuen
Copy link
Member Author

Discussion:

  • recommendation to remove immutable, maybe all versions should be immutable @delagoya
  • maybe all implementations should keep track of specific docker containers
  • discussion on latest: how to we treat this and versions based on branches as opposed to tags

Further discussion offline

@hmenager
Copy link

discussion with @denis-yuen :

  • hashcode for the docker image to allow the client to check if e.g. it hasn't been replaced or rebuilt is good for multiple reasons (security?)
  • should not be mandatory esp for development purposes. maybe recommended for finalized images?

I can't contribute about the biocontainers implementation, but if anyone from that team (@ypriverol @osallou) is at the BOSC CoFest on Friday-Saturday, please reach out to @Ldcabansay.

@junjun-zhang
Copy link

junjun-zhang commented Jul 25, 2019

Just some thoughts to share here.

Tool (or software) versioning is a general topic, there are good practices exist, like Semantic Versioning. The idea of versioning is to create a snapshot of the tool/code so that running the same version at any later time users would expect same behaviour.

Git allows us to pin the source code down to every single character, Docker empowers us to pin the entire compute environment down to every singe byte. Both are done and guaranteed via content driven hashcode. DRS should take advantage of this great feature, basically once hashcodes for source code and container image are recorded for a particular tool release version, it should be all good in theory. Of course we still rely on git repo (if we care about source code, I think we do) and container image being kept around.

One slight downside is that hashcode is not human-friendly, the common practice is to use human-friendly tags for release versioning. Semantic versioning is a good standard for that practice, it's both human- and machine-friendly.

How should hashcode and version tag fit together in TRS to enable better reproducibility? When a new tool version is needed, a version tag (eg, 0.2.1) is created. Let's say at the moment container image hashcode is abcd..., if we guarantee the binding between 0.2.1 and abcd... is permanent, then we guarantee the release is immutable (hence reproducible). This should work well for production releases. For development where releases are experimental and temporary, eg, 0.2.1-rc.1, then it's reasonable to assume the binding of a particular image hashcode to 0.2.1-rc.1 is NOT permanent, a new hashcode may be bound with 0.2.1-rc.1 if needed. For common master / develop / feature git branches (and the corresponding container image tags), users should not except any of them tied to any stable hashcode.

@delagoya
Copy link

There are a couple of items going on here that I want to separate.

  1. Should a TRS record track a specific Docker image by SHA256 digest?

    • My strong opinion is that every TRS record should contain the SHA256 digest of the Docker image.
    • If a record is created by providing only Docker image tag, then the TRS implementation should be responsible for querying the Docker registry to get the SHA256 digest.
    • If a digest is provided along with a image tag, then the TRS implementation should be responsible for verifying that the tag and the. digest match at the Docker container registry.
  2. Should TRS records be immutable?

    • I am in favor of immutable records in TRS, and tagging records of significance with a human centric tag.
    • Each TRS record could have an underlying digest of the record fields, which would ensure that each record is immutable. Analogous to a git commit for every record change.
    • As I mentioned, some records would be able to be tagged with a human readable string. The tag is not part of the underlying record digest calculation, can change for a particular record version / digest, and tags can be re-assigned to a new digest. Very similar model to git and Docker registry tags.

I think the above meets @junjun-zhang 's requirements, and would defend against most simple attacks.

Thoughts?

@junjun-zhang
Copy link

@delagoya, I like the idea that TRS implementation checks and verifies image digest (hashcode) against Docker registry.

Also like TRS records be immutable, as such there is a guarantee when a user uses a tool version 0.2.1 he knows he is always running the same thing. For a particular tool run, if a user wants he can verify whether this guarantee is fulfilled by checking the image hashcode.

@rishidev
Copy link

PRC comments

  1. hash MUST be recorded, for production items. Optional for the development.
  • Semantic version makes things clear for a production release. RC may not have a hash, or the hash may not be stable, so should be mandatory, unless there are problems for representing the tool in this way.
  1. TRS wrapper should be immutable.

@denis-yuen
Copy link
Member Author

denis-yuen commented Jul 29, 2019

Two thoughts:

I agree that ideally, TRS implementations should get to a point where they can pull user's Docker images and record their SHA256 codes. However, the issue I have with it is making it mandatory for all records in TRS as opposed to a highly recommended optional that denotes which records are immutable.

Although it (or something like it) is in our roadmap, we can't commit to having this capability in a timely manner which might lead to a weird situation where we have the standard but no/little available content for it for a good period of time.

Specific comments:

@junjun-zhang

Semantic versioning is a good standard for that practice
Semantic version makes things clear for a production release

Did this have something in mind going beyond the description in https://github.com/ga4gh/tool-registry-service-schemas/pull/93/files#diff-967acd66e0600b3c12d631991cecfb6dR95

@junjun-zhang

For common master / develop / feature git branches (and the corresponding container image tags), users should not except any of them tied to any stable hashcode.

That seems reasonable to me. In the PR, this would currently be represented as a record with immutable set to false and a hashcode that the user could use to detect Docker image changes.

@delagoya

I am in favor of immutable records in TRS, and tagging records of significance with a human centric tag.

Not entirely sure how to represent this, currently version_ids in TRS are human-centric tags.

i.e. users navigate the endpoints using 1.0.20190618201008 as opposed to 4c905b8 given
https://github.com/common-workflow-language/cwltool/releases/tag/1.0.20190618201008 as an example

Perhaps a parallel set of endpoints accessed using digests as opposed to version_ids?

@rishidev

TRS wrapper should be immutable.

Pending discussion, I've added a hashcode to the FileWrapper as well

@junjun-zhang
Copy link

junjun-zhang commented Aug 1, 2019

@denis-yuen
Regarding semantic versioning, I think the question is whether it's required or recommended. Considering we'd like to encourage wide adoption of TRS, recommending it is a good choice (ie, https://github.com/ga4gh/tool-registry-service-schemas/pull/93/files#diff-967acd66e0600b3c12d631991cecfb6dR95 looks ok to me).

recommendation to remove immutable, maybe all versions should be immutable @delagoya

However, the issue I have with it is making it mandatory for all records in TRS as opposed to a highly recommended optional that denotes which records are immutable.

I see there are two competing use cases:

  • for production, image hashcode must be provided and immutable
  • for development, image hashcode may not be provided or may change even provided

I feel that it would be good if we have a flag to explicitly indicate whether a version is intended for production use or not.

With that in mind, maybe simply change the proposed immutable field to production or is_production? If production is true, hashcode is required and must be immutable.

Copy link
Member

@susheel susheel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@denis-yuen
Copy link
Member Author

@susheel updated to match discussion as it currently stands

Copy link
Member

@susheel susheel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@denis-yuen denis-yuen merged commit e58c2e4 into develop Aug 14, 2019
@denis-yuen denis-yuen deleted the immutable_support branch August 14, 2019 14:18
garyluu added a commit that referenced this pull request Feb 6, 2020
Changes from 2.0.0 + CI fix:

* Explicitly define type in definitions

* Change properties to snakes case and endpoints to camelcase

* Move enum to its own definition

* Incrementing version due to breaking changes

* New endpoint and object definition to describe available files

* For issues #21 and #22

* Renamed File definition to ToolFile due to swagger editor

* Increment version in comment

* Add more file types, better descriptions

* Tool checker proposal (#26)

From https://docs.google.com/document/d/1PTge27WBOKCiR2MkVSOvWKUwnFk0kx-bjq39aKRQxcY/edit?usp=sharing

* Generalize the TRS to account for Singularity and Nextflow
* Change toolversion's boolean indicator

* Add terms of service placeholder

* Install swagger2openapi

* Adding operationIds (#33)

See  dockstore/dockstore#1210

* Moved checker to a property in Tool, also added a boolean

* Specify unencoded paths are also allowed

* Feature/search (#48)

* Add option for search for checker workflows
* For #37 , allow meta_version to be optional

ga4gh/workflow-execution-service-schemas#6
ga4gh/workflow-execution-service-schemas@337cd42

* Add alias suggestion from #38
* Standardize file fields
* Simplify standard (#49)

* Feature/link check (#55)

* Create CONTRIBUTING.md (#65)

* Automatically generate docs for all branches and tags

* TRS documentation update (#64)

* Update openAPI3.sh

* Update BasePath (#72)

Updated base path to `basePath: `/ga4gh/trs/v2`

* Update ga4gh-tool-discovery.yaml (#74)

* Updated LICENCE Copyright (#76)

* Include mention to Nextflow (#77)

Mentioning nextflow as long the API is supporting workflows 
based on Nextflow language (NFL) https://github.com/pditommaso/tool-registry-service-schemas/blob/develop/src/main/resources/swagger/openapi.yaml#L741-L744

* Allow for more than one container image (#58)

* fix type, update OpenAPI version

* fix array examples

* Update openapi.yaml

* Rename metadata to match GA4GH PAP (#78)

* Adding zenodo reference as recommended
* Adding contact URL inspired by WES for PAP
From security questionaire

* More convention sync with cloud workstream APIs (#84)

* Working on #82 also see ga4gh/workflow-execution-service-schemas#134
* Working on harmonizing with GA4GH Discovery  https://github.com/ga4gh-discovery/service-info
* Fix description for #85

* WS-2019-0032  WS-2019-0063

* Attempt to fix tooling (#89)

* Update README.md for gitflow (#91)

* Simple PRC feedback PR (#92)

* Feedback from BOSC

* Typo in example (#99)

* Remove service-info to avoid breaking change (#101)

* Move around a few fields (#97)

* test gradle fix for site deployment (#105)

* Optional auth (#102)

* PRC feedback - strict tool versioning for reproducibility (#93)

* Update to match discussion in DRS
checksum object added and tweaked to match ga4gh/data-repository-service-schemas#282

* Corrects link to GA4GH . checksum hash algorithm registry (#106)

* Correcting hash algorithm registry link

* Correcting hash alg registry link

* Fixing unrelated link

* Glitchy badge (#112)

* Update registry.json (#115)

* Fix CI (#123)

Co-authored-by: Denis Yuen <denis.yuen@oicr.on.ca>
Co-authored-by: Yasset Perez-Riverol <ypriverol@gmail.com>
Co-authored-by: Chris Llanwarne <cjllanwarne@users.noreply.github.com>
Co-authored-by: Susheel Varma <susheel@users.noreply.github.com>
Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants