This repository describes the components that take part in the Team Calcium NIH Data Commons Pilot (and beyond).
Note: if you are viewing this on GitHub, the images may be cached. For the latest version, please visit:
https://david4096.github.io/data-platforms/
For more background read the Data Biosphere post.
Visit the DataBiosphere GitHub organization.
The various components coordinate to create a platform useful for data analysis.
The Digital Object Catalog provides clients and services with access to resources available in object stores. Digital objects can be files; the catalog maintains a registry of locations where the files can be found, along with minimal metadata.
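To make the registry idea concrete, here is a minimal in-memory sketch. The `DigitalObject` and `DigitalObjectCatalog` names, the record fields, and the bucket URLs are all assumptions made for illustration; they are not taken from any of the implementations listed in this document.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical catalog record: where a file lives plus minimal metadata."""
    object_id: str
    urls: list                                      # object-store locations
    size: int = 0                                   # size in bytes
    checksums: dict = field(default_factory=dict)   # e.g. {"md5": "..."}


class DigitalObjectCatalog:
    """In-memory stand-in for a catalog service (illustrative only)."""

    def __init__(self):
        self._records = {}

    def register(self, obj):
        # Refuse to silently overwrite an existing record.
        if obj.object_id in self._records:
            raise ValueError(f"already registered: {obj.object_id}")
        self._records[obj.object_id] = obj

    def add_location(self, object_id, url):
        # The same object may be replicated across several object stores.
        self._records[object_id].urls.append(url)

    def get(self, object_id):
        return self._records[object_id]


catalog = DigitalObjectCatalog()
catalog.register(
    DigitalObject("obj-1", ["s3://example-bucket/sample.bam"], size=1024)
)
catalog.add_location("obj-1", "gs://example-bucket/sample.bam")
```

The catalog stores only locations and a small amount of metadata; the file contents stay in the object stores.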
The GUID Resolver allows globally unique identifiers (GUIDs) to be "resolved" to digital objects. For more information, please refer to Identifier Interoperability.
Identifiers can be given different namespaces or "prefixes". The namespace service allows commons members to easily manage GUIDs across projects and domains. For more information please refer to Identifier Interoperability.
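As a rough sketch of how prefixes and resolution could fit together, the example below routes an identifier to a resolver based on its prefix. The `prefix:local_id` format, the `NamespaceRegistry` class, and the `demo` prefix are assumptions for illustration only; the actual identifier scheme is described in Identifier Interoperability.

```python
class NamespaceRegistry:
    """Maps a hypothetical prefix to the resolver responsible for it."""

    def __init__(self):
        self._resolvers = {}

    def register(self, prefix, resolver):
        # `resolver` is any callable that takes a local id and returns a record.
        self._resolvers[prefix] = resolver

    def resolve(self, guid):
        # Assumed format: "<prefix>:<local id>".
        prefix, sep, local_id = guid.partition(":")
        if not sep or not local_id:
            raise ValueError(f"expected 'prefix:local_id', got {guid!r}")
        if prefix not in self._resolvers:
            raise LookupError(f"unknown prefix: {prefix!r}")
        return self._resolvers[prefix](local_id)


# A stand-in resolver backed by a dictionary instead of a real service.
demo_objects = {"123": {"urls": ["s3://example-bucket/sample.bam"]}}

registry = NamespaceRegistry()
registry.register("demo", demo_objects.__getitem__)

record = registry.resolve("demo:123")
```

Each project can register its own prefix, so identifiers minted in different domains do not collide.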
To guarantee authority and authenticity of requests, some access control services are provided. These services will at least be able to identify a client and delegate authority to the access control system of choice.
The Analytical Engine is software that orchestrates and executes computational tasks in heterogeneous computing environments.
The Tool Repository is a resource containing templates of reusable computational tasks that can be directed at new data and then executed by the Analytical Engine.
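The relationship between a task template and an engine can be sketched as follows. `ToolTemplate`, `run_all`, and the `{path}` placeholder are hypothetical names, and the "engine" here is just a callable that receives the bound command; real engines and tool repositories use richer workflow descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolTemplate:
    """Hypothetical reusable task: a command line with {placeholders}."""
    name: str
    argv: tuple

    def bind(self, **inputs):
        # Fill each placeholder with a concrete value for one dataset.
        return [arg.format(**inputs) for arg in self.argv]


def run_all(template, datasets, execute):
    """Direct one template at many inputs; `execute` stands in for the engine."""
    return [execute(template.bind(**inputs)) for inputs in datasets]


line_count = ToolTemplate("line-count", ("wc", "-l", "{path}"))

# A stub engine that only records what it would run.
submitted = run_all(
    line_count,
    [{"path": "a.txt"}, {"path": "b.txt"}],
    execute=lambda argv: " ".join(argv),
)
```

Keeping the template separate from its inputs is what lets the same task be reused on new data.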
Clients accessing a commons infrastructure should be able to manage data for secondary and tertiary data analysis.
Data in commons infrastructure should be findable using Search mechanisms. Indexing makes data available for search.
A controlled vocabulary informs indexers and/or querying applications so that metadata is usable.
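One way a controlled vocabulary can inform an indexer is by checking metadata before it is accepted. The sketch below assumes a vocabulary expressed as a mapping from field names to allowed values; the field names and values are made up for illustration.

```python
# Hypothetical vocabulary: each field lists the values it may take.
VOCABULARY = {
    "sample_type": {"tumor", "normal"},
    "file_format": {"bam", "vcf", "fastq"},
}


def validate(record, vocabulary):
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for field_name, value in record.items():
        if field_name not in vocabulary:
            errors.append(f"unknown field: {field_name}")
        elif value not in vocabulary[field_name]:
            errors.append(f"invalid value for {field_name}: {value}")
    return errors


ok = validate({"sample_type": "tumor", "file_format": "bam"}, VOCABULARY)
problems = validate({"sample_type": "tumour", "colour": "red"}, VOCABULARY)
```

Here `ok` is empty, while `problems` reports the misspelled value and the unknown field.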
Metadata made available by a platform is indexed into a store. Indexers allow data to be made findable using a structured document scheme.
Once metadata have been indexed into a platform, these indices are made available by services that allow queries to be formed against the metadata.
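The indexing and querying steps can be illustrated with a small in-memory index. This is a sketch under simplifying assumptions (flat documents, exact-match filters, hashable values, each document indexed once); `MetadataIndex` and the sample fields are hypothetical and do not reflect how the listed implementations store or query metadata.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy index: maps (field, value) pairs to the documents containing them."""

    def __init__(self):
        self._docs = {}
        self._index = defaultdict(set)

    def index(self, doc_id, document):
        # Assumes each doc_id is indexed once; re-indexing would leave
        # stale entries behind in this simplified version.
        self._docs[doc_id] = document
        for field_name, value in document.items():
            self._index[(field_name, value)].add(doc_id)

    def query(self, **filters):
        """Return sorted ids of documents matching every field=value filter."""
        if not filters:
            return sorted(self._docs)
        matches = [self._index.get(pair, set()) for pair in filters.items()]
        return sorted(set.intersection(*matches))


idx = MetadataIndex()
idx.index("obj-1", {"sample_type": "tumor", "file_format": "bam"})
idx.index("obj-2", {"sample_type": "normal", "file_format": "bam"})
idx.index("obj-3", {"sample_type": "tumor", "file_format": "vcf"})

bam_files = idx.query(file_format="bam")
tumor_bams = idx.query(file_format="bam", sample_type="tumor")
```

The indexer does its work once, up front, so the query service only has to intersect precomputed sets.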
Commons infrastructure should provide interfaces to make data easily findable. Once data has been found in a portal, it can be added to a workspace.
Applications combine a variety of Commons components to carry out specific tasks.
Links to source code repositories for implementations are provided below:
Component | Broad | UChicago CDIS | UCSC CGP |
---|---|---|---|
Digital Object Catalog | | | |
GUID Resolver | | indexd | dos-azul-lambda |
Namespace Service | | indexd | |
Access Control | | | |
Authorization | sam, bond | fence | |
Authentication | sam, bond | fence | |
Analytical Engine | Cromwell | | toil |
Tool Repository | Agora | | Dockstore |
Workspaces | Firecloud | jupyterhub | |
Indexing and Search | | | |
Ontology | | datadictionary | |
Metadata Indexer | | sheepdog | cgp-dss-azul-indexer |
Metadata Querying | | peregrine | cgp-dashboard-service |
Portal | | windmill | boardwalk |
Application | | | xena |
The University of Chicago CDIS group provides software for managing the submission and access control of bioinformatics and medical informatics data in cloud environments.
This document is under active development. If you feel misrepresented or something has been miscommunicated, please open an issue or make a Pull Request!
The program used to edit the "dia" files is dia.
GitHub caches images when it displays READMEs, so be sure to check the actual file if an image seems out of date!