Add a post-data report generation to cam-pipeline #151

Open
gaurav opened this issue Jun 22, 2024 · 1 comment

@gaurav
Member

gaurav commented Jun 22, 2024

This would run after kg.tsv has been generated and produce a report so we can confirm the file was generated correctly. At its simplest, it could check that the number of rows is approximately 11,336,863 (the count from the last generation).
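A minimal sketch of that row-count check, written in Python purely for illustration (it is not the Scala script proposed further down). The 5% tolerance and the function names are assumptions, not values from the pipeline; only the expected count comes from the last generation.

```python
EXPECTED_ROWS = 11_336_863  # row count from the last generation of kg.tsv
TOLERANCE = 0.05            # assumed: accept up to 5% drift in either direction


def count_rows(lines):
    """Count non-blank lines from any iterable of strings, e.g. an open file."""
    return sum(1 for line in lines if line.strip())


def row_count_ok(actual, expected=EXPECTED_ROWS, tolerance=TOLERANCE):
    """Return True if `actual` is within `tolerance` (a fraction) of `expected`."""
    return abs(actual - expected) <= expected * tolerance
```

Passing an open file handle (`with open("kg.tsv") as f: count_rows(f)`) streams the file line by line, so the whole file never needs to fit in memory.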

Some other stats that might be useful to track:

  • Predicates by count
  • Number of nodes with direct types
  • Number of nodes by Biolink type
  • Number of edges by type (e.g. how many biolink:Gene --[GO:1234]--> biolink:Protein edges we have)
  • Some example nodes and edges
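As a rough sketch of how two of these stats could be computed (in Python, for illustration only): this assumes each row of kg.tsv is a tab-separated subject, predicate, object triple, which may not match the real layout, and it assumes a hypothetical `node_types` mapping from node ID to Biolink type is available from somewhere else in the pipeline.

```python
from collections import Counter


def parse_edges(lines):
    """Yield (subject, predicate, object) from tab-separated lines.

    Assumed layout: the first three columns are subject, predicate, object.
    Blank lines and rows with fewer than three columns are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            yield parts[0], parts[1], parts[2]


def predicate_counts(edges):
    """Count edges per predicate."""
    return Counter(predicate for _, predicate, _ in edges)


def edge_type_counts(edges, node_types):
    """Count edges per (subject type, predicate, object type).

    `node_types` is a hypothetical dict of node ID -> Biolink type;
    nodes missing from it are reported as "untyped".
    """
    counts = Counter()
    for subj, predicate, obj in edges:
        key = (node_types.get(subj, "untyped"), predicate, node_types.get(obj, "untyped"))
        counts[key] += 1
    return counts
```

`predicate_counts(...).most_common()` gives the "predicates by count" listing directly, sorted from most to least frequent.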

The main use of this report would be to make sure that we don't make a change that gets rid of a particular type of edge. Once we add qualifiers (#145), we could add a qualifier report as well to see how much detail we're adding.
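One way to catch that kind of regression would be to diff the edge-type counts from the previous report against the current one. The sketch below (Python, for illustration only) assumes each report is a dict keyed by a hypothetical (subject type, predicate, object type) tuple; the function name is made up for this example.

```python
def missing_edge_types(previous, current):
    """Return edge types present in `previous` but absent (or zero) in `current`.

    Both arguments are dicts mapping an edge-type key to its edge count.
    """
    return sorted(
        key
        for key, count in previous.items()
        if count > 0 and current.get(key, 0) == 0
    )
```

A non-empty result could fail the pipeline run or simply be listed at the top of the report.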

We could implement this as a Scala script; it should be straightforward to do with ZStream.

@gaurav gaurav added this to the Not urgent milestone Jun 22, 2024
@balhoff
Contributor

balhoff commented Jun 24, 2024

Some of these stats might be most efficient to calculate in the Soufflé script.
