MRG: Add graph-based clustering #234
Adds ANI columns to pairwise and multisearch output, building off of @mr-eyes's ANI translations (#188), which got streamlined + added into sourmash core in sourmash-bio/sourmash#2943.

Split from #234 to make things more concise/simpler.

## benchmark summary

## 12k ICTV viral genomes, scaled=200

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI | 21s |
| PR | with ANI | 20s |
| v0.9.0 | no ANI | 21s |

## 12k ICTV viral genomes, scaled=10

79.8m comparisons

| version | experiment | time |
| -------- | -------- | -------- |
| PR | no ANI | 14m 0s |
| PR | with ANI | 14m 6s |
| main branch (~v0.9.0) | no ANI | 14m 47s |
yes.
done, sigh.
more biologist friendly and similar to what many other tools report. e.g. when outputting kraken report format in sourmash, i convert to a percent to keep as close as possible to the original format.
😭 much appreciated
I don't think we need to be particularly biologist friendly in our raw CSVs. Syntactic and semantic mismatches are a huge problem for our own internal toolset tho!
I am afraid this will introduce inconsistencies/confusion in handling the output files. We will need to take care of this information when doing any downstream processing.
I previously (in ANI PR) had it all in percentage, including the CLI for cluster. No matter, it is all fractions now :). minor regret about my initial decision when I added ANI to sourmash 🤷🏻‍♀️😅
lol well if we're going to talk about regrets for early design decisions I'm sure I have a list somewhere...
I ran A few comments/questions/requests:
Other thoughts:
What I ran
(and, I mean, y'know - nice work! :)
This was initially confusing to me, as cluster keeps all nodes it sees (only doesn't add edges if they don't meet threshold). But of course, sketches with no similarity at all to other sketches will not appear in pairwise at all. It is probably worth evaluating the performance of writing self similarity to make cluster output robust. Otherwise I would probably default to using
exactly! 😄 This is why I think a note in the docs (either in pairwise, or in cluster, or both?) is a good idea. If it is a persistent confusion we can always add an option to pairwise.
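To make the point above concrete: a sketch with no similarity to any other sketch never appears in a pairwise row, so anything built only from those rows cannot know it exists. Below is a minimal Python illustration, not code from this PR. The column names `query_name` and `match_name`, the helper name, and the idea of having a separate list of all sketch names are assumptions made for the example.

```python
def names_missing_from_pairwise(all_names, rows):
    """Return names from all_names that occur in no pairwise row.

    `rows` is an iterable of dicts; the keys `query_name` and `match_name`
    are assumed column names, used here only for illustration.
    """
    seen = set()
    for row in rows:
        seen.add(row["query_name"])
        seen.add(row["match_name"])
    # Preserve the input order so the result is deterministic.
    return [name for name in all_names if name not in seen]
```

Any name returned here would be a singleton that a clustering step reading only the pairwise CSV would silently drop.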
I think this is ready to go, right?
⭐
just need to run the benchmarking!
you can punt to an issue... ;)
will do!
This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files:
- `Component_X, name1;name2;name3...`
- `cluster_size, count`
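The approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PR's implementation: it swaps `rustworkx-core` for a plain-Python connected-components search, and the column names (`query_name`, `match_name`), the strict `>` comparison, the 1-based `Component_` numbering, and all function names are assumptions.

```python
from collections import Counter


def cluster_pairs(rows, similarity_column, threshold):
    """Connected components of a thresholded similarity graph."""
    neighbors = {}
    for row in rows:
        a, b = row["query_name"], row["match_name"]  # assumed column names
        # Register both names as nodes even if no edge passes the threshold,
        # so singletons survive into the output.
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        # Assumption: "exceeds" is read as a strict comparison.
        if a != b and float(row[similarity_column]) > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:  # iterative depth-first search
            node = stack.pop()
            members.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(members))
    return components


def cluster_lines(components):
    """One line per component: Component_X,name1;name2;..."""
    return [f"Component_{i},{';'.join(m)}" for i, m in enumerate(components, 1)]


def size_histogram(components):
    """(cluster_size, count) pairs, sorted by cluster size."""
    return sorted(Counter(len(m) for m in components).items())
```

Because nodes are registered before the threshold check, a name that only appears in below-threshold rows still ends up as its own one-member component.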
context for some things I tried:
Punted Issues:
- `cluster`: develop downstream usage and visualization (#248)
- `cluster`: enable updating clusters (#249)
- `cluster` (benchmark `pairwise` --> `cluster`, #247)

Related issues: