Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scipy linkage called incorrectly #2406

Open
mriglobal opened this issue Dec 14, 2022 · 4 comments
Open

scipy linkage called incorrectly #2406

mriglobal opened this issue Dec 14, 2022 · 4 comments
Assignees
Labels
5.0 issues to address for a 5.0 release

Comments

@mriglobal
Copy link

I might be wrong here, but it looks like the scipy linkage call invoked for the dendrogram is silently producing misleading output. The function signature for linkage expects either a condensed format distance matrix, or a "2D array of observation vectors" which it then passes to the scipy pdist function to generate a distance matrix. The output from sourmash compare is a similarity matrix which must be converted to a distance matrix via 1 - M and then converted to condensed format by the scipy squareform function. Sourmash invokes the linkage function on the similarity matrix directly where it appears to assume this is the "2D array of observation vectors" case, which it is not. I get very different dendrograms by these two methodologies. Is this working as intended?

@ctb
Copy link
Contributor

ctb commented Dec 14, 2022

hmm, that could indeed be a problem!

We do have a --distance output for sourmash compare so it would be an easy fix.

@mriglobal
Copy link
Author

I should point out, that in terms of maintaining meaningful relationships, this doesn't seem to be that impactful. In other words, instances that are similar to one another will also have similar pairwise distance comparisons to other instances.

@ctb
Copy link
Contributor

ctb commented Dec 16, 2022

I should point out, that in terms of maintaining meaningful relationships, this doesn't seem to be that impactful. In other words, instances that are similar to one another will also have similar pairwise distance comparisons to other instances.

that is very good to know!

but it is still concerning!

I'm going to point the clustering mavens at this one and let them work it out - @mr-eyes @bluegenes could you weigh in, pls?

If it turns out to be a situation where we should fix this (as I suspect), I'd also appreciate thoughts on what we should do - should we

ref also detecting oddball matrices per #2228

my hot take based on previous comment that this still maintains relationships is that we should fix it in v5, but it's not a strong opinion.

thanks again for finding this @mriglobal !

@ctb
Copy link
Contributor

ctb commented May 20, 2024

doing what I do best: punting to a different repository 😆

Over in the betterplot plugin, I hope to explore this and get it working "properly" - see sourmash-bio/sourmash_plugin_betterplot#15 - and then backport fixes and verification stuff into sourmash compare.

@ctb ctb added the 5.0 issues to address for a 5.0 release label May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.0 issues to address for a 5.0 release
Projects
None yet
Development

No branches or pull requests

3 participants