-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scipy linkage called incorrectly #2406
Comments
hmm, that could indeed be a problem! We do have a |
I should point out, that in terms of maintaining meaningful relationships, this doesn't seem to be that impactful. In other words, instances that are similar to one another will also have similar pairwise distance comparisons to other instances. |
that is very good to know! but it is still concerning! I'm going to point the clustering mavens at this one and let them work it out - @mr-eyes @bluegenes could you weigh in, pls? If it turns out to be a situation where we should fix this (as I suspect), I'd also appreciate thoughts on what we should do - should we
ref also detecting oddball matrices per #2228 my hot take based on previous comment that this still maintains relationships is that we should fix it in v5, but it's not a strong opinion. thanks again for finding this @mriglobal ! |
doing what I do best: punting to a different repository 😆 Over in the |
I might be wrong here, but it looks like the scipy linkage call invoked for the dendrogram is silently producing misleading output. The function signature for linkage expects either a condensed format distance matrix, or a "2D array of observation vectors" which it then passes to the scipy pdist function to generate a distance matrix. The output from sourmash compare is a similarity matrix which must be converted to a distance matrix via 1 - M and then converted to condensed format by the scipy squareform function. Sourmash invokes the linkage function on the similarity matrix directly where it appears to assume this is the "2D array of observation vectors" case, which it is not. I get very different dendrograms by these two methodologies. Is this working as intended?
The text was updated successfully, but these errors were encountered: