gather behavior around identical sketch matches is confusing and ~error-prone #2319

Open
ctb opened this issue Oct 6, 2022 · 0 comments
ctb commented Oct 6, 2022

#2318 was discovered while tracking down a different question: @bluegenes was getting unexpectedly inaccurate results from a Zymo mock community analysis, and she wanted to understand why.

In brief, two genomes were producing identical sketches: GCA_902364275.1 (Faecalibacterium prausnitzii, MGYG-HGUT-00195) and GCA_003478405.1 (Faecalibacterium sp. AF28-13AC strain=AF28-13AC). As expected, sourmash prefetch was finding both of them, while gather was picking only one.

If gather picked the first, the species was "correctly" assigned as Faecalibacterium prausnitzii, because that's the NCBI taxonomy entry for this genome and the mock community composition indicated it contained F. prausnitzii. If it picked the second, the correct species was missed.
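To make the failure mode concrete, here is a toy sketch in plain Python. It is not sourmash's implementation: `greedy_gather` is a hypothetical stand-in for a greedy set-cover loop, the hash values are made up, and "first seen wins a tie" is an assumption chosen only to show that some tie-breaking rule has to apply when two references have identical hash sets.

```python
def greedy_gather(query, refs):
    """Toy greedy set cover over hash sets (NOT sourmash's actual code).

    Repeatedly pick the reference with the largest overlap with the
    still-unexplained query hashes. On a tie, the first one seen wins
    (an assumption made for this illustration).
    """
    remaining = set(query)
    lookup = dict(refs)
    picked = []
    while remaining:
        best_name, best_overlap = None, 0
        for name, hashes in refs:
            overlap = len(remaining & hashes)
            if overlap > best_overlap:  # strict '>' keeps the first of a tie
                best_name, best_overlap = name, overlap
        if best_name is None:
            break  # nothing left in the database explains the remainder
        picked.append(best_name)
        remaining -= lookup[best_name]
    return picked


# Made-up hashes: the two references have exactly the same sketch.
query = {1, 2, 3, 4}
refs = [
    ("GCA_902364275.1", {1, 2, 3}),
    ("GCA_003478405.1", {1, 2, 3}),
]

first_order = greedy_gather(query, refs)
reversed_order = greedy_gather(query, list(reversed(refs)))
print(first_order, reversed_order)
```

In this toy version the reported match flips with database order, and the other genome never shows up because its hashes were already subtracted from the query.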

See #1615 and the idea for making identical matches accessible. It is not clear to me how to properly handle this in gather output, since sourmash tax only sees gather output. Maybe we could provide multiple entries per rank where appropriate and mark the extras as "secondary", and then sourmash tax could choose the one with the most specific taxonomy, or something? Ugh.
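A rough sketch of what the "secondary" idea could look like, in plain Python and under loud assumptions: `group_identical` and `mark_secondary` are hypothetical helpers, not sourmash functions; the hashes are made up; the lineage tuples are placeholders rather than real NCBI lineages; and "longest lineage" is just one possible reading of "most specific".

```python
from collections import defaultdict


def group_identical(matches):
    """Group (name, hashes) pairs whose hash sets are exactly equal."""
    groups = defaultdict(list)
    for name, hashes in matches:
        groups[frozenset(hashes)].append(name)
    return list(groups.values())


def mark_secondary(group, lineages):
    """Label the name with the longest lineage 'primary', the rest 'secondary'.

    Lineage length as a proxy for specificity is an assumption made here.
    """
    ordered = sorted(group, key=lambda n: len(lineages[n]), reverse=True)
    return [
        (name, "primary" if i == 0 else "secondary")
        for i, name in enumerate(ordered)
    ]


# Made-up hashes; the first two entries share an identical sketch.
matches = [
    ("GCA_003478405.1", {11, 22, 33}),
    ("GCA_902364275.1", {11, 22, 33}),
    ("other_genome", {44, 55}),
]

# Placeholder lineages for illustration only (not real taxonomy records).
lineages = {
    "GCA_902364275.1": ("genus_placeholder", "species_placeholder"),
    "GCA_003478405.1": ("genus_placeholder",),
    "other_genome": ("other_genus", "other_species"),
}

groups = group_identical(matches)
labeled = mark_secondary(groups[0], lineages)
print(labeled)
```

The point is only that identical matches stay together in the output, so a downstream step such as sourmash tax could see every candidate instead of whichever one gather happened to pick.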
