[MRG] add abundance-weighted columns to gather output #2249

Merged: 11 commits, Sep 2, 2022

Contributor

@ctb ctb commented Sep 1, 2022

This PR adds three abundance-weighted columns to gather output:

  • n_unique_weighted_found - the summed abundances of the found hashes at each step
  • sum_weighted_found - the running total of n_unique_weighted_found, cumulative at this step
  • total_weighted_hashes - the sum total of all hash abundances
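
As a rough illustration of how the three columns relate, here is a minimal sketch on toy data. It is not sourmash's implementation: the function name `weighted_gather_columns`, the hash values, and the abundances are all hypothetical, and it assumes that each gather step only counts query hashes not already claimed by an earlier match.

```python
# Toy model of the three weighted columns; NOT sourmash's actual code.
# query_abunds maps hash -> abundance; matches is an ordered list of hash sets.
# All names and values here are hypothetical, for illustration only.

def weighted_gather_columns(query_abunds, matches):
    # total_weighted_hashes: sum of all hash abundances in the query
    total_weighted_hashes = sum(query_abunds.values())
    remaining = set(query_abunds)  # query hashes not yet assigned to a match
    sum_weighted_found = 0         # running total across steps
    rows = []
    for match_hashes in matches:
        found = remaining & match_hashes
        # n_unique_weighted_found: summed abundances of hashes found at this step
        n_unique_weighted_found = sum(query_abunds[h] for h in found)
        sum_weighted_found += n_unique_weighted_found
        remaining -= found
        rows.append({
            "n_unique_weighted_found": n_unique_weighted_found,
            "sum_weighted_found": sum_weighted_found,
            "total_weighted_hashes": total_weighted_hashes,
        })
    return rows


query = {1: 5, 2: 1, 3: 2, 4: 2}   # hash -> abundance
matches = [{1, 2}, {2, 3}]         # second match overlaps the first on hash 2
rows = weighted_gather_columns(query, matches)
for row in rows:
    print(row)
```

In this toy case hash 2 is counted only for the first match, so the second step contributes just the abundance of hash 3, and hash 4 is never assigned to any match.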

It also updates the kreport format of sourmash tax metagenome to output weighted bp estimates.

Fixes #2240.

TODO

  • add loading and error-message-if-missing to sourmash tax
  • update kreport format from [MRG] add kreport output format to tax metagenome #2239 to support weighted reporting
  • implement & test what happens to these values with --ignore-abundance
  • clean up redundant total_abund in GatherResult
  • revisit GatherResult and FracMinHashComparison classes for other cleanup
  • consider building more robust/bigger tests of the weighting 🙄

@codecov

codecov bot commented Sep 1, 2022

Codecov Report

Merging #2249 (612157c) into latest (db8ca4a) will increase coverage by 0.00%.
The diff coverage is 96.29%.

@@           Coverage Diff           @@
##           latest    #2249   +/-   ##
=======================================
  Coverage   84.84%   84.85%           
=======================================
  Files         131      131           
  Lines       15653    15664   +11     
  Branches     2245     2249    +4     
=======================================
+ Hits        13281    13291   +10     
  Misses       2082     2082           
- Partials      290      291    +1     
| Flag | Coverage Δ |
|---|---|
| python | 92.19% <96.29%> (-0.01%) ⬇️ |
| rust | 65.29% <ø> (ø) |


| Impacted Files | Coverage Δ |
|---|---|
| src/sourmash/tax/tax_utils.py | 98.32% <94.11%> (-0.14%) ⬇️ |
| src/sourmash/search.py | 97.95% <100.00%> (+0.01%) ⬆️ |


@ctb
Contributor Author

ctb commented Sep 1, 2022

@bluegenes ok, updated. On inspection I think that perhaps it is/was possible to calculate some or all of this stuff before from the CSV output - test test_gather_abund_10_1 in test_sourmash.py seems to do so, at any rate! - but I like the new column names.

If you get a chance to look at this, would appreciate your interim stamp of approval :). I might not get to work on this more until later today.

@bluegenes
Contributor

> @bluegenes ok, updated. On inspection I think that perhaps it is/was possible to calculate some or all of this stuff before from the CSV output - test test_gather_abund_10_1 in test_sourmash.py seems to do so, at any rate! - but I like the new column names.
>
> If you get a chance to look at this, would appreciate your interim stamp of approval :). I might not get to work on this more until later today.

STAMP - the new columns are great and I think they'll be really helpful!

Yeah, regarding already being able to calculate most things: the main issue was needing total_weighted_missed for the unclassified portion, which I wasn't sure how to calculate without outputting the total weighted query hashes from gather.
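
For illustration, here is one way the unclassified weighted portion could be derived once total_weighted_hashes is in the gather CSV. This is a sketch under assumptions: the CSV excerpt and its values are made up, `weighted_unclassified` is a hypothetical helper rather than a sourmash function, and total_weighted_missed is computed here, not read from a column.

```python
import csv
import io

# Hypothetical gather CSV excerpt; values are invented and only the
# weighted columns discussed in this PR (plus a name column) are shown.
GATHER_CSV = """name,n_unique_weighted_found,sum_weighted_found,total_weighted_hashes
genomeA,6,6,10
genomeB,2,8,10
"""


def weighted_unclassified(csv_text):
    """Return (total_weighted_missed, fraction) from the last gather row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no gather results to summarize")
    last = rows[-1]  # sum_weighted_found is cumulative, so use the final step
    total = int(last["total_weighted_hashes"])
    found = int(last["sum_weighted_found"])
    missed = total - found  # abundance-weighted hashes matched by nothing
    return missed, missed / total


missed, fraction = weighted_unclassified(GATHER_CSV)
print(missed, fraction)
```

Because sum_weighted_found is a running total, only the final row is needed; the earlier rows matter only when reporting per-match weighted values.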

@ctb
Contributor Author

ctb commented Sep 2, 2022

@bluegenes I think I'm missing one thing before I can properly finish this PR off - I need to regenerate tests/test-data/tax/test1.gather.csv, but I don't have test1.sig handy. Do you know where this is?

@bluegenes
Contributor

bluegenes commented Sep 2, 2022

> @bluegenes I think I'm missing one thing before I can properly finish this PR off - I need to regenerate tests/test-data/tax/test1.gather.csv, but I don't have test1.sig handy. Do you know where this is?

I believe it was this one, just renamed: https://github.com/taylorreiter/2021-sourmash-taxonomy-hackathon/tree/main/outputs/sigs

...realizing now I probably should have added it, sorry!

test1.gather.csv is this file, renamed: https://github.com/taylorreiter/2021-sourmash-taxonomy-hackathon/blob/main/outputs/gather/HSMA33MX_gather_x_gtdbrs202_k31.csv

@ctb ctb changed the title [WIP] add abundance-weighted columns to gather output [MRG] add abundance-weighted columns to gather output Sep 2, 2022
@ctb
Contributor Author

ctb commented Sep 2, 2022

I think this is ready for review @bluegenes.

Contributor

@bluegenes bluegenes left a comment

looks good to me. thanks for adding!

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Successfully merging this pull request may close these issues.

report abundance weighted hashes/bp from gather and tax?
2 participants