Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multigather CSV output uses signature filename as basename. #2328

Open
ctb opened this issue Oct 13, 2022 · 8 comments
Open

multigather CSV output uses signature filename as basename. #2328

ctb opened this issue Oct 13, 2022 · 8 comments
Labels
5.0 issues to address for a 5.0 release revisit_me An issue that needs attention and clarification
Milestone

Comments

@ctb
Copy link
Contributor

ctb commented Oct 13, 2022

In #2321 and #2322 we delve back into multigather... and I remembered how annoying the CSV output is, in that it is output to the signature filename for each query.

At the very least it would be good to have there be an option to put it somewhere else, like an md5sum or something. For 4.x this would be an option and we could make it default for v5.

An alternative is to deprecate multigather per #1614.

@ctb ctb added the 5.0 issues to address for a 5.0 release label Oct 13, 2022
ctb added a commit that referenced this issue Oct 14, 2022
…2322)

This PR fixes #2321 so that more than one output line is placed in the
CSV. Oops!

It also adds a notification of what the CSV output file name is.

Last but not least, it supports `--output-dir` as a way to set the base
path for all output files.

Fixes #2321.

TODO:

- [x] add tests
- [x] make sure filename output behavior is documented
- [x] consider adding an option to have multigather save CSV results in
some other way, like by md5 or ...something. - punted to
#2328
@ctb
Copy link
Contributor Author

ctb commented Oct 15, 2022

should support ident-based output, as well as md5short based output.

@ctb
Copy link
Contributor Author

ctb commented Aug 19, 2023

This is tackled over in #2065 by @olgabot.

A few observations and opinions -

  • code in MRG: Uniquify csv output from multigather #2065 breaks tests & semantic versioning because it adds md5sum willy nilly. Working on that in MRG: fix multigather output by adding md5sum along with -U/--output-add-query-md5sum #2722 where I add -U/--output-add-query-md5sum
  • it doesn't address issues with 'dumb' filenames like '-' either (which is encoded in the tests, but ...seriously, should be changed/checked for).
  • I feel like we should be detecting/flagging file output overwrites anyway?? Some experimental code shows that it actually happens in two of our tests 😱
    • specifically, in test_multigather_metagenome_sbt_query_from_file_with_addl_query and test_multigather_metagenome_query_with_sbt_addl_query, output is overwritten, because the query GCF_000195995.1_ASM19599v1_genomic.fna.gz is in gcf_all.sbt.zip as well.

Provisional resolution per #2722 would be -

  • fail loudly and clearly when overwrites are happening!!
  • support -U/--output-add-query-md5sum
  • handle filename == '-' - this would be a change in behavior.

@olgabot
Copy link
Collaborator

olgabot commented Aug 21, 2023

Yes to these!

Provisional resolution per #2722 would be -

  • fail loudly and clearly when overwrites are happening!!
  • support -U/--output-add-query-md5sum
  • handle filename == '-' - this would be a change in behavior.

@ctb
Copy link
Contributor Author

ctb commented Aug 21, 2023

A few more thoughts on #2722 -

@ctb
Copy link
Contributor Author

ctb commented Aug 22, 2023

Taking a step back - what do we want to be able to do with multigather?

  • analyze large input collections, including singleton collections;
  • be assured we have all of the (distinct?) results somewhere and be able to load them!
  • identify results for a specific query and separate them out from the rest of the results;
  • (maybe) use results from multigather as a picklist?

Things to confirm:

  • as of 4.8.3 (before any of these changes) multigather is "up to date" with gather output
  • query filename matches what's in the nascent docs for multigather (should it be name of file sketched? or name of file from which sketch was loaded?)
  • the filename naming scheme as proposed works for glob-style downstream loading and concatenation

Things to resolve:

  • are we ok with breaking multigather CLI backwards compat? maybe yes: It wasn't documented in any way as of 4.8.3. anyway 🤷
  • what do we do with identical queries (based on md5sum)? do we complain about overwriting content, do we not run them since they've already been run, or do we do something else? and how does this impact downstream parsing?

@bluegenes
Copy link
Contributor

Just adding a vote here for allowing multigather to output single csv and zip files containing information from all query sigs.

  1. downstream gather csv summarization now uses the query information (name, md5sum, etc) to ensure that summarization is only done for the same query.
  2. For matches and unassigned, we could output each to a zipfile, where individual sigs could then be accessed downstream via picklists or split via sig split. Sigs within would still need to be named appropriately.

This would likely be especially useful when dealing with extremely large numbers of queries and/or for contig-level gather.

@ctb
Copy link
Contributor Author

ctb commented Aug 23, 2023

note also connection with contig gather #2564 - sketch genome with --singleton and then multigather => contig gather.

@luizirber luizirber added this to the 5.0 milestone Dec 1, 2023
ctb added a commit that referenced this issue Feb 29, 2024
…t-add-query-md5sum` (#2722)

This PR:
- adds documentation for `multigather` to sourmash docs!
- builds on #2065 /
#2721 so that tests pass.
- adds an option `-U/--output-add-query-md5sum` to `sourmash
multigather`
- adds an option `--force-allow-overwrite-output` to `sourmash
multigather`
- **CHANGES BEHAVIOR** of multigather by treating `query.filename ==
'-'` as if `query.filename` is empty, thus replacing it with md5sum
- **CHANGES BEHAVIOR** of multigather by failing loudly and clearly if
output files are going to be overwritten
- adds `-E/--extension` to allow output to files other than `.sig`

See discussion over in [#2328: `multigather` CSV output uses signature
`filename` as
basename](#2328).

To add:
- [x] tests for `-U`;
- [x] implement and test `-E/--extension`
- [x] implement and test `--force-allow-overwrite-output`
- [x] fix for `query_filename` being None/empty in `-U` branch
- [x] documentation update for changed output behavior for multigather:
'-' => using md5sum
- [x] documentation update for changed output behavior for multigather:
fails if overwrite happens
- [x] fix multigather link in docs

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Olga Botvinnik <olga.botvinnik@gmail.com>
Co-authored-by: Keya Barve <53328492+keyabarve@users.noreply.github.com>
Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com>
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Reiter <taylorreiter@gmail.com>
Co-authored-by: Erik Young <eeyoung@ucdavis.edu>
Co-authored-by: David Koslicki <dmkoslicki@gmail.com>
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Colton Baumler <baumlerc@farm.ucdavis.edu>
Co-authored-by: Luiz Irber <contact+github@luizirber.org>
Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com>
Co-authored-by: Peter Cock <p.j.a.cock@googlemail.com>
Co-authored-by: Francesco Beghini <francesco.beghini@yale.edu>
Co-authored-by: Jason Stajich <jason.stajich@ucr.edu>
Co-authored-by: Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@ctb
Copy link
Contributor Author

ctb commented Feb 29, 2024

#2722 has been merged!

I will look through this issue and extract undone things and useful ruminations into a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.0 issues to address for a 5.0 release revisit_me An issue that needs attention and clarification
Projects
None yet
Development

No branches or pull requests

4 participants