Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Build a LineageDB interface for taxonomy databases/information #1651

Merged
merged 31 commits into from
Jul 8, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
48b3daa
start making a LineageDB
ctb Jul 2, 2021
5cd09c2
further abstract taxonomy loading
ctb Jul 2, 2021
17e29cc
replace rest of taxonomy parsing
ctb Jul 2, 2021
d5f09d0
get some basic sqlite stuff going
ctb Jul 3, 2021
6ad2480
refactoring, simplification, optimization
ctb Jul 3, 2021
407e8cf
fix a few things, alert on bad arg combination
ctb Jul 3, 2021
abeab91
add combine_tax CLI command under tax
ctb Jul 3, 2021
d11911b
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Jul 7, 2021
abd8d82
rename 'combine_tax' to 'prepare'
ctb Jul 7, 2021
77994e4
allow output database type for tax prepare
ctb Jul 7, 2021
ae16c29
format switching now works for sql and csv
ctb Jul 7, 2021
a55efef
adjust loading to try/fail rather than suffix
ctb Jul 7, 2021
941aaea
refactor taxonomy load
ctb Jul 7, 2021
c082a9b
fix tests
ctb Jul 7, 2021
db22ffb
basic tests for in/out formats
ctb Jul 7, 2021
8174530
add tests for trying to load bad files
ctb Jul 7, 2021
41cc317
raise appropriate error message
ctb Jul 7, 2021
4c0859f
fix with pytest.raises foo
ctb Jul 7, 2021
e161da1
some more tests
ctb Jul 7, 2021
4b13f27
more tests, fix bug
ctb Jul 7, 2021
ad8899b
make available_ranks work with sqlite db
ctb Jul 7, 2021
2f0185f
re-add test for available_ranks
ctb Jul 7, 2021
4519a07
produce more useful errors for dups, and restore test code
ctb Jul 7, 2021
3f38145
catch exception for database already exists
ctb Jul 7, 2021
5016b86
test split idents and keep versions
ctb Jul 8, 2021
8bb3145
align keyword args with CLI args
ctb Jul 8, 2021
4d3d7af
add more end-to-end tests
ctb Jul 8, 2021
88f03a2
alias --taxonomy-csv to --taxonomy
ctb Jul 8, 2021
79c9d56
'prepare' docs
ctb Jul 8, 2021
703534a
clean up
ctb Jul 8, 2021
4225f53
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Jul 8, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 43 additions & 16 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -425,11 +425,12 @@ signatures, rather than all the signatures in the database.

The sourmash `tax` or `taxonomy` commands integrate taxonomic
information into the results of `sourmash gather`. All `tax` commands
require one or more properly formatted `taxonomy` csv files that correspond to
the database(s) used for `gather`. Note that if using multiple databases,
the `gather` needs to have been conducted against all desired databases
within the same `gather` command (we cannot combine separate `gather` runs
for the same query). For supported databases (e.g. GTDB, NCBI), we provide
require one or more properly formatted `taxonomy` files where the
identifiers correspond to those in the database(s) used for
`gather`. Note that if using multiple databases, the `gather` needs
to have been conducted against all desired databases within the same
`gather` command (we cannot combine separate `gather` runs for the
same query). For supported databases (e.g. GTDB, NCBI), we provide
taxonomy csv files, but they can also be generated for user-generated
databases. For more information, see [databases](databases.md).

Expand All @@ -454,8 +455,8 @@ As with all reference-based analysis, results can be affected by the
results from `gather` minimizes issues associated with increasing size
and redundancy of reference databases.

For more on how `gather` works and can be used to classify signatures, see
[classifying-signatures](classifying-signatures.md).
For more details on how `gather` works and can be used to classify
signatures, see [classifying-signatures](classifying-signatures.md).


### `sourmash tax metagenome` - summarize metagenome content from `gather` results
Expand All @@ -469,7 +470,7 @@ example command to summarize a single `gather csv`, where the query was gathered
```
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
--taxonomy gtdb-rs202.taxonomy.v2.csv
```

There are three possible output formats, `csv_summary`, `lineage_summary`, and
Expand Down Expand Up @@ -517,7 +518,7 @@ To generate `krona`, we add `--output-format krona` to the command above, and
```
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

Expand Down Expand Up @@ -545,7 +546,7 @@ To generate `lineage_summary`, we add `--output-format lineage_summary` to the s
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

Expand Down Expand Up @@ -602,7 +603,7 @@ We can use `tax genome` on this gather csv to classify our "Sb47+63" mixed-strai
```
sourmash tax genome
--gather-csv 47+63_x_gtdb-rs202.gather.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
--taxonomy gtdb-rs202.taxonomy.v2.csv
```
> This command uses the default classification strategy, which uses a
containment threshold of 0.1 (10%).
Expand Down Expand Up @@ -649,7 +650,7 @@ To generate `krona`, we must classify by `--rank` instead of using the
```
sourmash tax genome
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```
> Note that specifying `--rank` forces classification by rank rather than
Expand Down Expand Up @@ -684,10 +685,35 @@ By default, `annotate` uses the name of each input gather csv to write an update
```
sourmash tax annotate
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
--taxonomy gtdb-rs202.taxonomy.v2.csv
```
> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`

### `sourmash tax prepare` - prepare and/or combine taxonomy files

All `sourmash tax` commands must be given one or more taxonomy files as
parameters to the `--taxonomy` argument. These files can be either CSV
files or (as of sourmash 4.2.1) sqlite3 databases. sqlite3 databases
are much faster for large taxonomies, while CSV files are easier to view
and modify using spreadsheet software.

`sourmash tax prepare` is a utility function that can ingest and validate
multiple CSV files or sqlite3 databases, and output a CSV file or a sqlite3
database. It can be used to combine multiple taxonomies into a single file,
as well as change formats between CSV and sqlite3.

The following command will take in two taxonomy files and combine them into
a single taxonomy sqlite database.

```
sourmash tax prepare --taxonomy file1.csv file2.csv -o tax.db
```

Input databases formats can be mixed and matched, and the output format
can be set to CSV like so:
```
sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv -F csv
```

## `sourmash lca` subcommands for in-memory taxonomy integration

Expand Down Expand Up @@ -1216,9 +1242,10 @@ Briefly,
signature files using `--query-from-file` (see below).

* `index` and `lca index` take a few fixed parameters (database name,
taxonomy spreadsheet) and then an arbitrary number of other files
that contain signatures, including files, directories, and indexed
databases. These commands will also take `--from-file` (see below).
and for `lca index`, a taxonomy file) and then an arbitrary number of
other files that contain signatures, including files, directories,
and indexed databases. These commands will also take `--from-file`
(see below).

None of these commands currently support searching, comparing, or indexing
signatures with multiple ksizes or moltypes at the same time; you need
Expand Down
1 change: 1 addition & 0 deletions src/sourmash/cli/tax/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
from . import metagenome
from . import genome
from . import annotate
from . import prepare
from ..utils import command_list
from argparse import SUPPRESS, RawDescriptionHelpFormatter
import os
Expand Down
2 changes: 1 addition & 1 deletion src/sourmash/cli/tax/annotate.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ def subparser(subparsers):
help='suppress non-error output'
)
subparser.add_argument(
'-t', '--taxonomy-csv', metavar='FILE',
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs="+", required=True,
help='database lineages CSV'
)
Expand Down
2 changes: 1 addition & 1 deletion src/sourmash/cli/tax/genome.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ def subparser(subparsers):
help='suppress non-error output'
)
subparser.add_argument(
'-t', '--taxonomy-csv', metavar='FILE',
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs='+', required=True,
help='database lineages CSV'
)
Expand Down
2 changes: 1 addition & 1 deletion src/sourmash/cli/tax/metagenome.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ def subparser(subparsers):
help='directory for output files'
)
subparser.add_argument(
'-t', '--taxonomy-csv', metavar='FILE',
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs='+', required=True,
help='database lineages CSV'
)
Expand Down
60 changes: 60 additions & 0 deletions src/sourmash/cli/tax/prepare.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
"""combine multiple taxonomy databases into one."""

usage="""

sourmash tax prepare --taxonomy-csv <taxonomy_file> [ ... ] -o <output>

The 'tax prepare' command reads in one or more taxonomy databases
and saves them into a new database. It can be used to combine databases
in the desired order, as well as output different database formats.

Please see the 'tax prepare' documentation for more details:
https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-prepare-prepare-and-or-combine-taxonomy-files
"""

import sourmash
from sourmash.logging import notify, print_results, error


def subparser(subparsers):
subparser = subparsers.add_parser('prepare',
usage=usage)
subparser.add_argument(
'-q', '--quiet', action='store_true',
help='suppress non-error output'
)
subparser.add_argument(
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs="+", required=True,
help='database lineages'
)
subparser.add_argument(
'-o', '--output', required=True,
help='output file',
)
subparser.add_argument(
'-F', '--database-format',
help="format of output file; default is 'sql')",
default='sql',
choices=['csv', 'sql'],
)
subparser.add_argument(
'--keep-full-identifiers', action='store_true',
help='do not split identifiers on whitespace'
)
subparser.add_argument(
'--keep-identifier-versions', action='store_true',
help='after splitting identifiers, do not remove accession versions'
)
subparser.add_argument(
'--fail-on-missing-taxonomy', action='store_true',
help='fail quickly if taxonomy is not available for an identifier',
)
subparser.add_argument(
'-f', '--force', action = 'store_true',
help='continue past errors in file and taxonomy loading',
)

def main(args):
import sourmash
return sourmash.tax.__main__.prepare(args)
111 changes: 60 additions & 51 deletions src/sourmash/tax/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
from sourmash.lca.lca_utils import display_lineage

from . import tax_utils
from .tax_utils import ClassificationResult
from .tax_utils import ClassificationResult, MultiLineageDB

usage='''
sourmash taxonomy <command> [<args>] - manipulate/work with taxonomy information.
Expand Down Expand Up @@ -61,22 +61,16 @@ def metagenome(args):
set_quiet(args.quiet)

# first, load taxonomic_assignments
tax_assign = {}
available_ranks = set()
for tax_csv in args.taxonomy_csv:

try:
this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
force=args.force)
# maybe check for overlapping tax assignments? currently, later ones will override earlier ones
tax_assign.update(this_tax_assign)
available_ranks.update(set(avail_ranks))

except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions,
force=args.force)
available_ranks = tax_assign.available_ranks
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

if not tax_assign:
error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
sys.exit(-1)
Expand All @@ -103,7 +97,7 @@ def metagenome(args):
seen_perfect = set()
for rank in sourmash.lca.taxlist(include_strain=False):
summarized_gather[rank], seen_perfect = tax_utils.summarize_gather_at(rank, tax_assign, gather_results, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
seen_perfect = seen_perfect)

Expand Down Expand Up @@ -139,20 +133,16 @@ def genome(args):
set_quiet(args.quiet)

# first, load taxonomic_assignments
tax_assign = {}
available_ranks = set()
for tax_csv in args.taxonomy_csv:

try:
this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
force=args.force)
# maybe check for overlapping tax assignments? currently later ones will override earlier ones
tax_assign.update(this_tax_assign)
available_ranks.update(set(avail_ranks))
except ValueError as exc:
error(f"ERROR: {str(exc)}")

try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions,
force=args.force)
available_ranks = tax_assign.available_ranks
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

if not tax_assign:
error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
sys.exit(-1)
Expand Down Expand Up @@ -184,7 +174,7 @@ def genome(args):
# if --rank is specified, classify to that rank
if args.rank:
best_at_rank, seen_perfect = tax_utils.summarize_gather_at(args.rank, tax_assign, gather_results, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
best_only=True, seen_perfect=seen_perfect)

Expand All @@ -210,7 +200,7 @@ def genome(args):
for rank in tax_utils.ascending_taxlist(include_strain=False):
# gets best_at_rank for all queries in this gather_csv
best_at_rank, seen_perfect = tax_utils.summarize_gather_at(rank, tax_assign, gather_results, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
best_only=True, seen_perfect=seen_perfect)

Expand Down Expand Up @@ -257,22 +247,16 @@ def annotate(args):
set_quiet(args.quiet)

# first, load taxonomic_assignments
tax_assign = {}
this_tax_assign = None
for tax_csv in args.taxonomy_csv:

try:
this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
force=args.force)
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

# maybe check for overlapping tax assignments? currently later ones will override earlier ones
if this_tax_assign:
tax_assign.update(this_tax_assign)

try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions,
force=args.force)
available_ranks = tax_assign.available_ranks
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

if not tax_assign:
error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
sys.exit(-1)
Expand Down Expand Up @@ -300,12 +284,37 @@ def annotate(args):
for row in gather_results:
match_ident = row['name']
lineage = tax_utils.find_match_lineage(match_ident, tax_assign, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions)
row['lineage'] = display_lineage(lineage)
w.writerow(row)


def prepare(args):
"Combine multiple taxonomy databases into one and/or translate formats."
notify("loading taxonomies...")
try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions)
except ValueError as exc:
error("ERROR while loading taxonomies!")
error(str(exc))
sys.exit(-1)

notify(f"...loaded {len(tax_assign)} entries.")

notify(f"saving to '{args.output}', format {args.database_format}...")
try:
tax_assign.save(args.output, args.database_format)
except ValueError as exc:
error("ERROR while saving!")
error(str(exc))
sys.exit(-1)

notify("done!")


def main(arglist=None):
args = sourmash.cli.get_parser().parse_args(arglist)
submod = getattr(sourmash.cli.sig, args.subcmd)
Expand Down
Loading