Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added suffix collapse #173

Merged
merged 7 commits into from
Dec 23, 2022
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions doc/cli.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ Option | Description
`--filext`, `-e` | Input filename extension following sample ID.
`--samples`, `-s` | List of sample IDs to be included.
`--demux/--no-demux` | Demultiplex alignment by first underscore in query identifier.
`--trim-sub` | Trim subject IDs at the last given delimiter.
`--trim-sub` | Trim subject IDs at the last given delimiter. Can accept the default value "_" or enter a custom value.

### Hierarchies

Expand Down Expand Up @@ -145,7 +145,8 @@ Option | Description
`--map`, `-m` (required) | Path to mapping of source features to target features.
`--output`, `-o` (required) | Path to output profile.
`--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
`--field`, `-f` | Index of field to be collapsed in a stratified profile. For example, use `-f 2` to collapse "gene" in "microbe\|gene".
`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "microbe\|gene", one can use `-f 1` to collapse "microbe" or `-f 2` to collapse "gene".
`--suffix`, `-s` | Features have suffixes that indicate parent-child relationships. For example, "A_1" represents "1" of "A", and the entire feature is equivalent to "A\|A_1". Can accept the default delimiter "_" or enter a custom delimiter. This parameter overrides the "\|"-delimited strata.
`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.

### Coverage
Expand Down
6 changes: 5 additions & 1 deletion doc/input.md
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,10 @@ woltka classify \

## Subject trimming

The parameter `--trim-sub <delim>` lets Woltka trim subject IDs at the last occurrence of the given delimiter. Examples include trimming off version numbers from NCBI accessions (`--trim-sub .`: `NP_123456.1` => `NP_123456`, trimming off ORF indices from nucleotide sequences (`--trim-sub _`: `Contig_5_20` => `Contig_5`).
The parameter `--trim-sub <delim>` lets Woltka trim subject IDs at the last occurrence of the given delimiter (default: "_"). For examples:

1. Trim off ORF indices from nucleotide sequences: Add `--trim-sub` (without a value). Effect: "Contig_5_20" becomes "Contig_5".

2. Trim off version numbers from NCBI accessions: Add `--trim-sub .` (dot). Effect: "NP_123456.1" becomes "NP_123456".

This allows flexible adaptation to alternative classification systems without manually editing the alignment files. A use case is the stratified structural/functional classification (see [details](stratify.md)).
4 changes: 2 additions & 2 deletions doc/stratify.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,12 +66,12 @@ Firmicutes\|GO:0000006 | 0 | 0 | 0 | 18

With this method, we start with the alignments of reads against reference **genes** (not genomes). This methods is faster, with a cost in accuracy (since genes only represent a proportion of their host genomes, and are confounded by homology, copy number, boundaries etc.). Therefore, this methods only demonstrates what "_you could_". However, if you already obtained and decided to proceed with gene alignments (for example, you already ran BLAST on UniProt), this method is worth consideration.

The following example is also based on the [sample data](../woltka/tests/data). Here we used the gene alignment results computed by DIAMOND. The gene IDs are Prodigal-annotated ORFs from the reference genomes, in the format of `genome ID <underscore> index` (e.g., `G000123456_20`). With parameter `--trim-sub _`, Woltka converts them to genome IDs, and everything after is straightforward.
The following example is also based on the [sample data](../woltka/tests/data). Here we used the gene alignment results computed by DIAMOND. The gene IDs are Prodigal-annotated ORFs from the reference genomes, in the format of `genome ID <underscore> index` (e.g., `G000123456_20`). With flag `--trim-sub`, Woltka converts them to genome IDs, and everything after is straightforward.

```bash
woltka classify \
-i align/diamond \
--trim-sub _ \
--trim-sub \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
Expand Down
56 changes: 53 additions & 3 deletions woltka/biom.py
Original file line number Diff line number Diff line change
Expand Up @@ -194,7 +194,46 @@ def biom_add_metacol(table: biom.Table, dic, name, missing=''):
table.add_metadata(metadata, axis='observation')


def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
def clip_biom(table: biom.Table, sep='_'):
"""Remove suffix from feature names in a BIOM table.

Parameters
----------
table : biom.Table
Table to collapse.
sep : str, optional
Separator (after last of which is suffix).

Returns
-------
biom.Table
Clipped BIOM table.

Raises
------
ValueError
A feature ID does not have a suffix.

Notes
-----
Metadata will not be retained in the collapsed table.

See Also
--------
.table.clip_table
"""
def f(id_, _):
left, _, _ = id_.rpartition(sep)
if not left:
raise ValueError(f'Feature "{id_}" does not have a suffix.')
return left

return table.collapse(f, norm=False, axis='observation',
include_collapsed_metadata=False)


def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None,
suffix=None):
"""Collapse a BIOM table in many-to-many mode.

Parameters
Expand All @@ -207,6 +246,8 @@ def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
Whether divide per-target counts by number of targets per source.
field : int, optional
Index of field to be collapsed in a stratified table.
suffix : str, optional
Feature names have a suffix following this delimiter.

Returns
-------
Expand All @@ -216,7 +257,7 @@ def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
Raises
------
ValueError
Field index is not present in a feature ID.
A feature ID does not have a suffix or a field.

Notes
-----
Expand All @@ -230,7 +271,16 @@ def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
metadata = {}
for id_ in table.ids('observation'):
feature = id_
if field is not None:
if suffix:
left, _, _ = feature.rpartition(suffix)
if not left:
raise ValueError(
f'Feature "{feature}" does not have a suffix.')
if field is not None:
fields = [left, feature]
if not field:
feature = left
elif field is not None:
fields = feature.split('|')
try:
feature = fields[field]
Expand Down
23 changes: 15 additions & 8 deletions woltka/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,8 +61,9 @@ def cli():
'--demux/--no-demux', default=None,
help='Demultiplex alignment by first underscore in query identifier.')
@click.option(
'--trim-sub', 'trimsub',
help='Trim subject IDs at the last given delimiter.')
'--trim-sub', 'trimsub', is_flag=False, flag_value='_',
help=('Trim subject IDs at the last given delimiter. Default: "_", or '
'enter a custom value.'))
# hierarchies
@click.option(
'--nodes', 'nodes_fps', type=click.Path(exists=True), multiple=True,
Expand Down Expand Up @@ -260,22 +261,28 @@ def merge_cmd(ctx, **kwargs):
'--input', '-i', 'input_fp', required=True,
type=click.Path(exists=True, dir_okay=False),
help='Path to input profile.')
@click.option(
'--map', '-m', 'map_fp', required=True,
type=click.Path(exists=True, dir_okay=False),
help=('Mapping of source features to target features. (supports '
'many-to-many relationships).'))
@click.option(
'--output', '-o', 'output_fp', required=True,
type=click.Path(writable=True, dir_okay=False),
help='Path to output profile.')
@click.option(
'--map', '-m', 'map_fp',
type=click.Path(exists=True, dir_okay=False),
help=('Mapping of source features to target features. Supports '
'many-to-many relationships. Required unless with "-s"'))
@click.option(
'--divide', '-d', is_flag=True,
help=('Count each target feature as 1/k (k is the number of targets '
'mapped to a source). Otherwise, count as one.'))
@click.option(
'--field', '-f', type=click.INT,
help='Index of field to be collapsed in a stratified profile.')
help=('Features are stratified (strata delimited by "|"), and the x-th '
'field is to be collapsed, while other fields stay the same.'))
@click.option(
'--suffix', '-s', is_flag=False, flag_value='_',
help=('Features have suffixes that indicate parent-child relationships. '
'For example, "A_1" represents "1" of "A". Enter an delimiter if '
'not "_". Overrides "|"-delimited strata.'))
@click.option(
'--names', '-n', 'names_fp', type=click.Path(exists=True),
help='Names of target features to append to the output profile.')
Expand Down
4 changes: 2 additions & 2 deletions woltka/ordinal.py
Original file line number Diff line number Diff line change
Expand Up @@ -628,8 +628,8 @@ def match_read_gene_quart(geneque, readque):

# when gene ends, check overlap and remove it from cache
# gene end must be <= read end
# if gene not found in cache (meaning gene started before read
# region), use read start, otherwise use gene start
# if gene not found in cache (gene started before read region),
# use read start, otherwise use gene start
elif (code >> 48) - within_pop(code & (1 << 30) - 1, beg) >= L:
yield rid, code & (1 << 30) - 1

Expand Down
77 changes: 70 additions & 7 deletions woltka/table.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,8 @@
from .file import openzip
from .biom import (
table_to_biom, biom_to_table, write_biom, biom_max_f, divide_biom,
scale_biom, filter_biom, round_biom, biom_add_metacol, collapse_biom)
scale_biom, filter_biom, round_biom, biom_add_metacol, clip_biom,
collapse_biom)


def prep_table(profile, samples=None, tree=None, rankdic=None, namedic=None,
Expand Down Expand Up @@ -234,8 +235,7 @@ def read_tsv(fh):
width = len(header)
for line in fh:
row = line.rstrip('\r\n').split('\t')
feature = row[0]
features.append(feature)
features.append(row[0])
data.append([int(x) if x.isdigit() else float(x)
for x in row[1:width]])
# data.append(list(map(int, row[1:width])))
Expand Down Expand Up @@ -560,7 +560,45 @@ def add_metacol(table, dic, name, missing=''):
metadatum[name] = dic.get(feature, missing)


def collapse_table(table, mapping, divide=False, field=None):
def clip_table(table, sep='_'):
"""Remove suffix from feature names in a table.

Parameters
----------
table : biom.Table or tuple
Table to collapse.
sep : str, optional
Separator (after last of which is suffix).

Returns
-------
biom.Table or tuple
Clipped table.

Raises
------
ValueError
A feature ID does not have a suffix.
"""
# redirect to BIOM module
if isinstance(table, Table):
return clip_biom(table, sep)

# remove suffix from feature names
samples = table[2]
width = len(samples)
res = defaultdict(lambda: [0] * width)
for datum, feature in zip(*table[:2]):
left, _, _ = feature.rpartition(sep)
if not left:
raise ValueError(f'Feature "{feature}" does not have a suffix.')
res[left] = list(map(add, res[left], datum))

# reformat table
return list(res.values()), list(res.keys()), samples, [dict() for _ in res]


def collapse_table(table, mapping, divide=False, field=None, suffix=None):
"""Collapse a table by many-to-many mapping.

Parameters
Expand All @@ -573,6 +611,8 @@ def collapse_table(table, mapping, divide=False, field=None):
Whether divide per-target counts by number of targets per source.
field : int, optional
Index of field to be collapsed in a stratified table.
suffix : str, optional
Feature names have a suffix following this delimiter.

Returns
-------
Expand All @@ -582,34 +622,57 @@ def collapse_table(table, mapping, divide=False, field=None):
Raises
------
ValueError
Field index is not present in a feature ID.
A feature ID does not have a suffix or a field.

Notes
-----
Metadata will not be retained in the collapsed table.
"""
# redirect to BIOM module
if isinstance(table, Table):
return collapse_biom(table, mapping, divide, field)
return collapse_biom(table, mapping, divide, suffix, field)

# collapse table
samples = table[2]
width = len(samples)
res = defaultdict(lambda: [0] * width)
for datum, feature in zip(*table[:2]):
if field is not None:

# suffixed feature
if suffix:
left, _, _ = feature.rpartition(suffix)
if not left:
raise ValueError(
f'Feature "{feature}" does not have a suffix.')

# get fields
if field is not None:
fields = [left, feature]

# collapse field to parent
if not field:
feature = left

# stratified feature
elif field is not None:
fields = feature.split('|')
try:
feature = fields[field]
except IndexError:
raise ValueError(
f'Feature "{feature}" has less than {field + 1} fields.')

# map features to targets
if feature not in mapping:
continue
targets = mapping[feature]

# divide feature count by target number
if divide:
k = 1 / len(targets)
datum = [x * k for x in datum]

# add results to same target
for target in targets:
if field is not None:
fields[field] = target
Expand Down
4 changes: 2 additions & 2 deletions woltka/tests/data/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -152,7 +152,7 @@ Same as above, adding:
```bash
woltka classify \
--input align/burst/split \
--trim-sub _ \
--trim-sub \
--map taxonomy/nucl/nucl2tid.txt \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
Expand Down Expand Up @@ -186,7 +186,7 @@ woltka tools merge \
```bash
woltka classify \
--input align/diamond \
--trim-sub _ \
--trim-sub \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--rank free \
Expand Down
Loading