qiyunzhu · qiyunzhu · Dec 23, 2022 · Dec 21, 2022 · Dec 21, 2022 · Dec 22, 2022
diff --git a/doc/cli.md b/doc/cli.md
@@ -24,7 +24,7 @@ Option | Description
 `--filext`, `-e` | Input filename extension following sample ID.
 `--samples`, `-s` | List of sample IDs to be included.
 `--demux/--no-demux` | Demultiplex alignment by first underscore in query identifier.
-`--trim-sub` | Trim subject IDs at the last given delimiter.
+`--trim-sub` | Trim subject IDs at the last given delimiter. Can accept the default value "_" or enter a custom value.
 
 ### Hierarchies
 
@@ -145,7 +145,8 @@ Option | Description
 `--map`, `-m` (required) | Path to mapping of source features to target features.
 `--output`, `-o` (required) | Path to output profile.
 `--divide`, `-d` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
-`--field`, `-f` | Index of field to be collapsed in a stratified profile. For example, use `-f 2` to collapse "gene" in "microbe\|gene".
+`--field`, `-f` | Features are stratified (strata delimited by "\|"). For example, if features are like "microbe\|gene", one can use `-f 1` to collapse "microbe" or `-f 2` to collapse "gene".
+`--suffix`, `-s` | Features have suffixes that indicate parent-child relationships. For example, "A_1" represents "1" of "A", and the entire feature is equivalent to "A\|A_1". Can accept the default delimiter "_" or enter a custom delimiter. This parameter overrides the "\|"-delimited strata.
 `--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
 
 ### Coverage

diff --git a/doc/input.md b/doc/input.md
@@ -120,6 +120,10 @@ woltka classify \
 
 ## Subject trimming
 
-The parameter `--trim-sub <delim>` lets Woltka trim subject IDs at the last occurrence of the given delimiter. Examples include trimming off version numbers from NCBI accessions (`--trim-sub .`: `NP_123456.1` => `NP_123456`, trimming off ORF indices from nucleotide sequences (`--trim-sub _`: `Contig_5_20` => `Contig_5`).
+The parameter `--trim-sub <delim>` lets Woltka trim subject IDs at the last occurrence of the given delimiter (default: "_"). For examples:
+
+1. Trim off ORF indices from nucleotide sequences: Add `--trim-sub` (without a value). Effect: "Contig_5_20" becomes "Contig_5".
+
+2. Trim off version numbers from NCBI accessions: Add `--trim-sub .` (dot). Effect: "NP_123456.1" becomes "NP_123456".
 
 This allows flexible adaptation to alternative classification systems without manually editing the alignment files. A use case is the stratified structural/functional classification (see [details](stratify.md)).
diff --git a/doc/stratify.md b/doc/stratify.md
@@ -66,12 +66,12 @@ Firmicutes\|GO:0000006 | 0 | 0 | 0 | 18
 
 With this method, we start with the alignments of reads against reference **genes** (not genomes). This methods is faster, with a cost in accuracy (since genes only represent a proportion of their host genomes, and are confounded by homology, copy number, boundaries etc.). Therefore, this methods only demonstrates what "_you could_". However, if you already obtained and decided to proceed with gene alignments (for example, you already ran BLAST on UniProt), this method is worth consideration.
 
-The following example is also based on the [sample data](../woltka/tests/data). Here we used the gene alignment results computed by DIAMOND. The gene IDs are Prodigal-annotated ORFs from the reference genomes, in the format of `genome ID <underscore> index` (e.g., `G000123456_20`). With parameter `--trim-sub _`, Woltka converts them to genome IDs, and everything after is straightforward.
+The following example is also based on the [sample data](../woltka/tests/data). Here we used the gene alignment results computed by DIAMOND. The gene IDs are Prodigal-annotated ORFs from the reference genomes, in the format of `genome ID <underscore> index` (e.g., `G000123456_20`). With flag `--trim-sub`, Woltka converts them to genome IDs, and everything after is straightforward.
 
 ```bash
 woltka classify \
   -i align/diamond \
-  --trim-sub _ \
+  --trim-sub \
   --map taxonomy/taxid.map \
   --nodes taxonomy/nodes.dmp \
   --names taxonomy/names.dmp \

diff --git a/woltka/biom.py b/woltka/biom.py
@@ -194,7 +194,46 @@ def biom_add_metacol(table: biom.Table, dic, name, missing=''):
     table.add_metadata(metadata, axis='observation')
 
 
-def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
+def clip_biom(table: biom.Table, sep='_'):
+    """Remove suffix from feature names in a BIOM table.
+
+    Parameters
+    ----------
+    table : biom.Table
+        Table to collapse.
+    sep : str, optional
+        Separator (after last of which is suffix).
+
+    Returns
+    -------
+    biom.Table
+        Clipped BIOM table.
+
+    Raises
+    ------
+    ValueError
+        A feature ID does not have a suffix.
+
+    Notes
+    -----
+    Metadata will not be retained in the collapsed table.
+
+    See Also
+    --------
+    .table.clip_table
+    """
+    def f(id_, _):
+        left, _, _ = id_.rpartition(sep)
+        if not left:
+            raise ValueError(f'Feature "{id_}" does not have a suffix.')
+        return left
+
+    return table.collapse(f, norm=False, axis='observation',
+                          include_collapsed_metadata=False)
+
+
+def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None,
+                  suffix=None):
     """Collapse a BIOM table in many-to-many mode.
 
     Parameters
@@ -207,6 +246,8 @@ def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
         Whether divide per-target counts by number of targets per source.
     field : int, optional
         Index of field to be collapsed in a stratified table.
+    suffix : str, optional
+        Feature names have a suffix following this delimiter.
 
     Returns
     -------
@@ -216,7 +257,7 @@ def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
     Raises
     ------
     ValueError
-        Field index is not present in a feature ID.
+        A feature ID does not have a suffix or a field.
 
     Notes
     -----
@@ -230,7 +271,16 @@ def collapse_biom(table: biom.Table, mapping: dict, divide=False, field=None):
     metadata = {}
     for id_ in table.ids('observation'):
         feature = id_
-        if field is not None:
+        if suffix:
+            left, _, _ = feature.rpartition(suffix)
+            if not left:
+                raise ValueError(
+                    f'Feature "{feature}" does not have a suffix.')
+            if field is not None:
+                fields = [left, feature]
+            if not field:
+                feature = left
+        elif field is not None:
             fields = feature.split('|')
             try:
                 feature = fields[field]

diff --git a/woltka/cli.py b/woltka/cli.py
@@ -61,8 +61,9 @@ def cli():
     '--demux/--no-demux', default=None,
     help='Demultiplex alignment by first underscore in query identifier.')
 @click.option(
-    '--trim-sub', 'trimsub',
-    help='Trim subject IDs at the last given delimiter.')
+    '--trim-sub', 'trimsub', is_flag=False, flag_value='_',
+    help=('Trim subject IDs at the last given delimiter. Default: "_", or '
+          'enter a custom value.'))
 # hierarchies
 @click.option(
     '--nodes', 'nodes_fps', type=click.Path(exists=True), multiple=True,
@@ -260,22 +261,28 @@ def merge_cmd(ctx, **kwargs):
     '--input', '-i', 'input_fp', required=True,
     type=click.Path(exists=True, dir_okay=False),
     help='Path to input profile.')
-@click.option(
-    '--map', '-m', 'map_fp', required=True,
-    type=click.Path(exists=True, dir_okay=False),
-    help=('Mapping of source features to target features. (supports '
-          'many-to-many relationships).'))
 @click.option(
     '--output', '-o', 'output_fp', required=True,
     type=click.Path(writable=True, dir_okay=False),
     help='Path to output profile.')
+@click.option(
+    '--map', '-m', 'map_fp',
+    type=click.Path(exists=True, dir_okay=False),
+    help=('Mapping of source features to target features. Supports '
+          'many-to-many relationships. Required unless with "-s"'))
 @click.option(
     '--divide', '-d', is_flag=True,
     help=('Count each target feature as 1/k (k is the number of targets '
           'mapped to a source). Otherwise, count as one.'))
 @click.option(
     '--field', '-f', type=click.INT,
-    help='Index of field to be collapsed in a stratified profile.')
+    help=('Features are stratified (strata delimited by "|"), and the x-th '
+          'field is to be collapsed, while other fields stay the same.'))
+@click.option(
+    '--suffix', '-s', is_flag=False, flag_value='_',
+    help=('Features have suffixes that indicate parent-child relationships. '
+          'For example, "A_1" represents "1" of "A". Enter an delimiter if '
+          'not "_". Overrides "|"-delimited strata.'))
 @click.option(
     '--names', '-n', 'names_fp', type=click.Path(exists=True),
     help='Names of target features to append to the output profile.')

diff --git a/woltka/ordinal.py b/woltka/ordinal.py
@@ -628,8 +628,8 @@ def match_read_gene_quart(geneque, readque):
 
                 # when gene ends, check overlap and remove it from cache
                 # gene end must be <= read end
-                # if gene not found in cache (meaning gene started before read
-                # region), use read start, otherwise use gene start
+                # if gene not found in cache (gene started before read region),
+                # use read start, otherwise use gene start
                 elif (code >> 48) - within_pop(code & (1 << 30) - 1, beg) >= L:
                     yield rid, code & (1 << 30) - 1
 

diff --git a/woltka/table.py b/woltka/table.py
@@ -21,7 +21,8 @@
 from .file import openzip
 from .biom import (
     table_to_biom, biom_to_table, write_biom, biom_max_f, divide_biom,
-    scale_biom, filter_biom, round_biom, biom_add_metacol, collapse_biom)
+    scale_biom, filter_biom, round_biom, biom_add_metacol, clip_biom,
+    collapse_biom)
 
 
 def prep_table(profile, samples=None, tree=None, rankdic=None, namedic=None,
@@ -234,8 +235,7 @@ def read_tsv(fh):
     width = len(header)
     for line in fh:
         row = line.rstrip('\r\n').split('\t')
-        feature = row[0]
-        features.append(feature)
+        features.append(row[0])
         data.append([int(x) if x.isdigit() else float(x)
                      for x in row[1:width]])
         # data.append(list(map(int, row[1:width])))
@@ -560,7 +560,45 @@ def add_metacol(table, dic, name, missing=''):
         metadatum[name] = dic.get(feature, missing)
 
 
-def collapse_table(table, mapping, divide=False, field=None):
+def clip_table(table, sep='_'):
+    """Remove suffix from feature names in a table.
+
+    Parameters
+    ----------
+    table : biom.Table or tuple
+        Table to collapse.
+    sep : str, optional
+        Separator (after last of which is suffix).
+
+    Returns
+    -------
+    biom.Table or tuple
+        Clipped table.
+
+    Raises
+    ------
+    ValueError
+        A feature ID does not have a suffix.
+    """
+    # redirect to BIOM module
+    if isinstance(table, Table):
+        return clip_biom(table, sep)
+
+    # remove suffix from feature names
+    samples = table[2]
+    width = len(samples)
+    res = defaultdict(lambda: [0] * width)
+    for datum, feature in zip(*table[:2]):
+        left, _, _ = feature.rpartition(sep)
+        if not left:
+            raise ValueError(f'Feature "{feature}" does not have a suffix.')
+        res[left] = list(map(add, res[left], datum))
+
+    # reformat table
+    return list(res.values()), list(res.keys()), samples, [dict() for _ in res]
+
+
+def collapse_table(table, mapping, divide=False, field=None, suffix=None):
     """Collapse a table by many-to-many mapping.
 
     Parameters
@@ -573,6 +611,8 @@ def collapse_table(table, mapping, divide=False, field=None):
         Whether divide per-target counts by number of targets per source.
     field : int, optional
         Index of field to be collapsed in a stratified table.
+    suffix : str, optional
+        Feature names have a suffix following this delimiter.
 
     Returns
     -------
@@ -582,34 +622,57 @@ def collapse_table(table, mapping, divide=False, field=None):
     Raises
     ------
     ValueError
-        Field index is not present in a feature ID.
+        A feature ID does not have a suffix or a field.
 
     Notes
     -----
     Metadata will not be retained in the collapsed table.
     """
     # redirect to BIOM module
     if isinstance(table, Table):
-        return collapse_biom(table, mapping, divide, field)
+        return collapse_biom(table, mapping, divide, suffix, field)
 
     # collapse table
     samples = table[2]
     width = len(samples)
     res = defaultdict(lambda: [0] * width)
     for datum, feature in zip(*table[:2]):
-        if field is not None:
+
+        # suffixed feature
+        if suffix:
+            left, _, _ = feature.rpartition(suffix)
+            if not left:
+                raise ValueError(
+                    f'Feature "{feature}" does not have a suffix.')
+
+            # get fields
+            if field is not None:
+                fields = [left, feature]
+
+            # collapse field to parent
+            if not field:
+                feature = left
+
+        # stratified feature
+        elif field is not None:
             fields = feature.split('|')
             try:
                 feature = fields[field]
             except IndexError:
                 raise ValueError(
                     f'Feature "{feature}" has less than {field + 1} fields.')
+
+        # map features to targets
         if feature not in mapping:
             continue
         targets = mapping[feature]
+
+        # divide feature count by target number
         if divide:
             k = 1 / len(targets)
             datum = [x * k for x in datum]
+
+        # add results to same target
         for target in targets:
             if field is not None:
                 fields[field] = target

diff --git a/woltka/tests/data/README.md b/woltka/tests/data/README.md
@@ -152,7 +152,7 @@ Same as above, adding:
 ```bash
 woltka classify \
   --input align/burst/split \
-  --trim-sub _ \
+  --trim-sub \
   --map taxonomy/nucl/nucl2tid.txt \
   --nodes taxonomy/nodes.dmp \
   --names taxonomy/names.dmp \
@@ -186,7 +186,7 @@ woltka tools merge \
 ```bash
 woltka classify \
   --input align/diamond \
-  --trim-sub _ \
+  --trim-sub \
   --map taxonomy/taxid.map \
   --nodes taxonomy/nodes.dmp \
   --rank free \