added q2 plugin and doc

mataton · Feb 6, 2021 · f2ef68b · f2ef68b
1 parent 776f7cc
commit f2ef68b
Show file tree

Hide file tree

Showing 21 changed files with 270 additions and 33 deletions.
diff --git a/README.md b/README.md
@@ -17,13 +17,15 @@ Woltka ships with a **QIIME 2 plugin**. [See here for instructions](woltka/q2).
 - [Overview](#overview)
 - [Installation](#installation)
 - [Example usage](#example-usage)
-- Details
+- Main workflow
   - [Input files](doc/input.md)
   - [Output files](doc/output.md)
   - [Classification systems](doc/hierarchy.md)
   - [Classification methods](doc/classify.md)
   - [Coordinates matching](doc/ordinal.md)
   - [Stratification](doc/stratify.md)
+- Profile tools
+  - [Collapsing](doc/collapse.md)
 - Tutorials
   - [Working with WoL](doc/wol.md)
   - [gOTU analysis](doc/gotu.md)

diff --git a/doc/cli.md b/doc/cli.md
@@ -89,7 +89,7 @@ Option | Description
 
 ### Filter
 
-Filter a profile by per-sample abundance.
+Filter a profile by **per-sample** abundance.
 
 Option | Description
 --- | ---
@@ -106,3 +106,15 @@ Option | Description
 --- | ---
 `--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
 `--output`, `-o` (required) | Path to output profile.
+
+### Collapse
+
+Collapse a profile based on feature mapping (supports **many-to-many** mapping).
+
+Option | Description
+--- | ---
+`--input`, `-i` (required) | Path to input profile.
+`--map`, `-m` (required) | Path to mapping of source features to target features.
+`--output`, `-o` (required) | Path to output profile.
+`--normalize`, `-z` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
+`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
diff --git a/doc/collapse.md b/doc/collapse.md
@@ -0,0 +1,95 @@
+# Profile collapsing
+
+The **profile collapsing** function is a lightweight and flexible addition to the main classification workflow. It allows the user to convert an _existing_ profile based on a mapping of source features to target features. It highlights the support for **many-to-many mapping**.
+
+```bash
+woltka tools collapse -i input.biom -m mapping.txt -o output.biom
+```
+
+With this tool one can achieve the following goals:
+
+1. Translate feature IDs into names or descriptions.
+..* Example: Translate taxonomic IDs to taxon names.
+..* Example: Translate [UniRef](https://www.uniprot.org/help/uniref) IDs to protein names, while **merging** same names.
+
+2. Group lower features into higher categories.
+..* Example: Convert genera into families.
+
+3. Convert lower features into higher ones, where each lower feature may correspond to **multiple** higher features.
+..* Example: Convert KEGG [orthologs](https://www.genome.jp/kegg/ko.html) to [pathways](https://www.genome.jp/kegg/pathway.html).
+..* Example: Convert [GO](http://geneontology.org/docs/ontology-documentation/) terms to [GO Slim](http://www-legacy.geneontology.org/GO.slims.shtml) terms.
+
+The last usage is an important complement to the main classification workflow, which currently relies on a tree structure and does not support one-to-many mapping. This can be achieved by using the profile collapsing function (although one can only move up one level per run).
+
+
+## Mapping file format
+
+A mapping file is a text file with entries separated by tabs. The number of fields per line is _arbitrary_. The first field is the source feature ID. The second to last fields are target feature ID(s). Duplicates in sources or targets are allowed. For examples:
+
+1. One/many-to-one:
+```
+source1 <tab> target1
+source2 <tab> target2
+source3 <tab> target2
+source4 <tab> target3
+...
+```
+
+2. Many-to-many (multiple targets per line):
+```
+source1 <tab> target1
+source2 <tab> target1 <tab> target2
+source3 <tab> target2 <tab> target3 <tab> target4
+...
+```
+
+3. Many-to-many (multiple same sources):
+```
+source1 <tab> target1
+source1 <tab> target2
+source2 <tab> target2
+source3 <tab> target3
+source4 <tab> target3
+...
+```
+
+## Normalization
+
+By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--normalize` or `-z` flag added to the command, each target will be counted 1 / _k_ times.
+
+Whether to enable normalization depends on the nature and aim of your analysis. For example, one gene is involved in two pathways (which isn't uncommon), should each pathway be counted once, or half time?
+
+
+## Feature names
+
+Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will not be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").
+
+
+## Sample workflow
+
+The following example performs functional profiling of shotgun metagenomic data using the [MetaCyc](https://metacyc.org/) database. Compared with the alternative [one-liner solution](wol.md#coordinates-based-functional-classification-using-metacyc), this solution allows one lower unit to be mapped to multiple upper units.
+
+Run main classification workflow once to generate a profile of ORFs:
+
+```bash
+woltka classify -i input.sam --coords annotation/coords.txt.xz -o orf.tsv
+```
+
+Run profile collapsing multiple times and sequentially, each of which generating one additional profile:
+
+```bash
+# ORF to protein
+woltka tools collapse -i orf.tsv -m $mcdir/protein.map -n $mcdir/protein.names -o protein.tsv
+# protein to gene
+woltka tools collapse -i protein.tsv -m $mcdir/protein2gene.map -n $mcdir/gene.names -o gene.tsv
+# protein to enzymatic reaction
+woltka tools collapse -i protein.tsv -m $mcdir/protein2enzrxn.map -n $mcdir/enzrxn.names -o enzrxn.tsv
+# enzymatic reaction to reaction
+woltka tools collapse -i enzrxn.tsv -m $mcdir/enzrxn2reaction.map -n $mcdir/reaction.names -o reaction.tsv
+# reaction to EC number
+woltka tools collapse -i reaction.tsv -m $mcdir/reaction2ec.map -o ec.tsv
+# reaction to pathway
+woltka tools collapse -i reaction.tsv -m $mcdir/reaction2pathway.map -n $mcdir/pathway.names -o pathway.tsv
+# pathway to class
+woltka tools collapse -i pathway.tsv -m $mcdir/pathway2class.map -n $mcdir/class.names -o class.tsv
+```
diff --git a/doc/hierarchy.md b/doc/hierarchy.md
@@ -16,6 +16,7 @@ That being said, Woltka still supports ranked hierarchies and one can instruct t
 - [Taxon name dictionary](#taxon-name-dictionary)
 - [How Woltka handles hierarchies](#how-woltka-handles-hierarchies)
 - [Hierarchy file specs](#hierarchy-file-specifications)
+- [Multiple mapping](#multiple-mapping)
 
 
 ## Supported hierarchy files
@@ -220,3 +221,8 @@ With flag `--map-as-rank`, Woltka will extract a **rank** name from the filename
 - `prot2taxid.txt.gz` => `taxid`
 - `reaction_to_pathway.tsv` => `pathway`
 - `apple-to-orange` = > `orange`
+
+
+## Multiple mapping
+
+As explained above, Woltka's main classification workflow relies on a tree-structured hierarchy, i.e., a classification unit can only point to one parent classification unit. However, in some scenarios one may want to perform one-to-many mapping. Examples include functional analyses, where one gene may be involved in multiple pathways. Woltka provides a [profile collapsing](collapse.md) tool to enable this analysis ([see details](collapse.md)).
diff --git a/doc/input.md b/doc/input.md
@@ -11,7 +11,6 @@ The input files for Woltka are sequence **alignment** files. The term "alignment
 - [Demultiplexing](#demultiplexing)
 - [Subject trimming](#subject-trimming)
 
-
 ## Input filepath
 
 Parameter `--input` or `-i` is to let Woltka know where to find the input alignment file(s). It can be any of the following three scenarios:

diff --git a/doc/install.md b/doc/install.md
@@ -4,7 +4,6 @@
 
 There is no restriction as far as we are aware of. Tested and working on Linux, macOS and Windows systems.
 
-
 ## Software environment
 
 Woltka is written in **Python 3**. One needs at least Python 3.6 to run the program.

diff --git a/doc/ordinal.md b/doc/ordinal.md
@@ -59,7 +59,7 @@ Woltka considers a read-to-gene match if their **overlapping** region reaches a
 
 ## Example command
 
-With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling:
+With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling (also see [here](collapse.md#sample-workflow)):
 
 ```bash
 woltka classify \

diff --git a/doc/output.md b/doc/output.md
@@ -1,6 +1,6 @@
 # Output files
 
-The main output file(s) of Woltka are **feature tables**, in which column headers are sample IDs and row headers are feature IDs, and each cell value represents the _count_ of a particular feature in a particular sample. For example:
+The main output file(s) of Woltka are **feature tables** (a.k.a., **profiles**), in which column headers are sample IDs and row headers are feature IDs, and each cell value represents the _count_ of a particular feature in a particular sample. For example:
 
 Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Name | Rank
 --- | --- | --- | --- | --- | --- | ---
@@ -15,6 +15,7 @@ Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Name | Rank
 - [Output filepath](#output-filepath)
 - [File formats](#file-formats)
 - [Output read maps](#output-read-maps)
+- [Table utilities](#table-utilities)
 
 ## Output filepath
 
@@ -132,3 +133,11 @@ woltka classify \
 ```
 
 In `outmap_dir`, there will be three subdirectories: `phylum`, `genus` and `species`, eaching holding three read map files: `S1.txt.xz`, `S2.txt.xz` and `S3.txt.xz`.
+
+## Table utilities
+
+Woltka provides several utilities under the `tools` menu for table manipulation (both BIOM and TSV are supported and automatically recognized). Here are details:
+
+- [**collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping; supporting many-to-many relationships.
+- **filter**: Filter a profile by per-sample abundance.
+- **merge**: Merge multiple profiles into one profile.
diff --git a/doc/wol.md b/doc/wol.md
@@ -74,6 +74,8 @@ woltka classify \
 
 ## Coordinates-based functional classification using MetaCyc
 
+Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files.
+
 ```bash
 mcdir=annotation/metacyc
 woltka classify \
@@ -90,6 +92,8 @@ woltka classify \
   --output output_dir
 ```
 
+Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](collapse.md#sample-workflow).
+
 ## Stratified taxonomic / functional classification
 
 Say, you want to stratify functional annotations by genus (taxonomy). First, run taxonomic classification at the genus level, and export read-to-genus maps:

diff --git a/woltka/cli.py b/woltka/cli.py
@@ -243,20 +243,19 @@ def merge_cmd(ctx, **kwargs):
 @click.option(
     '--map', '-m', 'map_fp', required=True,
     type=click.Path(exists=True, dir_okay=False),
-    help=('Mapping of lower classification units to higher ones (supports '
+    help=('Mapping of source features to target features. (supports '
           'many-to-many relationships).'))
 @click.option(
     '--output', '-o', 'output_fp', required=True,
     type=click.Path(writable=True, dir_okay=False),
     help='Path to output profile.')
 @click.option(
     '--normalize', '-z', is_flag=True,
-    help=('Count each higher classification unit as 1/k (k is the number of '
-          'higher classification units mapped to a lower one). Otherwise, '
-          'count as one.'))
+    help=('Count each target feature as 1/k (k is the number of targets '
+          'mapped to a source). Otherwise, count as one.'))
 @click.option(
     '--names', '-n', 'names_fp', type=click.Path(exists=True),
-    help='Names of higher classification units to append to output profile.')
+    help='Names of target features to append to the output profile.')
 @click.pass_context
 def collapse_cmd(ctx, **kwargs):
     """Collapse a profile based on feature mapping.

diff --git a/woltka/file.py b/woltka/file.py
@@ -424,6 +424,46 @@ def read_map_all(fh, sep='\t'):
             yield key, rest.rstrip().split(sep)
 
 
+def read_map_many(fh, sep='\t'):
+    """Read many-to-many relationships from a mapping file.
+
+    Parameters
+    ----------
+    fh : file handle
+        Mapping file.
+
+    Returns
+    -------
+    dict of list
+        Key-value(s) pair.
+
+    Notes
+    -----
+    This is a general solution which applies to the following formats:
+    1 (one-to-one):
+        source1 <tab> target1
+        source2 <tab> target2
+        source3 <tab> target3
+        ...
+    2 (one-to-many):
+        source1 <tab> target1
+        source2 <tab> target1 <tab> target2
+        source3 <tab> target2 <tab> target3 <tab> target4
+        ...
+    3 (many-to-many):
+        source1 <tab> target1
+        source1 <tab> target2
+        source2 <tab> target2
+        source3 <tab> target3
+        source4 <tab> target3
+        ...
+    """
+    res = {}
+    for key, values in read_map_all(fh, sep):
+        res.setdefault(key, []).extend(values)
+    return res
+
+
 def write_readmap(fh, qryque, taxque, namedic=None):
     """Write a read map to a tab-delimited file.
 

diff --git a/woltka/q2/README.md b/woltka/q2/README.md
@@ -133,10 +133,10 @@ qiime diversity core-metrics-phylogenetic \
   --output-dir .
 ```
 
-Before moving to the next step (such as the command above), we recommend that you think about whether you want to filter the feature table by per-sample abundance. For example:
+Before moving to the next step (such as the command above), it is recommended to consider **filtering** the feature table by per-sample abundance. For example:
 
 ```bash
-qiime woltka filter-table \
+qiime woltka filter \
   --i-table table.qza \
   --min-percent 0.01 \
   --o-filtered-table filtered.qza

diff --git a/woltka/q2/_format.py b/woltka/q2/_format.py
@@ -35,7 +35,7 @@ def _validate_(self, level):
 
 
 SimpleMapDirFmt = model.SingleFileDirectoryFormat(
-    'SimpleMapDirFmt', 'alignment.map', SimpleMapFormat)
+    'SimpleMapDirFmt', 'mapping.txt', SimpleMapFormat)
 
 
 class NCBINodesFormat(model.TextFileFormat):

diff --git a/woltka/q2/_transformer.py b/woltka/q2/_transformer.py
@@ -29,7 +29,7 @@ def _2(ff: BLAST6OutDirFmt) -> str:
 
 @plugin.register_transformer
 def _3(ff: SimpleMapDirFmt) -> str:
-    return join(str(ff.path), 'alignment.map')
+    return join(str(ff.path), 'mapping.txt')
 
 
 @plugin.register_transformer

diff --git a/woltka/q2/plugin.py b/woltka/q2/plugin.py
@@ -25,9 +25,9 @@
 
 from woltka.workflow import build_mapper, classify as cwf
 from woltka.align import plain_mapper
-from woltka.file import read_map_1st
+from woltka.file import read_map_1st, read_map_many
 from woltka.table import prep_table
-from woltka.biom import table_to_biom, filter_biom
+from woltka.biom import table_to_biom, filter_biom, collapse_biom
 from woltka.tree import read_nodes, read_lineage, read_newick, fill_root
 from woltka.__init__ import __name__, __version__
 
@@ -124,9 +124,9 @@ def classify(alignment:             str,
     return table
 
 
-def filter_table(table:  biom.Table,
-                 min_count:     int = None,
-                 min_percent: float = None) -> biom.Table:
+def psfilter(table:  biom.Table,
+             min_count:     int = None,
+             min_percent: float = None) -> biom.Table:
     """Filter a feature table by per-sample abundance of features.
     """
     # validate parameters
@@ -147,3 +147,15 @@ def filter_table(table:  biom.Table,
     table.generated_by = f'{__name__}-{__version__}'
 
     return table
+
+
+def collapse(table: biom.Table,
+             mapping:      str,
+             normalize:   bool = False) -> biom.Table:
+    """Collapse a feature table based on many-to-many mapping.
+    """
+    with open(mapping, 'r') as fh:
+        mapping = read_map_many(fh)
+    table = collapse_biom(table, mapping, normalize)
+    table.generated_by = f'{__name__}-{__version__}'
+    return table
Original file line number	Diff line number	Diff line change
Expand Up		@@ -4,7 +4,6 @@

		There is no restriction as far as we are aware of. Tested and working on Linux, macOS and Windows systems.


		## Software environment

		Woltka is written in Python 3. One needs at least Python 3.6 to run the program.
Expand Down