From f121e23f8bbace55de2e83335ecc87743e77595e Mon Sep 17 00:00:00 2001
From: qiyunzhu <qiyunzhu@gmail.com>
Date: Sat, 6 Feb 2021 23:06:30 -0700
Subject: [PATCH] added metacyc

---
 doc/collapse.md  |  30 ------------
 doc/hierarchy.md |   2 +-
 doc/metacyc.md   | 119 +++++++++++++++++++++++++++++++++++++++++++++++
 doc/ordinal.md   |   2 +-
 doc/stratify.md  |   8 ++--
 doc/wol.md       |  18 ++++---
 6 files changed, 133 insertions(+), 46 deletions(-)
 create mode 100644 doc/metacyc.md

diff --git a/doc/collapse.md b/doc/collapse.md
index 2bd5c9a..d8c7904 100644
--- a/doc/collapse.md
+++ b/doc/collapse.md
@@ -64,33 +64,3 @@ Whether to enable normalization depends on the nature and aim of your analysis.
 ### Feature names
 
 Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").
-
-
-## Sample workflow
-
-The following example performs functional profiling of shotgun metagenomic data using the [MetaCyc](https://metacyc.org/) database. Compared with the alternative [one-liner solution](wol.md#coordinates-based-functional-classification-using-metacyc), this solution allows one lower unit to be mapped to multiple upper units.
-
-Run main classification workflow once to generate a profile of ORFs:
-
-```bash
-woltka classify -i input.sam --coords annotation/coords.txt.xz -o orf.tsv
-```
-
-Run profile collapsing multiple times and sequentially, each of which generating one additional profile:
-
-```bash
-# ORF to protein
-woltka tools collapse -i orf.tsv -m $mcdir/protein.map -n $mcdir/protein.names -o protein.tsv
-# protein to gene
-woltka tools collapse -i protein.tsv -m $mcdir/protein2gene.map -n $mcdir/gene.names -o gene.tsv
-# protein to enzymatic reaction
-woltka tools collapse -i protein.tsv -m $mcdir/protein2enzrxn.map -n $mcdir/enzrxn.names -o enzrxn.tsv
-# enzymatic reaction to reaction
-woltka tools collapse -i enzrxn.tsv -m $mcdir/enzrxn2reaction.map -n $mcdir/reaction.names -o reaction.tsv
-# reaction to EC number
-woltka tools collapse -i reaction.tsv -m $mcdir/reaction2ec.map -o ec.tsv
-# reaction to pathway
-woltka tools collapse -i reaction.tsv -m $mcdir/reaction2pathway.map -n $mcdir/pathway.names -o pathway.tsv
-# pathway to class
-woltka tools collapse -i pathway.tsv -m $mcdir/pathway2class.map -n $mcdir/class.names -o class.tsv
-```
diff --git a/doc/hierarchy.md b/doc/hierarchy.md
index e314ce4..e2d2ba9 100644
--- a/doc/hierarchy.md
+++ b/doc/hierarchy.md
@@ -220,7 +220,7 @@ With flag `--map-as-rank`, Woltka will extract a **rank** name from the filename
 - `uniref.map` => `uniref`
 - `prot2taxid.txt.gz` => `taxid`
 - `reaction_to_pathway.tsv` => `pathway`
-- `apple-to-orange` = > `orange`
+- `apple-to-orange` => `orange`
 
 
 ## Multiple mapping
diff --git a/doc/metacyc.md b/doc/metacyc.md
new file mode 100644
index 0000000..e2cfb18
--- /dev/null
+++ b/doc/metacyc.md
@@ -0,0 +1,119 @@
+# Working with MetaCyc
+
+MetaCyc (https://metacyc.org/) ([Caspi et al 2020](https://academic.oup.com/nar/article/48/D1/D445/5581728)) is a metabolic pathway database that has been widely used in genomic, metagenomic and metabolomic studies. It provides a hierarchical classification system, including genes, proteins, reactions, compounds, pathways and more.
+
+
+## Contents
+
+- [Mapping files](#mapping-files)
+- [Protein profiling](#protein-profiling)
+- [Genes, reactions, compounds and pathways](#genes,-reactions,-compounds-and-pathways)
+- [Pathway coverage](#pathway-coverage)
+
+
+## Mapping files
+
+We mapped all ORFs from the [WoL](wol.md) reference genome database to the reference protein sequences in MetaCyc release 23.0. We provide this mapping file, as well as Woltka-compatible mapping and annotation files representing the higher levels in the MetaCyc classification system corresponding to the mapped WoL ORFs. These files are publicly available under the `annotation/metacyc/` directory of the [WoL data release](https://app.globus.org/file-manager/collections/31acbeb8-c62f-11ea-bef9-0e716405a293) ([see details](wol.md)).
+
+We also included a UniRef-to-MetaCyc mapping file, extracted from the UniProt data release and subsetted to WoL. It contains less entries though.
+
+The original, full MetaCyc database is available for download from the [official website](https://metacyc.org/). It includes the reference protein sequences with which one can perform custom alignments.
+
+
+## Protein profiling
+
+The following command utilizes Woltka's [ordinal classification](ordinal.md) function to classify sequences to **proteins** -- the entry point of the MetaCyc hierarchies.
+
+```bash
+woltka classify \
+  --input input_dir \
+  --coords coords.txt.xz \
+  --map wol-to-protein.txt.xz \
+  --names protein_name.txt \
+  --map-as-rank \
+  --rank protein \
+  --output protein.biom
+```
+
+* Note: The `--names` parameter is optional.
+* Note: Replacing `.biom` with `.tsv` can generate TSV output.
+
+Alternatively, one can split this command into two:
+
+```bash
+woltka classify -i input_dir -c coords.txt.xz -o orf.biom
+woltka tools collapse -i orf.biom -m wol-to-protein.txt.xz -n protein_name.txt -o protein.biom
+```
+
+
+## Genes, reactions, compounds and pathways
+
+The MetaCyc hierarcies are as follows:
+
+```
+               v
+       go < protein > gene > pathway
+               v
+regulation < enzrxn
+               v
+       ec < reaction > compound (left / right) > type
+               v
+     type < pathway > taxonomic range
+               v
+         super pathway
+               v
+              type
+```
+
+All transitions are enabled using Woltka's [collapse](collapse.md) command with individual mapping files. For example: one generate profiles along the following cascade:
+
+```
+protein - enzrxn - reaction - pathway - super pathway - type
+```
+
+Using the following commands:
+
+```bash
+# protein to enzrxn (enzymatic reaction):
+woltka tools collapse -i protein.biom -m metacyc/protein-to-enzrxn.txt -n metacyc/enzrxn_name.txt -o enzrxn.biom
+# enzrxn to reaction:
+woltka tools collapse -i enzrxn.biom -m metacyc/enzrxn-to-reaction.txt -n metacyc/reaction_name.txt -o reaction.biom
+# reaction to pathway:
+woltka tools collapse -i reaction.biom -m metacyc/reaction-to-pathway.txt -n metacyc/pathway_name.txt -o pathway.biom
+# pathway to super pathway:
+woltka tools collapse -i pathway.biom -m metacyc/pathway-to-super_pathway.txt -n metacyc/pathway_name.txt -o super_pathway.biom
+# super pathway (or pathway) to pathway type:
+woltka tools collapse -i super_pathway.biom -m metacyc/pathway_type.txt -n metacyc/all_class_name.txt -o pathway_type.biom
+```
+
+The collapse command supports **many-to-many** mapping. For example, if one reaction is found in three pathways, each pathway will be counted **once**. In some instances (e.g., to retain **compositionality** of the profile), one may consider adding the `--normalize` flag, which will instruct the program to count each pathway 1 / 3 times ([see details](collapse.md)).
+
+
+## Pathway coverage
+
+It is usually important to assess how **completed**, in addition to how abundant a metabolic pathway is in a sample. This is because a pathway can only function when all components are present. Woltka's [coverage](coverage.md) command provides a solution ([see details](coverage.md)).
+
+Among the mapping files there are two files recording the composition of metabolic pathways in terms of **genes** and **reactions**. We recommend using the reaction mapping because there are duplicated gene IDs.
+
+```bash
+woltka tools coverage -i reaction.biom -m pathway-to-reaction_list.txt -o pathway_coverage.biom
+```
+
+The output file is a **coverage** table in which every cell value represents the percentage of member reactions of a particular pathway present in a particular sample.
+
+
+## Stratifying by taxonomy
+
+Woltka allows overlaying two or more classification schemes on the same profile using the [stratification](stratify.md) function. With this function, one can identify functional genes and metabolic pathways that belong to individual taxonomic groups.
+
+First, perform taxonomic classification at the desired rank (e.g., genus) and generate read-to-genus maps (using parameter `--outmap` or `-u`).
+
+```bash
+woltka classify -i input_dir --lineage lineage.txt -r genus -o genus.biom -u map_dir
+```
+
+Second, perform functional classification. This command is identical to the first command in this document, except for the addition of `--stratify` or `-t` parameter pointing to the genus maps, which will be incorporated into the functional classes ([see details](stratify.md)).
+
+```bash
+woltka classify -i input_dir -c coords.txt.xz -m wol-to-protein.txt.xz --map-as-rank -r protein -t map_dir -o protein.biom
+```
diff --git a/doc/ordinal.md b/doc/ordinal.md
index d5f5e76..4a6ad81 100644
--- a/doc/ordinal.md
+++ b/doc/ordinal.md
@@ -59,7 +59,7 @@ Woltka considers a read-to-gene match if their **overlapping** region reaches a
 
 ## Example command
 
-With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling (also see [here](collapse.md#sample-workflow)):
+With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling:
 
 ```bash
 woltka classify \
diff --git a/doc/stratify.md b/doc/stratify.md
index 4dd53dc..57b877e 100644
--- a/doc/stratify.md
+++ b/doc/stratify.md
@@ -17,7 +17,7 @@ Here is an example workflow, based on the commands and [sample data](../woltka/t
 
 ### Step 1: Taxonomic classification
 
-```
+```bash
 woltka classify \
   -i align/bowtie2 \
   --map taxonomy/g2tid.txt \
@@ -33,7 +33,7 @@ The parameter `--outmap mapdir` will instruct Woltka to output read-to-taxon map
 
 ### Step 2: Combined taxonomic/functional classification
 
-```
+```bash
 woltka classify \
   -i align/bowtie2 \
   --coords function/coords.txt.xz \
@@ -69,7 +69,7 @@ With this method, we start with the alignments of reads against reference **gene
 
 The following example is also based on the [sample data](../woltka/tests/data). Here we used the gene alignment results computed by DIAMOND. The gene IDs are Prodigal-annotated ORFs from the reference genomes, in the format of `genome ID <underscore> index` (e.g., `G000123456_20`). With parameter `--trim-sub _`, Woltka converts them to genome IDs, and everything after is straightforward.
 
-```
+```bash
 woltka classify \
   -i align/diamond \
   --trim-sub _ \
@@ -84,7 +84,7 @@ woltka classify \
 
 Then, there is no need to provide the coordinates file during functional classification:
 
-```
+```bash
 woltka classify \
   -i align/diamond \
   --map function/uniref.map.xz \
diff --git a/doc/wol.md b/doc/wol.md
index 43419c2..07838d9 100644
--- a/doc/wol.md
+++ b/doc/wol.md
@@ -72,27 +72,25 @@ woltka classify \
   --output output.biom
 ```
 
-## Coordinates-based functional classification using MetaCyc
+## Coordinates-based functional classification
 
-Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files.
+Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files. The example below uses the MetaCyc system ([see details](metacyc.md)):
 
 ```bash
 mcdir=annotation/metacyc
 woltka classify \
   --input input.sam \
   --coords annotation/coords.txt.xz \
-  --map annotation/uniref.map.xz \
-  --map $mcdir/protein.map --names $mcdir/protein.names \
-  --map $mcdir/protein2enzrxn.map --names $mcdir/enzrxn.names \
-  --map $mcdir/enzrxn2reaction.map --names $mcdir/reaction.names \
-  --map $mcdir/reaction2pathway.map --names $mcdir/pathway.names \
-  --map $mcdir/pathway2class.map --names $mcdir/class.names \
+  --map $mcdir/wol-to-protein.txt.xz --names $mcdir/protein_name.txt \
+  --map $mcdir/protein-to-enzrxn.txt --names $mcdir/enzrxn_name.txt \
+  --map $mcdir/enzrxn-to-reaction.txt --names $mcdir/reaction_name.txt \
+  --map $mcdir/reaction-to-pathway.txt --names $mcdir/pathway_name.txt \
   --map-as-rank \
-  --rank protein,enzrxn,reaction,pathway,class \
+  --rank protein,enzrxn,reaction,pathway \
   --output output_dir
 ```
 
-Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](collapse.md#sample-workflow).
+Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](metacyc.md).
 
 ## Stratified taxonomic / functional classification