From f121e23f8bbace55de2e83335ecc87743e77595e Mon Sep 17 00:00:00 2001 From: qiyunzhu Date: Sat, 6 Feb 2021 23:06:30 -0700 Subject: [PATCH] added metacyc --- doc/collapse.md | 30 ------------ doc/hierarchy.md | 2 +- doc/metacyc.md | 119 +++++++++++++++++++++++++++++++++++++++++++++++ doc/ordinal.md | 2 +- doc/stratify.md | 8 ++-- doc/wol.md | 18 ++++--- 6 files changed, 133 insertions(+), 46 deletions(-) create mode 100644 doc/metacyc.md diff --git a/doc/collapse.md b/doc/collapse.md index 2bd5c9a..d8c7904 100644 --- a/doc/collapse.md +++ b/doc/collapse.md @@ -64,33 +64,3 @@ Whether to enable normalization depends on the nature and aim of your analysis. ### Feature names Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name"). - - -## Sample workflow - -The following example performs functional profiling of shotgun metagenomic data using the [MetaCyc](https://metacyc.org/) database. Compared with the alternative [one-liner solution](wol.md#coordinates-based-functional-classification-using-metacyc), this solution allows one lower unit to be mapped to multiple upper units. - -Run main classification workflow once to generate a profile of ORFs: - -```bash -woltka classify -i input.sam --coords annotation/coords.txt.xz -o orf.tsv -``` - -Run profile collapsing multiple times and sequentially, each of which generating one additional profile: - -```bash -# ORF to protein -woltka tools collapse -i orf.tsv -m $mcdir/protein.map -n $mcdir/protein.names -o protein.tsv -# protein to gene -woltka tools collapse -i protein.tsv -m $mcdir/protein2gene.map -n $mcdir/gene.names -o gene.tsv -# protein to enzymatic reaction -woltka tools collapse -i protein.tsv -m $mcdir/protein2enzrxn.map -n $mcdir/enzrxn.names -o enzrxn.tsv -# enzymatic reaction to reaction -woltka tools collapse -i enzrxn.tsv -m $mcdir/enzrxn2reaction.map -n $mcdir/reaction.names -o reaction.tsv -# reaction to EC number -woltka tools collapse -i reaction.tsv -m $mcdir/reaction2ec.map -o ec.tsv -# reaction to pathway -woltka tools collapse -i reaction.tsv -m $mcdir/reaction2pathway.map -n $mcdir/pathway.names -o pathway.tsv -# pathway to class -woltka tools collapse -i pathway.tsv -m $mcdir/pathway2class.map -n $mcdir/class.names -o class.tsv -``` diff --git a/doc/hierarchy.md b/doc/hierarchy.md index e314ce4..e2d2ba9 100644 --- a/doc/hierarchy.md +++ b/doc/hierarchy.md @@ -220,7 +220,7 @@ With flag `--map-as-rank`, Woltka will extract a **rank** name from the filename - `uniref.map` => `uniref` - `prot2taxid.txt.gz` => `taxid` - `reaction_to_pathway.tsv` => `pathway` -- `apple-to-orange` = > `orange` +- `apple-to-orange` => `orange` ## Multiple mapping diff --git a/doc/metacyc.md b/doc/metacyc.md new file mode 100644 index 0000000..e2cfb18 --- /dev/null +++ b/doc/metacyc.md @@ -0,0 +1,119 @@ +# Working with MetaCyc + +MetaCyc (https://metacyc.org/) ([Caspi et al 2020](https://academic.oup.com/nar/article/48/D1/D445/5581728)) is a metabolic pathway database that has been widely used in genomic, metagenomic and metabolomic studies. It provides a hierarchical classification system, including genes, proteins, reactions, compounds, pathways and more. + + +## Contents + +- [Mapping files](#mapping-files) +- [Protein profiling](#protein-profiling) +- [Genes, reactions, compounds and pathways](#genes,-reactions,-compounds-and-pathways) +- [Pathway coverage](#pathway-coverage) + + +## Mapping files + +We mapped all ORFs from the [WoL](wol.md) reference genome database to the reference protein sequences in MetaCyc release 23.0. We provide this mapping file, as well as Woltka-compatible mapping and annotation files representing the higher levels in the MetaCyc classification system corresponding to the mapped WoL ORFs. These files are publicly available under the `annotation/metacyc/` directory of the [WoL data release](https://app.globus.org/file-manager/collections/31acbeb8-c62f-11ea-bef9-0e716405a293) ([see details](wol.md)). + +We also included a UniRef-to-MetaCyc mapping file, extracted from the UniProt data release and subsetted to WoL. It contains less entries though. + +The original, full MetaCyc database is available for download from the [official website](https://metacyc.org/). It includes the reference protein sequences with which one can perform custom alignments. + + +## Protein profiling + +The following command utilizes Woltka's [ordinal classification](ordinal.md) function to classify sequences to **proteins** -- the entry point of the MetaCyc hierarchies. + +```bash +woltka classify \ + --input input_dir \ + --coords coords.txt.xz \ + --map wol-to-protein.txt.xz \ + --names protein_name.txt \ + --map-as-rank \ + --rank protein \ + --output protein.biom +``` + +* Note: The `--names` parameter is optional. +* Note: Replacing `.biom` with `.tsv` can generate TSV output. + +Alternatively, one can split this command into two: + +```bash +woltka classify -i input_dir -c coords.txt.xz -o orf.biom +woltka tools collapse -i orf.biom -m wol-to-protein.txt.xz -n protein_name.txt -o protein.biom +``` + + +## Genes, reactions, compounds and pathways + +The MetaCyc hierarcies are as follows: + +``` + v + go < protein > gene > pathway + v +regulation < enzrxn + v + ec < reaction > compound (left / right) > type + v + type < pathway > taxonomic range + v + super pathway + v + type +``` + +All transitions are enabled using Woltka's [collapse](collapse.md) command with individual mapping files. For example: one generate profiles along the following cascade: + +``` +protein - enzrxn - reaction - pathway - super pathway - type +``` + +Using the following commands: + +```bash +# protein to enzrxn (enzymatic reaction): +woltka tools collapse -i protein.biom -m metacyc/protein-to-enzrxn.txt -n metacyc/enzrxn_name.txt -o enzrxn.biom +# enzrxn to reaction: +woltka tools collapse -i enzrxn.biom -m metacyc/enzrxn-to-reaction.txt -n metacyc/reaction_name.txt -o reaction.biom +# reaction to pathway: +woltka tools collapse -i reaction.biom -m metacyc/reaction-to-pathway.txt -n metacyc/pathway_name.txt -o pathway.biom +# pathway to super pathway: +woltka tools collapse -i pathway.biom -m metacyc/pathway-to-super_pathway.txt -n metacyc/pathway_name.txt -o super_pathway.biom +# super pathway (or pathway) to pathway type: +woltka tools collapse -i super_pathway.biom -m metacyc/pathway_type.txt -n metacyc/all_class_name.txt -o pathway_type.biom +``` + +The collapse command supports **many-to-many** mapping. For example, if one reaction is found in three pathways, each pathway will be counted **once**. In some instances (e.g., to retain **compositionality** of the profile), one may consider adding the `--normalize` flag, which will instruct the program to count each pathway 1 / 3 times ([see details](collapse.md)). + + +## Pathway coverage + +It is usually important to assess how **completed**, in addition to how abundant a metabolic pathway is in a sample. This is because a pathway can only function when all components are present. Woltka's [coverage](coverage.md) command provides a solution ([see details](coverage.md)). + +Among the mapping files there are two files recording the composition of metabolic pathways in terms of **genes** and **reactions**. We recommend using the reaction mapping because there are duplicated gene IDs. + +```bash +woltka tools coverage -i reaction.biom -m pathway-to-reaction_list.txt -o pathway_coverage.biom +``` + +The output file is a **coverage** table in which every cell value represents the percentage of member reactions of a particular pathway present in a particular sample. + + +## Stratifying by taxonomy + +Woltka allows overlaying two or more classification schemes on the same profile using the [stratification](stratify.md) function. With this function, one can identify functional genes and metabolic pathways that belong to individual taxonomic groups. + +First, perform taxonomic classification at the desired rank (e.g., genus) and generate read-to-genus maps (using parameter `--outmap` or `-u`). + +```bash +woltka classify -i input_dir --lineage lineage.txt -r genus -o genus.biom -u map_dir +``` + +Second, perform functional classification. This command is identical to the first command in this document, except for the addition of `--stratify` or `-t` parameter pointing to the genus maps, which will be incorporated into the functional classes ([see details](stratify.md)). + +```bash +woltka classify -i input_dir -c coords.txt.xz -m wol-to-protein.txt.xz --map-as-rank -r protein -t map_dir -o protein.biom +``` diff --git a/doc/ordinal.md b/doc/ordinal.md index d5f5e76..4a6ad81 100644 --- a/doc/ordinal.md +++ b/doc/ordinal.md @@ -59,7 +59,7 @@ Woltka considers a read-to-gene match if their **overlapping** region reaches a ## Example command -With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling (also see [here](collapse.md#sample-workflow)): +With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling: ```bash woltka classify \ diff --git a/doc/stratify.md b/doc/stratify.md index 4dd53dc..57b877e 100644 --- a/doc/stratify.md +++ b/doc/stratify.md @@ -17,7 +17,7 @@ Here is an example workflow, based on the commands and [sample data](../woltka/t ### Step 1: Taxonomic classification -``` +```bash woltka classify \ -i align/bowtie2 \ --map taxonomy/g2tid.txt \ @@ -33,7 +33,7 @@ The parameter `--outmap mapdir` will instruct Woltka to output read-to-taxon map ### Step 2: Combined taxonomic/functional classification -``` +```bash woltka classify \ -i align/bowtie2 \ --coords function/coords.txt.xz \ @@ -69,7 +69,7 @@ With this method, we start with the alignments of reads against reference **gene The following example is also based on the [sample data](../woltka/tests/data). Here we used the gene alignment results computed by DIAMOND. The gene IDs are Prodigal-annotated ORFs from the reference genomes, in the format of `genome ID index` (e.g., `G000123456_20`). With parameter `--trim-sub _`, Woltka converts them to genome IDs, and everything after is straightforward. -``` +```bash woltka classify \ -i align/diamond \ --trim-sub _ \ @@ -84,7 +84,7 @@ woltka classify \ Then, there is no need to provide the coordinates file during functional classification: -``` +```bash woltka classify \ -i align/diamond \ --map function/uniref.map.xz \ diff --git a/doc/wol.md b/doc/wol.md index 43419c2..07838d9 100644 --- a/doc/wol.md +++ b/doc/wol.md @@ -72,27 +72,25 @@ woltka classify \ --output output.biom ``` -## Coordinates-based functional classification using MetaCyc +## Coordinates-based functional classification -Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files. +Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files. The example below uses the MetaCyc system ([see details](metacyc.md)): ```bash mcdir=annotation/metacyc woltka classify \ --input input.sam \ --coords annotation/coords.txt.xz \ - --map annotation/uniref.map.xz \ - --map $mcdir/protein.map --names $mcdir/protein.names \ - --map $mcdir/protein2enzrxn.map --names $mcdir/enzrxn.names \ - --map $mcdir/enzrxn2reaction.map --names $mcdir/reaction.names \ - --map $mcdir/reaction2pathway.map --names $mcdir/pathway.names \ - --map $mcdir/pathway2class.map --names $mcdir/class.names \ + --map $mcdir/wol-to-protein.txt.xz --names $mcdir/protein_name.txt \ + --map $mcdir/protein-to-enzrxn.txt --names $mcdir/enzrxn_name.txt \ + --map $mcdir/enzrxn-to-reaction.txt --names $mcdir/reaction_name.txt \ + --map $mcdir/reaction-to-pathway.txt --names $mcdir/pathway_name.txt \ --map-as-rank \ - --rank protein,enzrxn,reaction,pathway,class \ + --rank protein,enzrxn,reaction,pathway \ --output output_dir ``` -Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](collapse.md#sample-workflow). +Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](metacyc.md). ## Stratified taxonomic / functional classification