Skip to content

Commit

Permalink
added metacyc
Browse files Browse the repository at this point in the history
  • Loading branch information
qiyunzhu committed Feb 7, 2021
1 parent d9e0703 commit f121e23
Show file tree
Hide file tree
Showing 6 changed files with 133 additions and 46 deletions.
30 changes: 0 additions & 30 deletions doc/collapse.md
Original file line number Diff line number Diff line change
Expand Up @@ -64,33 +64,3 @@ Whether to enable normalization depends on the nature and aim of your analysis.
### Feature names

Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").


## Sample workflow

The following example performs functional profiling of shotgun metagenomic data using the [MetaCyc](https://metacyc.org/) database. Compared with the alternative [one-liner solution](wol.md#coordinates-based-functional-classification-using-metacyc), this solution allows one lower unit to be mapped to multiple upper units.

Run main classification workflow once to generate a profile of ORFs:

```bash
woltka classify -i input.sam --coords annotation/coords.txt.xz -o orf.tsv
```

Run profile collapsing multiple times and sequentially, each of which generating one additional profile:

```bash
# ORF to protein
woltka tools collapse -i orf.tsv -m $mcdir/protein.map -n $mcdir/protein.names -o protein.tsv
# protein to gene
woltka tools collapse -i protein.tsv -m $mcdir/protein2gene.map -n $mcdir/gene.names -o gene.tsv
# protein to enzymatic reaction
woltka tools collapse -i protein.tsv -m $mcdir/protein2enzrxn.map -n $mcdir/enzrxn.names -o enzrxn.tsv
# enzymatic reaction to reaction
woltka tools collapse -i enzrxn.tsv -m $mcdir/enzrxn2reaction.map -n $mcdir/reaction.names -o reaction.tsv
# reaction to EC number
woltka tools collapse -i reaction.tsv -m $mcdir/reaction2ec.map -o ec.tsv
# reaction to pathway
woltka tools collapse -i reaction.tsv -m $mcdir/reaction2pathway.map -n $mcdir/pathway.names -o pathway.tsv
# pathway to class
woltka tools collapse -i pathway.tsv -m $mcdir/pathway2class.map -n $mcdir/class.names -o class.tsv
```
2 changes: 1 addition & 1 deletion doc/hierarchy.md
Original file line number Diff line number Diff line change
Expand Up @@ -220,7 +220,7 @@ With flag `--map-as-rank`, Woltka will extract a **rank** name from the filename
- `uniref.map` => `uniref`
- `prot2taxid.txt.gz` => `taxid`
- `reaction_to_pathway.tsv` => `pathway`
- `apple-to-orange` = > `orange`
- `apple-to-orange` => `orange`


## Multiple mapping
Expand Down
119 changes: 119 additions & 0 deletions doc/metacyc.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
# Working with MetaCyc

MetaCyc (https://metacyc.org/) ([Caspi et al 2020](https://academic.oup.com/nar/article/48/D1/D445/5581728)) is a metabolic pathway database that has been widely used in genomic, metagenomic and metabolomic studies. It provides a hierarchical classification system, including genes, proteins, reactions, compounds, pathways and more.


## Contents

- [Mapping files](#mapping-files)
- [Protein profiling](#protein-profiling)
- [Genes, reactions, compounds and pathways](#genes,-reactions,-compounds-and-pathways)
- [Pathway coverage](#pathway-coverage)


## Mapping files

We mapped all ORFs from the [WoL](wol.md) reference genome database to the reference protein sequences in MetaCyc release 23.0. We provide this mapping file, as well as Woltka-compatible mapping and annotation files representing the higher levels in the MetaCyc classification system corresponding to the mapped WoL ORFs. These files are publicly available under the `annotation/metacyc/` directory of the [WoL data release](https://app.globus.org/file-manager/collections/31acbeb8-c62f-11ea-bef9-0e716405a293) ([see details](wol.md)).

We also included a UniRef-to-MetaCyc mapping file, extracted from the UniProt data release and subsetted to WoL. It contains less entries though.

The original, full MetaCyc database is available for download from the [official website](https://metacyc.org/). It includes the reference protein sequences with which one can perform custom alignments.


## Protein profiling

The following command utilizes Woltka's [ordinal classification](ordinal.md) function to classify sequences to **proteins** -- the entry point of the MetaCyc hierarchies.

```bash
woltka classify \
--input input_dir \
--coords coords.txt.xz \
--map wol-to-protein.txt.xz \
--names protein_name.txt \
--map-as-rank \
--rank protein \
--output protein.biom
```

* Note: The `--names` parameter is optional.
* Note: Replacing `.biom` with `.tsv` can generate TSV output.

Alternatively, one can split this command into two:

```bash
woltka classify -i input_dir -c coords.txt.xz -o orf.biom
woltka tools collapse -i orf.biom -m wol-to-protein.txt.xz -n protein_name.txt -o protein.biom
```


## Genes, reactions, compounds and pathways

The MetaCyc hierarcies are as follows:

```
v
go < protein > gene > pathway
v
regulation < enzrxn
v
ec < reaction > compound (left / right) > type
v
type < pathway > taxonomic range
v
super pathway
v
type
```

All transitions are enabled using Woltka's [collapse](collapse.md) command with individual mapping files. For example: one generate profiles along the following cascade:

```
protein - enzrxn - reaction - pathway - super pathway - type
```

Using the following commands:

```bash
# protein to enzrxn (enzymatic reaction):
woltka tools collapse -i protein.biom -m metacyc/protein-to-enzrxn.txt -n metacyc/enzrxn_name.txt -o enzrxn.biom
# enzrxn to reaction:
woltka tools collapse -i enzrxn.biom -m metacyc/enzrxn-to-reaction.txt -n metacyc/reaction_name.txt -o reaction.biom
# reaction to pathway:
woltka tools collapse -i reaction.biom -m metacyc/reaction-to-pathway.txt -n metacyc/pathway_name.txt -o pathway.biom
# pathway to super pathway:
woltka tools collapse -i pathway.biom -m metacyc/pathway-to-super_pathway.txt -n metacyc/pathway_name.txt -o super_pathway.biom
# super pathway (or pathway) to pathway type:
woltka tools collapse -i super_pathway.biom -m metacyc/pathway_type.txt -n metacyc/all_class_name.txt -o pathway_type.biom
```

The collapse command supports **many-to-many** mapping. For example, if one reaction is found in three pathways, each pathway will be counted **once**. In some instances (e.g., to retain **compositionality** of the profile), one may consider adding the `--normalize` flag, which will instruct the program to count each pathway 1 / 3 times ([see details](collapse.md)).


## Pathway coverage

It is usually important to assess how **completed**, in addition to how abundant a metabolic pathway is in a sample. This is because a pathway can only function when all components are present. Woltka's [coverage](coverage.md) command provides a solution ([see details](coverage.md)).

Among the mapping files there are two files recording the composition of metabolic pathways in terms of **genes** and **reactions**. We recommend using the reaction mapping because there are duplicated gene IDs.

```bash
woltka tools coverage -i reaction.biom -m pathway-to-reaction_list.txt -o pathway_coverage.biom
```

The output file is a **coverage** table in which every cell value represents the percentage of member reactions of a particular pathway present in a particular sample.


## Stratifying by taxonomy

Woltka allows overlaying two or more classification schemes on the same profile using the [stratification](stratify.md) function. With this function, one can identify functional genes and metabolic pathways that belong to individual taxonomic groups.

First, perform taxonomic classification at the desired rank (e.g., genus) and generate read-to-genus maps (using parameter `--outmap` or `-u`).

```bash
woltka classify -i input_dir --lineage lineage.txt -r genus -o genus.biom -u map_dir
```

Second, perform functional classification. This command is identical to the first command in this document, except for the addition of `--stratify` or `-t` parameter pointing to the genus maps, which will be incorporated into the functional classes ([see details](stratify.md)).

```bash
woltka classify -i input_dir -c coords.txt.xz -m wol-to-protein.txt.xz --map-as-rank -r protein -t map_dir -o protein.biom
```
2 changes: 1 addition & 1 deletion doc/ordinal.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ Woltka considers a read-to-gene match if their **overlapping** region reaches a

## Example command

With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling (also see [here](collapse.md#sample-workflow)):
With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling:

```bash
woltka classify \
Expand Down
8 changes: 4 additions & 4 deletions doc/stratify.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ Here is an example workflow, based on the commands and [sample data](../woltka/t

### Step 1: Taxonomic classification

```
```bash
woltka classify \
-i align/bowtie2 \
--map taxonomy/g2tid.txt \
Expand All @@ -33,7 +33,7 @@ The parameter `--outmap mapdir` will instruct Woltka to output read-to-taxon map

### Step 2: Combined taxonomic/functional classification

```
```bash
woltka classify \
-i align/bowtie2 \
--coords function/coords.txt.xz \
Expand Down Expand Up @@ -69,7 +69,7 @@ With this method, we start with the alignments of reads against reference **gene

The following example is also based on the [sample data](../woltka/tests/data). Here we used the gene alignment results computed by DIAMOND. The gene IDs are Prodigal-annotated ORFs from the reference genomes, in the format of `genome ID <underscore> index` (e.g., `G000123456_20`). With parameter `--trim-sub _`, Woltka converts them to genome IDs, and everything after is straightforward.

```
```bash
woltka classify \
-i align/diamond \
--trim-sub _ \
Expand All @@ -84,7 +84,7 @@ woltka classify \

Then, there is no need to provide the coordinates file during functional classification:

```
```bash
woltka classify \
-i align/diamond \
--map function/uniref.map.xz \
Expand Down
18 changes: 8 additions & 10 deletions doc/wol.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,27 +72,25 @@ woltka classify \
--output output.biom
```

## Coordinates-based functional classification using MetaCyc
## Coordinates-based functional classification

Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files.
Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files. The example below uses the MetaCyc system ([see details](metacyc.md)):

```bash
mcdir=annotation/metacyc
woltka classify \
--input input.sam \
--coords annotation/coords.txt.xz \
--map annotation/uniref.map.xz \
--map $mcdir/protein.map --names $mcdir/protein.names \
--map $mcdir/protein2enzrxn.map --names $mcdir/enzrxn.names \
--map $mcdir/enzrxn2reaction.map --names $mcdir/reaction.names \
--map $mcdir/reaction2pathway.map --names $mcdir/pathway.names \
--map $mcdir/pathway2class.map --names $mcdir/class.names \
--map $mcdir/wol-to-protein.txt.xz --names $mcdir/protein_name.txt \
--map $mcdir/protein-to-enzrxn.txt --names $mcdir/enzrxn_name.txt \
--map $mcdir/enzrxn-to-reaction.txt --names $mcdir/reaction_name.txt \
--map $mcdir/reaction-to-pathway.txt --names $mcdir/pathway_name.txt \
--map-as-rank \
--rank protein,enzrxn,reaction,pathway,class \
--rank protein,enzrxn,reaction,pathway \
--output output_dir
```

Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](collapse.md#sample-workflow).
Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](metacyc.md).

## Stratified taxonomic / functional classification

Expand Down

0 comments on commit f121e23

Please sign in to comment.