Skip to content

Commit

Permalink
added q2 plugin and doc
Browse files Browse the repository at this point in the history
  • Loading branch information
qiyunzhu committed Feb 6, 2021
1 parent 776f7cc commit f2ef68b
Show file tree
Hide file tree
Showing 21 changed files with 270 additions and 33 deletions.
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,13 +17,15 @@ Woltka ships with a **QIIME 2 plugin**. [See here for instructions](woltka/q2).
- [Overview](#overview)
- [Installation](#installation)
- [Example usage](#example-usage)
- Details
- Main workflow
- [Input files](doc/input.md)
- [Output files](doc/output.md)
- [Classification systems](doc/hierarchy.md)
- [Classification methods](doc/classify.md)
- [Coordinates matching](doc/ordinal.md)
- [Stratification](doc/stratify.md)
- Profile tools
- [Collapsing](doc/collapse.md)
- Tutorials
- [Working with WoL](doc/wol.md)
- [gOTU analysis](doc/gotu.md)
Expand Down
14 changes: 13 additions & 1 deletion doc/cli.md
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,7 @@ Option | Description

### Filter

Filter a profile by per-sample abundance.
Filter a profile by **per-sample** abundance.

Option | Description
--- | ---
Expand All @@ -106,3 +106,15 @@ Option | Description
--- | ---
`--input`, `-i` (required) | Path to input profiles or directories containing profiles. Can accept multiple paths.
`--output`, `-o` (required) | Path to output profile.

### Collapse

Collapse a profile based on feature mapping (supports **many-to-many** mapping).

Option | Description
--- | ---
`--input`, `-i` (required) | Path to input profile.
`--map`, `-m` (required) | Path to mapping of source features to target features.
`--output`, `-o` (required) | Path to output profile.
`--normalize`, `-z` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
95 changes: 95 additions & 0 deletions doc/collapse.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
# Profile collapsing

The **profile collapsing** function is a lightweight and flexible addition to the main classification workflow. It allows the user to convert an _existing_ profile based on a mapping of source features to target features. It highlights the support for **many-to-many mapping**.

```bash
woltka tools collapse -i input.biom -m mapping.txt -o output.biom
```

With this tool one can achieve the following goals:

1. Translate feature IDs into names or descriptions.
..* Example: Translate taxonomic IDs to taxon names.
..* Example: Translate [UniRef](https://www.uniprot.org/help/uniref) IDs to protein names, while **merging** same names.

2. Group lower features into higher categories.
..* Example: Convert genera into families.

3. Convert lower features into higher ones, where each lower feature may correspond to **multiple** higher features.
..* Example: Convert KEGG [orthologs](https://www.genome.jp/kegg/ko.html) to [pathways](https://www.genome.jp/kegg/pathway.html).
..* Example: Convert [GO](http://geneontology.org/docs/ontology-documentation/) terms to [GO Slim](http://www-legacy.geneontology.org/GO.slims.shtml) terms.

The last usage is an important complement to the main classification workflow, which currently relies on a tree structure and does not support one-to-many mapping. This can be achieved by using the profile collapsing function (although one can only move up one level per run).


## Mapping file format

A mapping file is a text file with entries separated by tabs. The number of fields per line is _arbitrary_. The first field is the source feature ID. The second to last fields are target feature ID(s). Duplicates in sources or targets are allowed. For examples:

1. One/many-to-one:
```
source1 <tab> target1
source2 <tab> target2
source3 <tab> target2
source4 <tab> target3
...
```

2. Many-to-many (multiple targets per line):
```
source1 <tab> target1
source2 <tab> target1 <tab> target2
source3 <tab> target2 <tab> target3 <tab> target4
...
```

3. Many-to-many (multiple same sources):
```
source1 <tab> target1
source1 <tab> target2
source2 <tab> target2
source3 <tab> target3
source4 <tab> target3
...
```

## Normalization

By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--normalize` or `-z` flag added to the command, each target will be counted 1 / _k_ times.

Whether to enable normalization depends on the nature and aim of your analysis. For example, one gene is involved in two pathways (which isn't uncommon), should each pathway be counted once, or half time?


## Feature names

Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will not be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").


## Sample workflow

The following example performs functional profiling of shotgun metagenomic data using the [MetaCyc](https://metacyc.org/) database. Compared with the alternative [one-liner solution](wol.md#coordinates-based-functional-classification-using-metacyc), this solution allows one lower unit to be mapped to multiple upper units.

Run main classification workflow once to generate a profile of ORFs:

```bash
woltka classify -i input.sam --coords annotation/coords.txt.xz -o orf.tsv
```

Run profile collapsing multiple times and sequentially, each of which generating one additional profile:

```bash
# ORF to protein
woltka tools collapse -i orf.tsv -m $mcdir/protein.map -n $mcdir/protein.names -o protein.tsv
# protein to gene
woltka tools collapse -i protein.tsv -m $mcdir/protein2gene.map -n $mcdir/gene.names -o gene.tsv
# protein to enzymatic reaction
woltka tools collapse -i protein.tsv -m $mcdir/protein2enzrxn.map -n $mcdir/enzrxn.names -o enzrxn.tsv
# enzymatic reaction to reaction
woltka tools collapse -i enzrxn.tsv -m $mcdir/enzrxn2reaction.map -n $mcdir/reaction.names -o reaction.tsv
# reaction to EC number
woltka tools collapse -i reaction.tsv -m $mcdir/reaction2ec.map -o ec.tsv
# reaction to pathway
woltka tools collapse -i reaction.tsv -m $mcdir/reaction2pathway.map -n $mcdir/pathway.names -o pathway.tsv
# pathway to class
woltka tools collapse -i pathway.tsv -m $mcdir/pathway2class.map -n $mcdir/class.names -o class.tsv
```
6 changes: 6 additions & 0 deletions doc/hierarchy.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ That being said, Woltka still supports ranked hierarchies and one can instruct t
- [Taxon name dictionary](#taxon-name-dictionary)
- [How Woltka handles hierarchies](#how-woltka-handles-hierarchies)
- [Hierarchy file specs](#hierarchy-file-specifications)
- [Multiple mapping](#multiple-mapping)


## Supported hierarchy files
Expand Down Expand Up @@ -220,3 +221,8 @@ With flag `--map-as-rank`, Woltka will extract a **rank** name from the filename
- `prot2taxid.txt.gz` => `taxid`
- `reaction_to_pathway.tsv` => `pathway`
- `apple-to-orange` = > `orange`


## Multiple mapping

As explained above, Woltka's main classification workflow relies on a tree-structured hierarchy, i.e., a classification unit can only point to one parent classification unit. However, in some scenarios one may want to perform one-to-many mapping. Examples include functional analyses, where one gene may be involved in multiple pathways. Woltka provides a [profile collapsing](collapse.md) tool to enable this analysis ([see details](collapse.md)).
1 change: 0 additions & 1 deletion doc/input.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,6 @@ The input files for Woltka are sequence **alignment** files. The term "alignment
- [Demultiplexing](#demultiplexing)
- [Subject trimming](#subject-trimming)


## Input filepath

Parameter `--input` or `-i` is to let Woltka know where to find the input alignment file(s). It can be any of the following three scenarios:
Expand Down
1 change: 0 additions & 1 deletion doc/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@

There is no restriction as far as we are aware of. Tested and working on Linux, macOS and Windows systems.


## Software environment

Woltka is written in **Python 3**. One needs at least Python 3.6 to run the program.
Expand Down
2 changes: 1 addition & 1 deletion doc/ordinal.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ Woltka considers a read-to-gene match if their **overlapping** region reaches a

## Example command

With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling:
With the coordinates file, one can streamline the read-to-gene matching step into a Woltka protocol. Here is an example for functional profiling (also see [here](collapse.md#sample-workflow)):

```bash
woltka classify \
Expand Down
11 changes: 10 additions & 1 deletion doc/output.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Output files

The main output file(s) of Woltka are **feature tables**, in which column headers are sample IDs and row headers are feature IDs, and each cell value represents the _count_ of a particular feature in a particular sample. For example:
The main output file(s) of Woltka are **feature tables** (a.k.a., **profiles**), in which column headers are sample IDs and row headers are feature IDs, and each cell value represents the _count_ of a particular feature in a particular sample. For example:

Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Name | Rank
--- | --- | --- | --- | --- | --- | ---
Expand All @@ -15,6 +15,7 @@ Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Name | Rank
- [Output filepath](#output-filepath)
- [File formats](#file-formats)
- [Output read maps](#output-read-maps)
- [Table utilities](#table-utilities)

## Output filepath

Expand Down Expand Up @@ -132,3 +133,11 @@ woltka classify \
```

In `outmap_dir`, there will be three subdirectories: `phylum`, `genus` and `species`, eaching holding three read map files: `S1.txt.xz`, `S2.txt.xz` and `S3.txt.xz`.

## Table utilities

Woltka provides several utilities under the `tools` menu for table manipulation (both BIOM and TSV are supported and automatically recognized). Here are details:

- [**collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping; supporting many-to-many relationships.
- **filter**: Filter a profile by per-sample abundance.
- **merge**: Merge multiple profiles into one profile.
4 changes: 4 additions & 0 deletions doc/wol.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,8 @@ woltka classify \

## Coordinates-based functional classification using MetaCyc

Associate read alignments with ORFs, and move up multiple functional levels (protein, reaction, pathway...) using a cascade of mapping files.

```bash
mcdir=annotation/metacyc
woltka classify \
Expand All @@ -90,6 +92,8 @@ woltka classify \
--output output_dir
```

Note: This command won't handle multiple mapping (e.g., one protein involved in three pathways). A capable solution is provided [here](collapse.md#sample-workflow).

## Stratified taxonomic / functional classification

Say, you want to stratify functional annotations by genus (taxonomy). First, run taxonomic classification at the genus level, and export read-to-genus maps:
Expand Down
9 changes: 4 additions & 5 deletions woltka/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -243,20 +243,19 @@ def merge_cmd(ctx, **kwargs):
@click.option(
'--map', '-m', 'map_fp', required=True,
type=click.Path(exists=True, dir_okay=False),
help=('Mapping of lower classification units to higher ones (supports '
help=('Mapping of source features to target features. (supports '
'many-to-many relationships).'))
@click.option(
'--output', '-o', 'output_fp', required=True,
type=click.Path(writable=True, dir_okay=False),
help='Path to output profile.')
@click.option(
'--normalize', '-z', is_flag=True,
help=('Count each higher classification unit as 1/k (k is the number of '
'higher classification units mapped to a lower one). Otherwise, '
'count as one.'))
help=('Count each target feature as 1/k (k is the number of targets '
'mapped to a source). Otherwise, count as one.'))
@click.option(
'--names', '-n', 'names_fp', type=click.Path(exists=True),
help='Names of higher classification units to append to output profile.')
help='Names of target features to append to the output profile.')
@click.pass_context
def collapse_cmd(ctx, **kwargs):
"""Collapse a profile based on feature mapping.
Expand Down
40 changes: 40 additions & 0 deletions woltka/file.py
Original file line number Diff line number Diff line change
Expand Up @@ -424,6 +424,46 @@ def read_map_all(fh, sep='\t'):
yield key, rest.rstrip().split(sep)


def read_map_many(fh, sep='\t'):
"""Read many-to-many relationships from a mapping file.
Parameters
----------
fh : file handle
Mapping file.
Returns
-------
dict of list
Key-value(s) pair.
Notes
-----
This is a general solution which applies to the following formats:
1 (one-to-one):
source1 <tab> target1
source2 <tab> target2
source3 <tab> target3
...
2 (one-to-many):
source1 <tab> target1
source2 <tab> target1 <tab> target2
source3 <tab> target2 <tab> target3 <tab> target4
...
3 (many-to-many):
source1 <tab> target1
source1 <tab> target2
source2 <tab> target2
source3 <tab> target3
source4 <tab> target3
...
"""
res = {}
for key, values in read_map_all(fh, sep):
res.setdefault(key, []).extend(values)
return res


def write_readmap(fh, qryque, taxque, namedic=None):
"""Write a read map to a tab-delimited file.
Expand Down
4 changes: 2 additions & 2 deletions woltka/q2/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -133,10 +133,10 @@ qiime diversity core-metrics-phylogenetic \
--output-dir .
```

Before moving to the next step (such as the command above), we recommend that you think about whether you want to filter the feature table by per-sample abundance. For example:
Before moving to the next step (such as the command above), it is recommended to consider **filtering** the feature table by per-sample abundance. For example:

```bash
qiime woltka filter-table \
qiime woltka filter \
--i-table table.qza \
--min-percent 0.01 \
--o-filtered-table filtered.qza
Expand Down
2 changes: 1 addition & 1 deletion woltka/q2/_format.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ def _validate_(self, level):


SimpleMapDirFmt = model.SingleFileDirectoryFormat(
'SimpleMapDirFmt', 'alignment.map', SimpleMapFormat)
'SimpleMapDirFmt', 'mapping.txt', SimpleMapFormat)


class NCBINodesFormat(model.TextFileFormat):
Expand Down
2 changes: 1 addition & 1 deletion woltka/q2/_transformer.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ def _2(ff: BLAST6OutDirFmt) -> str:

@plugin.register_transformer
def _3(ff: SimpleMapDirFmt) -> str:
return join(str(ff.path), 'alignment.map')
return join(str(ff.path), 'mapping.txt')


@plugin.register_transformer
Expand Down
22 changes: 17 additions & 5 deletions woltka/q2/plugin.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,9 +25,9 @@

from woltka.workflow import build_mapper, classify as cwf
from woltka.align import plain_mapper
from woltka.file import read_map_1st
from woltka.file import read_map_1st, read_map_many
from woltka.table import prep_table
from woltka.biom import table_to_biom, filter_biom
from woltka.biom import table_to_biom, filter_biom, collapse_biom
from woltka.tree import read_nodes, read_lineage, read_newick, fill_root
from woltka.__init__ import __name__, __version__

Expand Down Expand Up @@ -124,9 +124,9 @@ def classify(alignment: str,
return table


def filter_table(table: biom.Table,
min_count: int = None,
min_percent: float = None) -> biom.Table:
def psfilter(table: biom.Table,
min_count: int = None,
min_percent: float = None) -> biom.Table:
"""Filter a feature table by per-sample abundance of features.
"""
# validate parameters
Expand All @@ -147,3 +147,15 @@ def filter_table(table: biom.Table,
table.generated_by = f'{__name__}-{__version__}'

return table


def collapse(table: biom.Table,
mapping: str,
normalize: bool = False) -> biom.Table:
"""Collapse a feature table based on many-to-many mapping.
"""
with open(mapping, 'r') as fh:
mapping = read_map_many(fh)
table = collapse_biom(table, mapping, normalize)
table.generated_by = f'{__name__}-{__version__}'
return table
Loading

0 comments on commit f2ef68b

Please sign in to comment.