Sorting the unknown

We initially defined four categories of unknowns (there might be more in the future), combining an ecological approach and a protein-domain-based approach to their definition. The categories are defined as follows:

  • KNOWN: Our knowns are all those ORFs that contain a Pfam domain. We are developing an approach to assign function to the unknown ORFs that relies on Domain Co-occurrence Networks and uses Pfam as a basic building block.

  • GENOMIC UNKNOWNS: The first category of unknowns comprises those ORFs with unknown function that are associated with a sequenced organism or with population genomes (aka Metagenome Assembled Genomes).

  • ENVIRONMENTAL UNKNOWNS: The second category of unknowns comprises those ORFs with unknown function that cannot be associated with an organism and are found only in environmental metagenomes.
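
As an illustration only, the decision logic implied by the categories listed above can be sketched in a few lines of Python. This is not the project's pipeline: the `Orf` record, its fields (`pfam_domains`, `in_genome`) and the `categorize` function are hypothetical names introduced here, and the sketch assumes that the Pfam annotation and the genome association of each ORF have already been computed elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    KNOWN = "KNOWN"
    GENOMIC_UNKNOWN = "GENOMIC UNKNOWN"
    ENVIRONMENTAL_UNKNOWN = "ENVIRONMENTAL UNKNOWN"


@dataclass
class Orf:
    """Hypothetical ORF record; field names are assumptions for this sketch."""
    orf_id: str
    pfam_domains: tuple = ()   # Pfam accessions found in the ORF, if any
    in_genome: bool = False    # True if found in a sequenced genome or a MAG


def categorize(orf: Orf) -> Category:
    """Apply the category definitions in order: Pfam domain first, then origin."""
    if orf.pfam_domains:
        return Category.KNOWN
    if orf.in_genome:
        return Category.GENOMIC_UNKNOWN
    return Category.ENVIRONMENTAL_UNKNOWN


# Illustrative records, not real data.
orfs = [
    Orf("orf_1", pfam_domains=("PF00005",), in_genome=True),
    Orf("orf_2", in_genome=True),
    Orf("orf_3"),
]
labels = {orf.orf_id: categorize(orf).value for orf in orfs}
```

The order of the checks matters in this sketch: an ORF carrying a Pfam domain is treated as KNOWN regardless of where it was found, and only ORFs without a domain are split by their genomic or environmental origin.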

cl_categories.png

A bioinformatic workflow to structure the unknown functional space

We have implemented a bioinformatic workflow that partitions genomic and metagenomic datasets into the different categories of KNOWNS and UNKNOWNS.

methodology.png

We start from a de novo clustering of all genomic and environmental genes and continue through a complex pipeline that validates and characterizes the gene clusters. For a more detailed explanation of the pipeline, see how we create the protein clusters, how we validate them, and how we classify them.