layout | title |
---|---|
page | Sorting the unknown |
We initially defined four categories of unknowns (more may be added in the future), combining an ecological approach and a protein-domain-based approach to define them. The categories are defined as follows:
- KNOWN: Our knowns are all ORFs that contain a Pfam domain. We are developing an approach to assign function to the unknown ORFs that relies on Domain Co-occurrence Networks and uses Pfam as a basic building block.
- GENOMIC UNKNOWNS: The first category of unknowns comprises ORFs of unknown function that are associated with a sequenced organism or with population genomes (also known as Metagenome Assembled Genomes).
- ENVIRONMENTAL UNKNOWNS: The second category of unknowns comprises ORFs of unknown function that cannot be associated with an organism and are found only in environmental metagenomes.
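
As a rough illustration, the decision rule behind the three categories described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the code of our workflow: the `Orf` record, its fields `pfam_domains` and `in_genome`, and the `classify` function are hypothetical names introduced only for this example, and the Pfam accession in the usage lines is purely illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Category(Enum):
    KNOWN = "KNOWN"
    GENOMIC_UNKNOWN = "GENOMIC UNKNOWN"
    ENVIRONMENTAL_UNKNOWN = "ENVIRONMENTAL UNKNOWN"


@dataclass
class Orf:
    """Hypothetical ORF record; the field names are assumptions."""
    orf_id: str
    # Pfam accessions annotated on the ORF (empty if none were found)
    pfam_domains: Tuple[str, ...] = ()
    # True if the ORF is linked to a sequenced organism or a MAG
    in_genome: bool = False


def classify(orf: Orf) -> Category:
    """Assign one of the three categories described in the text."""
    # Any Pfam domain makes the ORF a KNOWN, whatever its origin.
    if orf.pfam_domains:
        return Category.KNOWN
    # No Pfam domain, but associated with a genome or MAG.
    if orf.in_genome:
        return Category.GENOMIC_UNKNOWN
    # No Pfam domain and seen only in environmental metagenomes.
    return Category.ENVIRONMENTAL_UNKNOWN


if __name__ == "__main__":
    examples = [
        Orf("orf_1", pfam_domains=("PF00001",)),
        Orf("orf_2", in_genome=True),
        Orf("orf_3"),
    ]
    for orf in examples:
        print(orf.orf_id, classify(orf).value)
```

The Pfam check comes first because the definition of KNOWN depends only on the presence of a domain; the genomic versus environmental distinction applies only to ORFs without one.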
We have implemented a bioinformatic workflow that partitions genomic and metagenomic datasets into the different categories of KNOWNS and UNKNOWNS.
We start from a de novo clustering of all genomic and environmental genes and continue through a multi-step pipeline that validates and characterizes the gene clusters. For a more detailed explanation of the pipeline, see how we create the protein clusters, how we validate them, and how we classify them.