Skip to content

Latest commit

 

History

History
237 lines (148 loc) · 10.7 KB

17.1_Cluster_and_communities_overview.md

File metadata and controls

237 lines (148 loc) · 10.7 KB
layout title
page
Cluster categories (an overview)

The cluster categories:

Knowns with PFAM (Ks): ORF clusters that have been annotated with a PFAM domains of known function.

Knowns without PFAMs (KWPs): clusters that have a known function, but do not contain PFAM annotations.

Genomic Unknowns (GUs): ORF clusters that have an unknown function (e.g. DUF, hypothetical protein) but are found in sequenced or draft-genomes, or in population genomes (or Metagenome Assembled Genomes).

Environmental Unknowns (EUs): ORF clusters of unknown function that are not found in sequenced or draft genomes, but only in environmental metagenomes.

cl_categories.png

Cluster categories overview.

Gene clusters and cluster communities

The following table shows the number of kept genes, gene cluster and cluster communities obtained from the combination of the metagenomic and genomic DBs.

NB: Part of the GTDB clusters were found in the MG cluster communities, the rest was then aggregated in new cluster communities. The combined results are shown in table below.

Cluster and cluster community categories:

K KWP GU EU Total
Communities 62,300 91,742 416,364 103,195 673,601
Clusters 1,667,510 768,859 2,647,359 204,031 5,287,759
ORFs 232,895,994 32,930,286 68,757,918 3,541,592 338,125,790

Cluster category main statisitcs

Cluster length
Cluster size
Cluster completeness

We retrieved the percentage of completeness for each cluster based on the percentage of complete ORFs (ORFs labeled by Prodigal [1] with "00" in the gene prediction step).

High quality (HQ) set of clusters

Using the completness information we retrieved a set of HQ clusters in terms of percentage of complete ORFs and the presence of a complete representative. The cluster representatives are those retrieved during the compositional validation step (see Cluster validation and refinement paragraph). To determine the clusters that are part of the HQ set, we first applied the broken-stick model [3] to determine a minimum required percentage of complete ORFs per cluster. Then, from the set of clusters above the threshold, we selected only the clusters with a complete representative.

High Quality clusters

Category HQ cluster HQ ORFs pHQ_cl pHQ_orfs
K 76,718 40,710,936 0.0145 0.120
KWP 16,922 1,733,599 0.00320 0.005132
GU 95,370 9,908,630 0.0180 0.0293
EU 14,207 477,625 0.00269 0.00141
Total 203,217 52,830,790 0.0384 0.1562

As shown in the above table, the category with the highest percentage of HQ, i.e. complete, clusters is that of the EUs with 10% HQ clusters, followed by GUs and Ks. The KWPs have the least complete clusters and as showed in the previous section the highest level of (protein) disorder.

Level of darkness and disorder

The level of darkness is calculated as the percentage of dark, i.e unknown, regions in each ORFs in the clusters, based on the entries of the Dark Proteome Database (DPD), a structural-based database containing information about the molecular conformation of protein regions [2].

Mean level of darkness and disorder for each cluster category, based on the DPD data. The average level per category was obtained calculating the mean of each cluster percentage of darkness and disorder, which is based on the values retrieved for each ORF. We didn't retrieve any darkness information about the EUs (they were not found in the DPD database). The other categories show a degree of darkness inversely proportional to their functional characterisation. The highest level of disorder instead was found in the KWP clusters.

Number of GCs annotated to the DPD per functional category

K KWP GU EU
Annotated clusters 237,511 7,205 8,688 0

Level of darkness and disorder per category

K KWP GU EU
Mean darkness 0.13 0.33 0.54 NF
Mean disorder 0.050 0.071 0.062 NF

Taxonomy (and cluster taxonomic homogeneity)

Number of metagenomic clusters and ORFs with taxonomic annotations (MMseqs2)

K KWP GU EU
Clusters 1,038,296 (99%) 607,250 (96%) 962,929 (86%) 21,863 (16%)
ORFs 145,940,358 (85%) 26,179,191 (85%) 41,743,739 (77%) 529,320 (16%)

5. General cluster statistics

Minimum Mean Median Maximum SD
Cluster size 2 63.94 13 168,822 477.61
Cluster gene length 20 194.64 135 27,314 0.96
Cluster completion 0 0.55 0.76 1 0.45
Cluster phylum entropy 0 0.32 0 5.14 0.59
Cluster darkness 0 0.03 0 1 0.16

6. General cluster statistics grouped by cluster category

Cluster size:

Cluster size Minimum Mean Median Maximum SD
K 2 139.67 21 168,822 829.99
KWP 2 42.83 17 12,339 126.72
GU 2 25.97 8 17,624 107.62
EU 2 17.36 12 6,196 36.05

Cluster gene length:

Cluster gene length Minimum Mean Median Maximum SD
K 20 258.55 187 21,337 0.95
KWP 20 133.22 93 24,979 0.96
GU 20 177.16 124 27,314 0.96
EU 20 130.65 96 10,373 0.96

Cluster completion:

Cluster completion Minimum Mean Median Maximum SD
K 0 0.50 0.36 1 0.44
KWP 0 0.22 0.013 1 0.36
GU 0 0.68 1 1 0.42
EU 0 0.70 0.90 1 0.39

Cluster phylum entropy:

Cluster phylum entropy Minimum Mean Median Maximum SD
K 0 0.53 0 5.13 0.73
KWP 0 0.38 0 5.0 0.56
GU 0 0.17 0 4.80 0.43
EU 0 0.05 0 2.49 0.24

Cluster darkness:

Cluster darkness Minimum Mean Median Maximum SD
K 0 0.13 0.05 1 0.23
KWP 0 0.33 0.15 1 0.35
GU 0 0.54 0.47 1 0.43
EU NF NF NF NF NF

Cluster disorder:

Cluster disorder Minimum Mean Median Maximum SD
K 0 0.050 0.02 1 0.087
KWP 0 0.071 0.03 1 0.012
GU 0 0.062 0.01 1 0.012
EU NF NF NF NF NF

7. Taxonomic entropy summary

Mean entropy + SD

Rank K KWP GU EU global
Domain 0.14 +0.27 0.17 +0.33 0.06 +0.22 0.03 +0.16 0.10 +0.26
Phylum 0.53 +0.73 0.38 +0.56 0.17 +0.42 0.05 +0.24 0.31 +0.59
Class 0.67 +0.84 0.48 +0.62 0.20 +0.47 0.05 +0.21 0.40 +0.67
Order 0.87 +1.01 0.50 +0.66 0.28 +0.57 0.06 +0.25 0.50 +0.80
Family 1.07 +1.17 0.62 +0.74 0.36 +0.67 0.06 +0.25 0.63 +0.93
Genus 1.38 +1.47 0.68 +0.82 0.54 +0.88 0.09 +0.30 0.83 +1.17
Species 1.67 +1.44 1.16 +0.99 0.98 +1.05 0.21 +0.45 1.23 +1.23

References

[1] Hyatt, Doug, Gwo-Liang Chen, Philip F. LoCascio, Miriam L. Land, Frank W. Larimer, and Loren J. Hauser. 2010. “Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification.” BMC Bioinformatics 11 (1): 119–119.

[2] Perdigão, Nelson, Agostinho C. Rosa, and Seán I. O’Donoghue. 2017. “The Dark Proteome Database.” BioData Mining 10 (1): 1–11.

[3] Bennett, K. D. 1996. “Determination of the Number of Zones in a Biostratigraphical Sequence.” The New Phytologist 132 (1): 155–70.