layout | title |
---|---|
page |
Cluster categories (an overview) |
The cluster categories:
Knowns with PFAM (Ks): ORF clusters that have been annotated with a PFAM domains of known function.
Knowns without PFAMs (KWPs): clusters that have a known function, but do not contain PFAM annotations.
Genomic Unknowns (GUs): ORF clusters that have an unknown function (e.g. DUF, hypothetical protein) but are found in sequenced or draft-genomes, or in population genomes (or Metagenome Assembled Genomes).
Environmental Unknowns (EUs): ORF clusters of unknown function that are not found in sequenced or draft genomes, but only in environmental metagenomes.
Cluster categories overview.
The following table shows the number of kept genes, gene cluster and cluster communities obtained from the combination of the metagenomic and genomic DBs.
NB: Part of the GTDB clusters were found in the MG cluster communities, the rest was then aggregated in new cluster communities. The combined results are shown in table below.
Cluster and cluster community categories:
K | KWP | GU | EU | Total | |
---|---|---|---|---|---|
Communities | 62,300 | 91,742 | 416,364 | 103,195 | 673,601 |
Clusters | 1,667,510 | 768,859 | 2,647,359 | 204,031 | 5,287,759 |
ORFs | 232,895,994 | 32,930,286 | 68,757,918 | 3,541,592 | 338,125,790 |
We retrieved the percentage of completeness for each cluster based on the percentage of complete ORFs (ORFs labeled by Prodigal [1] with "00" in the gene prediction step).
Using the completness information we retrieved a set of HQ clusters in terms of percentage of complete ORFs and the presence of a complete representative. The cluster representatives are those retrieved during the compositional validation step (see Cluster validation and refinement paragraph). To determine the clusters that are part of the HQ set, we first applied the broken-stick model [3] to determine a minimum required percentage of complete ORFs per cluster. Then, from the set of clusters above the threshold, we selected only the clusters with a complete representative.
High Quality clusters
Category | HQ cluster | HQ ORFs | pHQ_cl | pHQ_orfs |
---|---|---|---|---|
K | 76,718 | 40,710,936 | 0.0145 | 0.120 |
KWP | 16,922 | 1,733,599 | 0.00320 | 0.005132 |
GU | 95,370 | 9,908,630 | 0.0180 | 0.0293 |
EU | 14,207 | 477,625 | 0.00269 | 0.00141 |
Total | 203,217 | 52,830,790 | 0.0384 | 0.1562 |
As shown in the above table, the category with the highest percentage of HQ, i.e. complete, clusters is that of the EUs with 10% HQ clusters, followed by GUs and Ks. The KWPs have the least complete clusters and as showed in the previous section the highest level of (protein) disorder.
The level of darkness is calculated as the percentage of dark, i.e unknown, regions in each ORFs in the clusters, based on the entries of the Dark Proteome Database (DPD), a structural-based database containing information about the molecular conformation of protein regions [2].
Mean level of darkness and disorder for each cluster category, based on the DPD data. The average level per category was obtained calculating the mean of each cluster percentage of darkness and disorder, which is based on the values retrieved for each ORF. We didn't retrieve any darkness information about the EUs (they were not found in the DPD database). The other categories show a degree of darkness inversely proportional to their functional characterisation. The highest level of disorder instead was found in the KWP clusters.
Number of GCs annotated to the DPD per functional category
K | KWP | GU | EU | |
---|---|---|---|---|
Annotated clusters | 237,511 | 7,205 | 8,688 | 0 |
Level of darkness and disorder per category
K | KWP | GU | EU | |
---|---|---|---|---|
Mean darkness | 0.13 | 0.33 | 0.54 | NF |
Mean disorder | 0.050 | 0.071 | 0.062 | NF |
Number of metagenomic clusters and ORFs with taxonomic annotations (MMseqs2)
K | KWP | GU | EU | |
---|---|---|---|---|
Clusters | 1,038,296 (99%) | 607,250 (96%) | 962,929 (86%) | 21,863 (16%) |
ORFs | 145,940,358 (85%) | 26,179,191 (85%) | 41,743,739 (77%) | 529,320 (16%) |
Minimum | Mean | Median | Maximum | SD | |
---|---|---|---|---|---|
Cluster size | 2 | 63.94 | 13 | 168,822 | 477.61 |
Cluster gene length | 20 | 194.64 | 135 | 27,314 | 0.96 |
Cluster completion | 0 | 0.55 | 0.76 | 1 | 0.45 |
Cluster phylum entropy | 0 | 0.32 | 0 | 5.14 | 0.59 |
Cluster darkness | 0 | 0.03 | 0 | 1 | 0.16 |
Cluster size:
Cluster size | Minimum | Mean | Median | Maximum | SD |
---|---|---|---|---|---|
K | 2 | 139.67 | 21 | 168,822 | 829.99 |
KWP | 2 | 42.83 | 17 | 12,339 | 126.72 |
GU | 2 | 25.97 | 8 | 17,624 | 107.62 |
EU | 2 | 17.36 | 12 | 6,196 | 36.05 |
Cluster gene length:
Cluster gene length | Minimum | Mean | Median | Maximum | SD |
---|---|---|---|---|---|
K | 20 | 258.55 | 187 | 21,337 | 0.95 |
KWP | 20 | 133.22 | 93 | 24,979 | 0.96 |
GU | 20 | 177.16 | 124 | 27,314 | 0.96 |
EU | 20 | 130.65 | 96 | 10,373 | 0.96 |
Cluster completion:
Cluster completion | Minimum | Mean | Median | Maximum | SD |
---|---|---|---|---|---|
K | 0 | 0.50 | 0.36 | 1 | 0.44 |
KWP | 0 | 0.22 | 0.013 | 1 | 0.36 |
GU | 0 | 0.68 | 1 | 1 | 0.42 |
EU | 0 | 0.70 | 0.90 | 1 | 0.39 |
Cluster phylum entropy:
Cluster phylum entropy | Minimum | Mean | Median | Maximum | SD |
---|---|---|---|---|---|
K | 0 | 0.53 | 0 | 5.13 | 0.73 |
KWP | 0 | 0.38 | 0 | 5.0 | 0.56 |
GU | 0 | 0.17 | 0 | 4.80 | 0.43 |
EU | 0 | 0.05 | 0 | 2.49 | 0.24 |
Cluster darkness:
Cluster darkness | Minimum | Mean | Median | Maximum | SD |
---|---|---|---|---|---|
K | 0 | 0.13 | 0.05 | 1 | 0.23 |
KWP | 0 | 0.33 | 0.15 | 1 | 0.35 |
GU | 0 | 0.54 | 0.47 | 1 | 0.43 |
EU | NF | NF | NF | NF | NF |
Cluster disorder:
Cluster disorder | Minimum | Mean | Median | Maximum | SD |
---|---|---|---|---|---|
K | 0 | 0.050 | 0.02 | 1 | 0.087 |
KWP | 0 | 0.071 | 0.03 | 1 | 0.012 |
GU | 0 | 0.062 | 0.01 | 1 | 0.012 |
EU | NF | NF | NF | NF | NF |
Mean entropy + SD
Rank | K | KWP | GU | EU | global |
---|---|---|---|---|---|
Domain | 0.14 +0.27 | 0.17 +0.33 | 0.06 +0.22 | 0.03 +0.16 | 0.10 +0.26 |
Phylum | 0.53 +0.73 | 0.38 +0.56 | 0.17 +0.42 | 0.05 +0.24 | 0.31 +0.59 |
Class | 0.67 +0.84 | 0.48 +0.62 | 0.20 +0.47 | 0.05 +0.21 | 0.40 +0.67 |
Order | 0.87 +1.01 | 0.50 +0.66 | 0.28 +0.57 | 0.06 +0.25 | 0.50 +0.80 |
Family | 1.07 +1.17 | 0.62 +0.74 | 0.36 +0.67 | 0.06 +0.25 | 0.63 +0.93 |
Genus | 1.38 +1.47 | 0.68 +0.82 | 0.54 +0.88 | 0.09 +0.30 | 0.83 +1.17 |
Species | 1.67 +1.44 | 1.16 +0.99 | 0.98 +1.05 | 0.21 +0.45 | 1.23 +1.23 |
[1] Hyatt, Doug, Gwo-Liang Chen, Philip F. LoCascio, Miriam L. Land, Frank W. Larimer, and Loren J. Hauser. 2010. “Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification.” BMC Bioinformatics 11 (1): 119–119.
[2] Perdigão, Nelson, Agostinho C. Rosa, and Seán I. O’Donoghue. 2017. “The Dark Proteome Database.” BioData Mining 10 (1): 1–11.
[3] Bennett, K. D. 1996. “Determination of the Number of Zones in a Biostratigraphical Sequence.” The New Phytologist 132 (1): 155–70.