layout	title
page	Cluster categories (an overview)

The cluster categories:

Knowns with PFAM (Ks): ORF clusters that have been annotated with PFAM domains of known function.

Knowns without PFAMs (KWPs): clusters that have a known function, but do not contain PFAM annotations.

Genomic Unknowns (GUs): ORF clusters of unknown function (e.g. DUF, hypothetical protein) that are found in sequenced or draft genomes, or in population genomes (Metagenome Assembled Genomes).

Environmental Unknowns (EUs): ORF clusters of unknown function that are not found in sequenced or draft genomes, but only in environmental metagenomes.

Cluster categories overview.
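The four definitions above amount to a short decision rule. The following is a minimal sketch of that rule, not the pipeline's actual implementation; the function name and the three boolean inputs are hypothetical and assume the relevant annotations have already been computed for each cluster.

```python
def cluster_category(has_known_pfam: bool,
                     has_known_function: bool,
                     found_in_genomes: bool) -> str:
    """Assign a cluster to K, KWP, GU or EU following the definitions above.

    All three flags are hypothetical inputs used for illustration only.
    """
    if has_known_pfam:
        # Annotated with a PFAM domain of known function.
        return "K"
    if has_known_function:
        # Known function, but no PFAM annotation.
        return "KWP"
    if found_in_genomes:
        # Unknown function, but seen in sequenced, draft or population genomes.
        return "GU"
    # Unknown function, seen only in environmental metagenomes.
    return "EU"


print(cluster_category(False, False, True))
```

The order of the checks matters: a cluster carrying only a domain of unknown function (DUF) should arrive with `has_known_pfam=False` so that it falls through to the GU or EU branch.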

Gene clusters and cluster communities

The following table shows the number of kept genes, gene clusters and cluster communities obtained from the combination of the metagenomic and genomic DBs.

NB: Part of the GTDB clusters were found in the MG cluster communities; the rest were aggregated into new cluster communities. The combined results are shown in the table below.

Cluster and cluster community categories:

	K	KWP	GU	EU	Total
Communities	62,300	91,742	416,364	103,195	673,601
Clusters	1,667,510	768,859	2,647,359	204,031	5,287,759
ORFs	232,895,994	32,930,286	68,757,918	3,541,592	338,125,790

Cluster category main statistics

Cluster length

Cluster size

Cluster completeness

We retrieved the percentage of completeness for each cluster based on the percentage of complete ORFs (ORFs labeled by Prodigal [1] with "00" in the gene prediction step).
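As an illustration of this calculation, the sketch below computes the completeness of one cluster from the Prodigal `partial` flags of its ORFs, where "00" marks a gene that is complete at both ends. The function name and the input layout (one flag string per ORF) are assumptions made for the example; the value is returned as a fraction between 0 and 1.

```python
def cluster_completeness(partial_flags):
    """Return the fraction of ORFs in a cluster that Prodigal flagged as "00".

    partial_flags: one Prodigal `partial` value per ORF, e.g. "00", "10",
    "01" or "11" (hypothetical input layout).
    """
    if not partial_flags:
        raise ValueError("a cluster must contain at least one ORF")
    complete = sum(1 for flag in partial_flags if flag == "00")
    return complete / len(partial_flags)


# Two complete ORFs out of four.
print(cluster_completeness(["00", "00", "10", "01"]))
```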

High quality (HQ) set of clusters

Using the completeness information we retrieved a set of HQ clusters in terms of percentage of complete ORFs and the presence of a complete representative. The cluster representatives are those retrieved during the compositional validation step (see the Cluster validation and refinement paragraph). To determine the clusters that are part of the HQ set, we first applied the broken-stick model [3] to determine a minimum required percentage of complete ORFs per cluster. Then, from the set of clusters above the threshold, we selected only the clusters with a complete representative.
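The two steps can be sketched as follows. The first function gives the standard broken-stick expectation (the expected share of the k-th largest of n pieces); how exactly the pipeline derives the completeness threshold from it is not described here, so the second function simply takes the threshold as a parameter. The function names, the dictionary keys and the example threshold are all hypothetical.

```python
def broken_stick(n):
    """Expected proportions of the n pieces of a randomly broken stick,
    from largest to smallest: b_k = (1/n) * sum_{i=k..n} 1/i."""
    return [sum(1.0 / i for i in range(k, n + 1)) / n for k in range(1, n + 1)]


def select_hq(clusters, min_complete):
    """Keep clusters at or above the completeness threshold that also have a
    complete representative. Keys "id", "completeness" and "rep_complete"
    are assumed for illustration."""
    return [
        c["id"]
        for c in clusters
        if c["completeness"] >= min_complete and c["rep_complete"]
    ]


clusters = [
    {"id": "cl_1", "completeness": 0.95, "rep_complete": True},
    {"id": "cl_2", "completeness": 0.95, "rep_complete": False},
    {"id": "cl_3", "completeness": 0.40, "rep_complete": True},
]
print(broken_stick(3))
print(select_hq(clusters, min_complete=0.9))  # 0.9 is an arbitrary example value
```

Only `cl_1` passes both filters in this toy input: `cl_2` lacks a complete representative and `cl_3` falls below the example threshold.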

High Quality clusters

Category	HQ cluster	HQ ORFs	pHQ_cl	pHQ_orfs
K	76,718	40,710,936	0.0145	0.120
KWP	16,922	1,733,599	0.00320	0.005132
GU	95,370	9,908,630	0.0180	0.0293
EU	14,207	477,625	0.00269	0.00141
Total	203,217	52,830,790	0.0384	0.1562
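The pHQ_cl and pHQ_orfs columns appear to be fractions of the overall cluster and ORF totals reported earlier (5,287,759 clusters and 338,125,790 ORFs), rather than of each category's own total. The short check below recomputes them from the counts in the tables under that assumption; the variable and function names are illustrative.

```python
# Overall totals from the "Cluster and cluster community categories" table.
TOTAL_CLUSTERS = 5_287_759
TOTAL_ORFS = 338_125_790

# (HQ clusters, HQ ORFs) per category, copied from the table above.
HQ_COUNTS = {
    "K": (76_718, 40_710_936),
    "KWP": (16_922, 1_733_599),
    "GU": (95_370, 9_908_630),
    "EU": (14_207, 477_625),
}


def hq_fractions(hq_counts, total_clusters, total_orfs):
    """Return {category: (pHQ_cl, pHQ_orfs)} relative to the overall totals."""
    return {
        cat: (n_cl / total_clusters, n_orfs / total_orfs)
        for cat, (n_cl, n_orfs) in hq_counts.items()
    }


for cat, (p_cl, p_orfs) in hq_fractions(HQ_COUNTS, TOTAL_CLUSTERS, TOTAL_ORFS).items():
    print(f"{cat}\t{p_cl:.4f}\t{p_orfs:.4f}")
```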

As shown in the table above, the category with the highest percentage of HQ (i.e. complete) clusters is the EUs, with 10% HQ clusters, followed by the GUs and the Ks. The KWPs have the fewest complete clusters and, as shown in the Level of darkness and disorder section, the highest level of (protein) disorder.

Level of darkness and disorder

The level of darkness is calculated as the percentage of dark (i.e. unknown) regions in each ORF in the clusters, based on the entries of the Dark Proteome Database (DPD), a structure-based database containing information about the molecular conformation of protein regions [2].

Mean level of darkness and disorder for each cluster category, based on the DPD data. The average level per category was obtained by calculating the mean of the per-cluster percentages of darkness and disorder, which are in turn based on the values retrieved for each ORF. We did not retrieve any darkness information for the EUs (they were not found in the DPD). In the other categories, darkness increases as functional characterisation decreases (K < KWP < GU). The highest level of disorder, instead, was found in the KWP clusters.
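The averaging described here is a mean of means: ORF values are first averaged within each cluster, and the cluster values are then averaged within the category. A minimal sketch, assuming each cluster is given as a list of per-ORF darkness (or disorder) values between 0 and 1; the function name and input layout are hypothetical.

```python
from statistics import mean


def category_mean(clusters):
    """Mean of per-cluster means.

    clusters: list of clusters, each a list of per-ORF values
    (hypothetical layout). Clusters without any DPD value are skipped.
    """
    return mean(mean(orf_values) for orf_values in clusters if orf_values)


# A large, mostly non-dark cluster and a small, fully dark one.
toy_category = [[0.0, 0.0, 0.0, 1.0], [1.0]]
print(category_mean(toy_category))
```

With this scheme every cluster carries the same weight regardless of its size, so the result differs from pooling all ORFs of a category together.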

Number of GCs annotated to the DPD per functional category

	K	KWP	GU	EU
Annotated clusters	237,511	7,205	8,688	0

Level of darkness and disorder per category

	K	KWP	GU	EU
Mean darkness	0.13	0.33	0.54	NF
Mean disorder	0.050	0.071	0.062	NF

Taxonomy (and cluster taxonomic homogeneity)

Number of metagenomic clusters and ORFs with taxonomic annotations (MMseqs2)

	K	KWP	GU	EU
Clusters	1,038,296 (99%)	607,250 (96%)	962,929 (86%)	21,863 (16%)
ORFs	145,940,358 (85%)	26,179,191 (85%)	41,743,739 (77%)	529,320 (16%)

5. General cluster statistics

	Minimum	Mean	Median	Maximum	SD
Cluster size	2	63.94	13	168,822	477.61
Cluster gene length	20	194.64	135	27,314	0.96
Cluster completion	0	0.55	0.76	1	0.45
Cluster phylum entropy	0	0.32	0	5.14	0.59
Cluster darkness	0	0.03	0	1	0.16

6. General cluster statistics grouped by cluster category

Cluster size:

Cluster size	Minimum	Mean	Median	Maximum	SD
K	2	139.67	21	168,822	829.99
KWP	2	42.83	17	12,339	126.72
GU	2	25.97	8	17,624	107.62
EU	2	17.36	12	6,196	36.05

Cluster gene length:

Cluster gene length	Minimum	Mean	Median	Maximum	SD
K	20	258.55	187	21,337	0.95
KWP	20	133.22	93	24,979	0.96
GU	20	177.16	124	27,314	0.96
EU	20	130.65	96	10,373	0.96

Cluster completion:

Cluster completion	Mean	Median	Maximum	SD
K	0.50	0.36	1	0.44
KWP	0.22	0.013	1	0.36
GU	0.68	1	1	0.42
EU	0.70	0.90	1	0.39

Cluster phylum entropy:

Cluster phylum entropy	Mean	Maximum	SD
K	0.53	5.13	0.73
KWP	0.38	5.0	0.56
GU	0.17	4.80	0.43
EU	0.05	2.49	0.24

Cluster darkness:

Cluster darkness	Minimum	Mean	Median	Maximum	SD
K	0	0.13	0.05	1	0.23
KWP	0	0.33	0.15	1	0.35
GU	0	0.54	0.47	1	0.43
EU	NF	NF	NF	NF	NF

Cluster disorder:

Cluster disorder	Minimum	Mean	Median	Maximum	SD
K	0	0.050	0.02	1	0.087
KWP	0	0.071	0.03	1	0.012
GU	0	0.062	0.01	1	0.012
EU	NF	NF	NF	NF	NF

7. Taxonomic entropy summary

Mean entropy ± SD

Rank	K	KWP	GU	EU	global
Domain	0.14 ± 0.27	0.17 ± 0.33	0.06 ± 0.22	0.03 ± 0.16	0.10 ± 0.26
Phylum	0.53 ± 0.73	0.38 ± 0.56	0.17 ± 0.42	0.05 ± 0.24	0.31 ± 0.59
Class	0.67 ± 0.84	0.48 ± 0.62	0.20 ± 0.47	0.05 ± 0.21	0.40 ± 0.67
Order	0.87 ± 1.01	0.50 ± 0.66	0.28 ± 0.57	0.06 ± 0.25	0.50 ± 0.80
Family	1.07 ± 1.17	0.62 ± 0.74	0.36 ± 0.67	0.06 ± 0.25	0.63 ± 0.93
Genus	1.38 ± 1.47	0.68 ± 0.82	0.54 ± 0.88	0.09 ± 0.30	0.83 ± 1.17
Species	1.67 ± 1.44	1.16 ± 0.99	0.98 ± 1.05	0.21 ± 0.45	1.23 ± 1.23
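The entropy formula is not spelled out in this overview. A common choice for measuring how taxonomically mixed a cluster is would be the Shannon entropy of the taxon labels of its ORFs at a given rank, sketched below. Both the formula and the logarithm base (2 here) are assumptions, not confirmed by the text, and the function name is illustrative.

```python
import math
from collections import Counter


def taxonomic_entropy(labels, base=2.0):
    """Shannon entropy of the taxon labels in one cluster at one rank.

    labels: one taxon name per annotated ORF (hypothetical input layout).
    base: logarithm base; 2 is an assumption made for this example.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one taxonomic label is required")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p, base)
    return entropy


# A cluster split evenly between two phyla.
print(taxonomic_entropy(["Proteobacteria", "Firmicutes"]))
```

Under this definition a cluster whose ORFs all share one taxon has entropy 0, which would be consistent with the median phylum entropy of 0 reported in the general statistics.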

References

[1] Hyatt, Doug, Gwo-Liang Chen, Philip F. LoCascio, Miriam L. Land, Frank W. Larimer, and Loren J. Hauser. 2010. “Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification.” BMC Bioinformatics 11 (1): 119.

[2] Perdigão, Nelson, Agostinho C. Rosa, and Seán I. O’Donoghue. 2017. “The Dark Proteome Database.” BioData Mining 10 (1): 1–11.

[3] Bennett, K. D. 1996. “Determination of the Number of Zones in a Biostratigraphical Sequence.” The New Phytologist 132 (1): 155–70.
