layout | title |
---|---|
page |
Cluster update - Genome Taxonomy Database genomes |
The Genome Taxonomy Database (GTDB):
Genomes | Archaea | Bacteria |
---|---|---|
127,318 | 2,075 | 125,243 |
Release 03-RS86 (19th August 2018)
"A phylogeny inferred from the concatenation of 120 ubiquitous single-copy proteins, and we used this phylogeny to propose a bacterial taxonomy that covers 94,759 bacterial genomes, including 13,636 (14.4%) from uncultured organisms (metagenome- assembled or single-cell genomes). Taxonomic groups in this classification describe monophyletic lineages of similar phylogenetic depth after normalization for lineage-specific rates of evolution. This taxonomy, which we have named the GTDB taxonomy, is publicly available at the Genome Taxonomy Database website (http://gtdb.ecogenomic.org/)." (Parks et al. 2018) [1]
Bacterial and archaeal ORFs from the Genome Taxonomy Database (GTDB) Release 03-RS86 (19th August 2018) [1] were integrated in our clusters. For the integration we used the same data used and processed by Mendler at al. (Annotree) [2]. The dataset consist of ~94M bacterial and archaeal ORFs stemming from 28,941 genomes (27,372 bacterial and 1,569 archaeal). We used the clusterupdate program of the MMSeqs2 software [3], with the same parameters applied for the original clustering (minimum sequence identity set to 30% and bi-directional coverage to 80%). The new sequences were queried against the old cluster representatives. Those that passed the specified thresholds were merged with the previous clustering. The remaining unmapped sequences were then clustered separately, following the same steps of the original clustering. The new clusters composed solely of GTDB ORFs were validated and classified following the main workflow steps.
1) Incremental clustering update with the GTDB sequences:
We downloaded the protein sequences from bacterial and archaeal genomes stored in the Annotree website at https://data.ace.uq.edu.au/public/misc_downloads/annotree/r86/.
NB: We renamed the predicted genes to follow the nomenclature used for the metagenomic dataset.
We collected 90,621,864 proteins from 27,372 bacterial genomes, and 3,101,326 from 1,569 archaeal genomes:
Genomes | Proteins | |
---|---|---|
Bacterial | 27,372 | 90,621,864 |
Archaeal | 1,569 | 3,101,326 |
Total | 28,941 | 93,723,190 |
{: style="margin-left: auto; margin-right: auto; width: 60%"} |
Number of genomes and predicted genes by Prodigal {: style="color:gray; font-size: 90%; text-align: center;"}
Number of predicted genes per completeness category.
Total | “00” | “10” | “01” | “11” |
---|---|---|---|---|
93,723,190 | 90,388,583 | 1,604,489 | 1,594,074 | 136,044 |
Note: “00”=complete, both start and stop codon identified. “01”=right boundary incomplete. “10”=left boundary incomplete. “11”=both left and right edges incomplete.
We then integrated the collected genomes in the metagenomic gene clusters, using the incremental clustering algorithm of MMseqs2.
Scripts and description: The renaming script renaming_orfs.sh downloads and renames the GTDB predicted genes. The genomic genes are then merged with the metagenomic ones with the script combined_seq_db.sh. The incremental clustering script clustering_update.sh takes in input the old cluster database in MMseqs2 format, and the new combined sequence database. The output is a new integrated cluster database. To further parse the results the script clustering_updt_res.sh is used. From the parsing we obtain a sequence database of the clusters, tables containing information about the cluster representative, the size and the cluster members, and whether a cluster is from the original clustering, an updated one or a new cluster, i.e made of only new sequeunces.
2) The only-GTDB clusters are parsed through the next steps of the workflow:
Scripts and description: The scripts are the same used on the original clustering results.
Metagenomes + Genomes cluster database ("mg_gtdb_db_20190502", MG+GTDB): 72% of GTDB ORFs (67,446,376) were found in our clusters (in 2,163,381 metagenomic clusters).
MG + GTDB cluster DB
Metagenomes | GTDB genomes | Total | |
---|---|---|---|
Clusters | 32,465,074 | 7,958,475 | 40,423,549 |
ORFs | 322,248,552 | 93,723,190 | 415,971,742 |
The remaining (unmapped) 26M (28%) GTDB ORFs were clustered in 7,958,475 clusters. Around 5.6M resulted as singletons.
Only-GTDB clusters:
GTDB not-singletons | GTDB singletons | Total | |
---|---|---|---|
Clusters | 2,400,037 | 5,558,438 | 7,958,475 |
ORFs | 20,718,376 | 5,558,438 | 26,276,814 |
Total GTDB clusters (only-GTDB + GTDB in MG): 10,121,856
The new/only-GTDB non-singletons clusters have an average size, i.e. number of ORFs, of 8.6 (median=3), with a maximum of 18,114 ORFs and a minimum of 2.
Only-GTDB cluster dataset
Clusters | ORFs |
---|---|
2,400,037 | 20,718,376 |
- Pfam annotation
We were able to annotate to Pfam protein domain families only 41% of the GTDB ORFs not found in the metagenomic clusters.
Number of cluster ORFs Pfam annotated and not
Annotated | Not annotated | |
---|---|---|
ORFs | 8,404,082 (41%) | 12,314,294 (59%) |
The distribution of the Pfam annotations in the clusters resulted into a set ~500K annotated clusters and a set of 1.8M not-annotated clusters. Nubers are shown in the following table:
Pfam annotations at the cluster level
Annotated | Not annotated | |
---|---|---|
Clusters | 556,834 (23%) | 1,843,203 (77%) |
ORFs | 10,091,203 (49%) | 10,627,173 (51%) |
- Identification of spurious and shadow ORFs
We screened the AntiFam database to identify potential spurious ORFs. We found only 0.02% of clusters ORFs being potentially spurious.
Spurious ORFs:
ORFs in clusters | ORFs in singletons |
---|---|
3,252 (0.02%) | 1,312 (0.02%) |
We identified 1% of clusters ORFs as shadows.
Shadow ORFs:
ORFs in clusters | ORFs in singletons |
---|---|
223,535 (1%) | 125,262 (2%) |
- Cluster validation
The cluster validation showed a high cluster quality in terms of both cluster functional and compositional homogeneity. Results for the two validastion steps are reported in the following tables:
Functional validation:
Good | Bad | |
---|---|---|
Clusters | 542,410 (97%) | 14,424 (3%) |
ORFs | 9,865,550 (98%) | 225,653 (2%) |
Compositional validation:
Good | Bad | |
---|---|---|
Clusters | 2,361,585 (98%) | 38,452 (2%) |
ORFs | 20,364,454 (98%) | 353,922 (2%) |

Proportion of bad-aligned/non-homologous ORFs detected within each cluster MSA. Distribution of observed values compared with those of the Broken-stick model. The threshold was determined at 11% non-homologous ORFs per cluster.
Overall 98% of the clusters and 97% of their ORFs were classified as "good", and only 2% were discarded.
Combined validation results:
Good | Bad | |
---|---|---|
Clusters | 2,347,502 (98%) | 52,535 (2%) |
ORFs | 20,141,636 (97%) | 576,740 (3%) |
Good clusters:
Good | Annotated | Not annotated |
---|---|---|
Clusters | 530,503 | 1,816,999 |
ORFs | 9,749,442 | 10,392,194 |
- Classification
We searched the not annoataed clusters consensus against the UniRef90 database. The majority (86%) reported at least one match.
Not annotated cluster consensus vs UniRef90
Hits | No-hits |
---|---|
1,570,094 (86%) | 246,905 (14%) |
The majority (81%) of the hits were found labeled as "hypothetical" proteins.
Hits functional characterisation in UniRef90
Hits to hypothetical proteins | Hits to not-hypothetical prot. |
---|---|
1,266,090 (70%) (81% of the hits) | 304,004 (19% of the hits) |
We searched then the 246,905 consensus sequences without a hit in UniRef90 against the NCBI-nr database. Only 12% of the queries was found in the NCBI-nr database.
Consensus UniRef90-nohits vs NCBI-nr
Hits | No-hits |
---|---|
28,704 (12%) | 218,201 (88%) |
The large majority of the hits (96%) was found annotated to "hypotetical" proteins.
Hits functional characterisation in NCBI-nr
Hits to hypothetical proteins | Hits to not-hypothetical prot. |
---|---|
27,424 (11%) (96% of the hits) | 1,280 (4% of the hits) |
We processed the annotated clusters to retrieve the cluster consensus Pfam domain architecture (DA). The retrieved numbers divided into annotated to domains of unknown function (DUFs) and domains of known function (PFs) are reported in the next table.
Annotated DUFs | Annotated PFs (known function) |
---|---|
65,688 (12%) | 464,815 (88%) |
The only-GTDB cluster categories before the known and unknown refinement resulted dominated by the GUs (58%).
Only-GTDB cluster categories
K | KWP | GU | EU |
---|---|---|---|
464,815 (20%) | 305,284 (13%) | 1,359,202 (58%) | 218,201 (9%) |
- Categories refinement
We found that 69% of the PGUs show remote homology to a Uniclust entry/protein. Of the matching clusters, 144,295 resulted in distant homologs of hypothetical proteins and were moved to the GU category, whereas 5,704 clusters matched characterized proteins and were transferred to the KWP set. Hence, after this refinement step, the number of PGUs reduced to 68,202 clusters. The search of the KWP cluster HMMs against the Pfam database resulted in 56% of KWP clusters being remote homologous of Pfam enries. Of this set, the majority, 152,529 clusters, were annotated to PFs, and 22,053 clusters to DUFs.
Cluster category refinement steps:
K | KWP | GU | EU | |
---|---|---|---|---|
Clusters (pre-EUs_refinement) | 464,815 | 305,284 | 1,359,202 | 218,201 |
EUs refinement | - | +5,704 | +144,295 | -149,999 |
Clusters (post-EUs_refinement) | 464,815 | 310,988 | 1,503,497 | 68,202 |
KWPs refinement | +152,529 | -174,582 | +22,053 | - |
Clusters (post-KWPs_refinement) | 617,344 (26%) | 136,406 (6%) | 1,525,550 (65%) | 68,202 (3%) |
After the refinement the only-GTDB clusters appeared dominated by the GU set, which accounts for 65% of the clusters.
Only-GTDB clusters final categories
K | KWP | GU | PGU | Total | |
---|---|---|---|---|---|
Clusters | 617,344 | 136,406 | 1,525,550 | 68,202 | 2,347,502 |
ORFs | 9,997,529 | 663,107 | 9,305,621 | 175,379 | 20,141,636 |
- GTDB cluster community inference
At first we searched GTDB clustersin the metagenomic data set communities. Then, for the clusters not found in the MG-communities, we performed the community inference. The best inflation value for the cluster aggregation was determined at 2.5, and we obtained a total of ~500K communities. (Both slightly higher number than those found for the metagenomic dataset/clusters)

Radar plots used to determine the best MCL inflation value for the partitioning of the Ks into cluster components. The plots were built using a combination of five variables: 1=proportion of clusters with 1 component and 2=proportion of clusters with more than 1 member, 3=clan entropy (proportion of clusters with entropy = 0), 4=intra hhblits score-per-column (normalised by the maximum value), and 5=number of clusters (related to the non-redundant set of DAs).
GTDB cluster community categories
K | KWP | GU | EU | Total | |
---|---|---|---|---|---|
Communities | 52,360 | 47,203 | 339,468 | 57,899 | 496,930 |
Clusters | 617,344 | 136,406 | 1,525,550 | 68,202 | 2,347,502 |
ORFs | 9,997,529 | 663,107 | 9,305,621 | 175,379 | 20,141,636 |
A set of these clusters and cluster communities, was found in the metagenomic communities. Numbers are shown in the following table:
GTDB cluster found in MG communities
K | KWP | GU | EU | Total | |
---|---|---|---|---|---|
MG-communities | 14,241 | 20,399 | 69,204 | 2,799 | 106,643 |
Clusters | 357,952 | 82,307 | 259,190 | 3,393 | 702,842 |
As for the metagenomic cluster database, we retrieved the genomic "High Quality" (mostly complete) set of clusters:
GTDB HQ clusters
Category | HQ cluster | HQ ORFs | pHQ_orfs | pHQ_cl |
---|---|---|---|---|
K | 12,202 | 25,105,156 | 0.0096 | 0.0198 |
KWP | 4,019 | 1,349,165 | 0.0214 | 0.0295 |
GU | 12,699 | 8,403,393 | 0.0062 | 0.0083 |
EU | 438 | 471,820 | 0.0074 | 0.0064 |
Only MG
K | KWP | GU | EU | Total | |
---|---|---|---|---|---|
Clusters | 1,050,166 | 632,453 | 1,121,809 | 135,829 | 2,940,257 |
ORFs | 172,147,128 | 30,601,694 | 54,052,275 | 3,341,257 | 260,142,354 |
All GTDB (Including ORFs falling in the MG clusters)
K | KWP | GU | EU | Total | |
---|---|---|---|---|---|
Clusters | 1,115,167 | 263,702 | 1,814,233 | 76,999 | 3,270,101 |
ORFs | 58,494,638 | 2,152,671 | 14,457,060 | 192,950 | 75,297,319 |
MG + GTDB
K | KWP | GU | EU | Total | |
---|---|---|---|---|---|
Clusters | 1,667,510 | 768,859 | 2,647,359 | 204,031 | 5,287,759 |
ORFs | 230,641,766 | 32,754,365 | 68,509,335 | 3,534,207 | 335,439,673 |
Total singletons and their completion
Total | Complete "00" | Partial "01" | Partial "10" | Partial "11" |
---|---|---|---|---|
5,558,438 | 4,682,584 | 416,323 | 417,523 | 42,008 |
- Pfam annotation
Number of Pfam annotated and not-annotated singletons
Singletons | |
---|---|
Annotated | 535,012 (10%) |
Not-annotated | 5,023,426 (90%) |
Annotated DUFs | Annotated PF (known function) |
---|---|
61,552 (11%) | 473,460 (89%) |
- Classification
Sixty-five percent of the 5,023,426 not annotated singletons, were found in the UniRef90 DB, as shown in the table below:
Not-annotated singletons vs UniRef90
Hits (unique) | No-hits |
---|---|
3,249,823 (65%) | 1,773,603 (35%) |
The 3,249,823 hits, are divided into 73% hypothetical and 27% characterised UniRef90 protein entries:
Hits functional characterisation in UniRef90
Hits to hypothetical proteins | Hits to not-hypothetical prot. |
---|---|
2,361,654 (47%) (73% of the hits) | 888,169 (27% of the hits) |
We searched the consensus seqeunces that reported no matches to the UniRef90 entries (1,773,603) against the NCBI nr database (version 28-08-2017; 130,469,055 non-redundant proteins). We were able to annotate/retrieve only 1.3% of them.
UniRef90-nohits singletons vs NCBI-nr
Hits | No-hits |
---|---|
113,122 (1.3%) | 1,660,481 (98.7%) |
We then screened the hits for "hypothetical", and we found 93% of them being classified as "hypothetical".
Hits functional characterisation in NCBI-nr
NR hypothetical | NR not-hypothetical |
---|---|
105,164 (0.6%) (93% of the hits) | 7,958 (7% of the hits) |
- Singletons categories
The GTDB singletons appeared to be mostly GUs and PGUs.
K | KWP | GU | EU |
---|---|---|---|
473,460 (9%) | 896,127 (16%) | 2,528,370 (45%) | 1,660,481 (30%) |
A fraction of the GTDB ORFs were found similar to metagenomic singletons or small clusters.
The majority (55,155,683) (82%) of the 67,446,376 GTDB ORFs found in the metagenomic dataset belonged to the refined set of clusters, 3,700,844 ORFs were found in singletons, 7,010,987 ORFs in small clusters (clusters with less than 10 members) and 1,578,862 ORFs in clusters discarded during the validation/refinement step.
After the integration of genomic ORFs, 52,758 singletons and 187,953 small clusters became clusters with more than 10 members.
Singletons now >=10: 52,758 (0.3%) (total singletons now with one or more GTDB orfs 492,238 (2.5%))
Small clusters now >=10: 187,953 (2%) (total small clusters with one or more GTDB orfs 731,492 (8%))
New MG+GTDB clusters with more than 10 ORFs
Clusters | ORFs |
---|---|
240,711 | 10,136,816 |
MG singletons | MG singl now LT10 | MG singl now GE10 |
---|---|---|
19,419,086 | 439,480 | 52,758 |
MG+GTDB singletons: 24,977,524
Currently under validation ...
[1] D. H. Parks, M. Chuvochina, D. W. Waite, C. Rinke, A.Skarshewski, P-A Chaumeil & P. Hugenholtz, "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nat Biotechnology, Aug. 2018.
[2] K. Mendler, H. Chen, D. H. Parks, B. Lobb, L. A. Hug and A. C. Doxey, "AnnoTree: visualization and exploration of a functionally annotated microbial tree of life" Nucleic Acids Research, Mar. 2019.
[3] M. Steinegger and J. Söding, “MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.,” Nature biotechnology, vol. 35, no. 11, pp. 1026–1028, Nov. 2017.