layout	title
page	Cluster update - Genome Taxonomy Database genomes

Clusters analysis of the Genome Taxonomy Database and integration with the metagenomic database

The Genome Taxonomy Database (GTDB):

Genomes	Archaea	Bacteria
127,318	2,075	125,243

Release 03-RS86 (19th August 2018)

"A phylogeny inferred from the concatenation of 120 ubiquitous single-copy proteins, and we used this phylogeny to propose a bacterial taxonomy that covers 94,759 bacterial genomes, including 13,636 (14.4%) from uncultured organisms (metagenome- assembled or single-cell genomes). Taxonomic groups in this classification describe monophyletic lineages of similar phylogenetic depth after normalization for lineage-specific rates of evolution. This taxonomy, which we have named the GTDB taxonomy, is publicly available at the Genome Taxonomy Database website (http://gtdb.ecogenomic.org/)." (Parks et al. 2018) [1]

Bacterial and archaeal ORFs from the Genome Taxonomy Database (GTDB) Release 03-RS86 (19th August 2018) [1] were integrated in our clusters. For the integration we used the same data used and processed by Mendler at al. (Annotree) [2]. The dataset consist of ~94M bacterial and archaeal ORFs stemming from 28,941 genomes (27,372 bacterial and 1,569 archaeal). We used the clusterupdate program of the MMSeqs2 software [3], with the same parameters applied for the original clustering (minimum sequence identity set to 30% and bi-directional coverage to 80%). The new sequences were queried against the old cluster representatives. Those that passed the specified thresholds were merged with the previous clustering. The remaining unmapped sequences were then clustered separately, following the same steps of the original clustering. The new clusters composed solely of GTDB ORFs were validated and classified following the main workflow steps.

Methods

1) Incremental clustering update with the GTDB sequences:
We downloaded the protein sequences from bacterial and archaeal genomes stored in the Annotree website at https://data.ace.uq.edu.au/public/misc_downloads/annotree/r86/.

NB: We renamed the predicted genes to follow the nomenclature used for the metagenomic dataset.

We collected 90,621,864 proteins from 27,372 bacterial genomes, and 3,101,326 from 1,569 archaeal genomes:

	Genomes	Proteins
Bacterial	27,372	90,621,864
Archaeal	1,569	3,101,326
Total	28,941	93,723,190
{: style="margin-left: auto; margin-right: auto; width: 60%"}

Number of genomes and predicted genes by Prodigal {: style="color:gray; font-size: 90%; text-align: center;"}

Number of predicted genes per completeness category.

Total	“00”	“10”	“01”	“11”
93,723,190	90,388,583	1,604,489	1,594,074	136,044

Note: “00”=complete, both start and stop codon identified. “01”=right boundary incomplete. “10”=left boundary incomplete. “11”=both left and right edges incomplete.

We then integrated the collected genomes in the metagenomic gene clusters, using the incremental clustering algorithm of MMseqs2.

Scripts and description: The renaming script renaming_orfs.sh downloads and renames the GTDB predicted genes. The genomic genes are then merged with the metagenomic ones with the script combined_seq_db.sh. The incremental clustering script clustering_update.sh takes in input the old cluster database in MMseqs2 format, and the new combined sequence database. The output is a new integrated cluster database. To further parse the results the script clustering_updt_res.sh is used. From the parsing we obtain a sequence database of the clusters, tables containing information about the cluster representative, the size and the cluster members, and whether a cluster is from the original clustering, an updated one or a new cluster, i.e made of only new sequeunces.

2) The only-GTDB clusters are parsed through the next steps of the workflow:

Scripts and description: The scripts are the same used on the original clustering results.

Results

Cluster update results

Metagenomes + Genomes cluster database ("mg_gtdb_db_20190502", MG+GTDB): 72% of GTDB ORFs (67,446,376) were found in our clusters (in 2,163,381 metagenomic clusters).

MG + GTDB cluster DB

	Metagenomes	GTDB genomes	Total
Clusters	32,465,074	7,958,475	40,423,549
ORFs	322,248,552	93,723,190	415,971,742

The remaining (unmapped) 26M (28%) GTDB ORFs were clustered in 7,958,475 clusters. Around 5.6M resulted as singletons.

Only-GTDB clusters:

	GTDB not-singletons	GTDB singletons	Total
Clusters	2,400,037	5,558,438	7,958,475
ORFs	20,718,376	5,558,438	26,276,814

Total GTDB clusters (only-GTDB + GTDB in MG): 10,121,856

The new/only-GTDB non-singletons clusters have an average size, i.e. number of ORFs, of 8.6 (median=3), with a maximum of 18,114 ORFs and a minimum of 2.

2) Characterization of GTDB Clusters

Clusters (> 1 ORFs)

Only-GTDB cluster dataset

Clusters	ORFs
2,400,037	20,718,376

Pfam annotation

We were able to annotate to Pfam protein domain families only 41% of the GTDB ORFs not found in the metagenomic clusters.

Number of cluster ORFs Pfam annotated and not

	Annotated	Not annotated
ORFs	8,404,082 (41%)	12,314,294 (59%)

The distribution of the Pfam annotations in the clusters resulted into a set ~500K annotated clusters and a set of 1.8M not-annotated clusters. Nubers are shown in the following table:

Pfam annotations at the cluster level

	Annotated	Not annotated
Clusters	556,834 (23%)	1,843,203 (77%)
ORFs	10,091,203 (49%)	10,627,173 (51%)

Identification of spurious and shadow ORFs

We screened the AntiFam database to identify potential spurious ORFs. We found only 0.02% of clusters ORFs being potentially spurious.

Spurious ORFs:

ORFs in clusters	ORFs in singletons
3,252 (0.02%)	1,312 (0.02%)

We identified 1% of clusters ORFs as shadows.

Shadow ORFs:

ORFs in clusters	ORFs in singletons
223,535 (1%)	125,262 (2%)

Cluster validation

The cluster validation showed a high cluster quality in terms of both cluster functional and compositional homogeneity. Results for the two validastion steps are reported in the following tables:

Functional validation:

	Good	Bad
Clusters	542,410 (97%)	14,424 (3%)
ORFs	9,865,550 (98%)	225,653 (2%)

Compositional validation:

	Good	Bad
Clusters	2,361,585 (98%)	38,452 (2%)
ORFs	20,364,454 (98%)	353,922 (2%)

Proportion of bad-aligned/non-homologous ORFs detected within each cluster MSA. Distribution of observed values compared with those of the Broken-stick model. The threshold was determined at 11% non-homologous ORFs per cluster.

Overall 98% of the clusters and 97% of their ORFs were classified as "good", and only 2% were discarded.

Combined validation results:

	Good	Bad
Clusters	2,347,502 (98%)	52,535 (2%)
ORFs	20,141,636 (97%)	576,740 (3%)

Good clusters:

Good	Annotated	Not annotated
Clusters	530,503	1,816,999
ORFs	9,749,442	10,392,194

Classification

Cluster consensus vs Uniref90:

We searched the not annoataed clusters consensus against the UniRef90 database. The majority (86%) reported at least one match.

Not annotated cluster consensus vs UniRef90

Hits	No-hits
1,570,094 (86%)	246,905 (14%)

The majority (81%) of the hits were found labeled as "hypothetical" proteins.

Hits functional characterisation in UniRef90

Hits to hypothetical proteins	Hits to not-hypothetical prot.
1,266,090 (70%) (81% of the hits)	304,004 (19% of the hits)

Cluster consensus vs NCBI-nr:

We searched then the 246,905 consensus sequences without a hit in UniRef90 against the NCBI-nr database. Only 12% of the queries was found in the NCBI-nr database.

Consensus UniRef90-nohits vs NCBI-nr

Hits	No-hits
28,704 (12%)	218,201 (88%)

The large majority of the hits (96%) was found annotated to "hypotetical" proteins.

Hits functional characterisation in NCBI-nr

Hits to hypothetical proteins	Hits to not-hypothetical prot.
27,424 (11%) (96% of the hits)	1,280 (4% of the hits)

Classification of the annotated Clusters

We processed the annotated clusters to retrieve the cluster consensus Pfam domain architecture (DA). The retrieved numbers divided into annotated to domains of unknown function (DUFs) and domains of known function (PFs) are reported in the next table.

Annotated DUFs	Annotated PFs (known function)
65,688 (12%)	464,815 (88%)

The only-GTDB cluster categories before the known and unknown refinement resulted dominated by the GUs (58%).

Only-GTDB cluster categories

K	KWP	GU	EU
464,815 (20%)	305,284 (13%)	1,359,202 (58%)	218,201 (9%)

Categories refinement

We found that 69% of the PGUs show remote homology to a Uniclust entry/protein. Of the matching clusters, 144,295 resulted in distant homologs of hypothetical proteins and were moved to the GU category, whereas 5,704 clusters matched characterized proteins and were transferred to the KWP set. Hence, after this refinement step, the number of PGUs reduced to 68,202 clusters. The search of the KWP cluster HMMs against the Pfam database resulted in 56% of KWP clusters being remote homologous of Pfam enries. Of this set, the majority, 152,529 clusters, were annotated to PFs, and 22,053 clusters to DUFs.

Cluster category refinement steps:

	K	KWP	GU	EU
Clusters (pre-EUs_refinement)	464,815	305,284	1,359,202	218,201
EUs refinement	-	+5,704	+144,295	-149,999
Clusters (post-EUs_refinement)	464,815	310,988	1,503,497	68,202
KWPs refinement	+152,529	-174,582	+22,053	-
Clusters (post-KWPs_refinement)	617,344 (26%)	136,406 (6%)	1,525,550 (65%)	68,202 (3%)

After the refinement the only-GTDB clusters appeared dominated by the GU set, which accounts for 65% of the clusters.

Only-GTDB clusters final categories

	K	KWP	GU	PGU	Total
Clusters	617,344	136,406	1,525,550	68,202	2,347,502
ORFs	9,997,529	663,107	9,305,621	175,379	20,141,636

GTDB cluster community inference

At first we searched GTDB clustersin the metagenomic data set communities. Then, for the clusters not found in the MG-communities, we performed the community inference. The best inflation value for the cluster aggregation was determined at 2.5, and we obtained a total of ~500K communities. (Both slightly higher number than those found for the metagenomic dataset/clusters)

Radar plots used to determine the best MCL inflation value for the partitioning of the Ks into cluster components. The plots were built using a combination of five variables: 1=proportion of clusters with 1 component and 2=proportion of clusters with more than 1 member, 3=clan entropy (proportion of clusters with entropy = 0), 4=intra hhblits score-per-column (normalised by the maximum value), and 5=number of clusters (related to the non-redundant set of DAs).

GTDB cluster community categories

	K	KWP	GU	EU	Total
Communities	52,360	47,203	339,468	57,899	496,930
Clusters	617,344	136,406	1,525,550	68,202	2,347,502
ORFs	9,997,529	663,107	9,305,621	175,379	20,141,636

A set of these clusters and cluster communities, was found in the metagenomic communities. Numbers are shown in the following table:

GTDB cluster found in MG communities

	K	KWP	GU	EU	Total
MG-communities	14,241	20,399	69,204	2,799	106,643
Clusters	357,952	82,307	259,190	3,393	702,842

As for the metagenomic cluster database, we retrieved the genomic "High Quality" (mostly complete) set of clusters:

GTDB HQ clusters

Category	HQ cluster	HQ ORFs	pHQ_orfs	pHQ_cl
K	12,202	25,105,156	0.0096	0.0198
KWP	4,019	1,349,165	0.0214	0.0295
GU	12,699	8,403,393	0.0062	0.0083
EU	438	471,820	0.0074	0.0064

MG + GTDB new cluster database and categories

Only MG

	K	KWP	GU	EU	Total
Clusters	1,050,166	632,453	1,121,809	135,829	2,940,257
ORFs	172,147,128	30,601,694	54,052,275	3,341,257	260,142,354

All GTDB (Including ORFs falling in the MG clusters)

	K	KWP	GU	EU	Total
Clusters	1,115,167	263,702	1,814,233	76,999	3,270,101
ORFs	58,494,638	2,152,671	14,457,060	192,950	75,297,319

MG + GTDB

	K	KWP	GU	EU	Total
Clusters	1,667,510	768,859	2,647,359	204,031	5,287,759
ORFs	230,641,766	32,754,365	68,509,335	3,534,207	335,439,673

Only-GTDB Singletons

Total singletons and their completion

Total	Complete "00"	Partial "01"	Partial "10"	Partial "11"
5,558,438	4,682,584	416,323	417,523	42,008

Pfam annotation

Number of Pfam annotated and not-annotated singletons

	Singletons
Annotated	535,012 (10%)
Not-annotated	5,023,426 (90%)

Annotated DUFs	Annotated PF (known function)
61,552 (11%)	473,460 (89%)

Classification

Genomic unknowns: search against UniRef90

Sixty-five percent of the 5,023,426 not annotated singletons, were found in the UniRef90 DB, as shown in the table below:

Not-annotated singletons vs UniRef90

Hits (unique)	No-hits
3,249,823 (65%)	1,773,603 (35%)

The 3,249,823 hits, are divided into 73% hypothetical and 27% characterised UniRef90 protein entries:

Hits functional characterisation in UniRef90

Hits to hypothetical proteins	Hits to not-hypothetical prot.
2,361,654 (47%) (73% of the hits)	888,169 (27% of the hits)

Environmental unknowns: search against NCBI nr database

We searched the consensus seqeunces that reported no matches to the UniRef90 entries (1,773,603) against the NCBI nr database (version 28-08-2017; 130,469,055 non-redundant proteins). We were able to annotate/retrieve only 1.3% of them.

UniRef90-nohits singletons vs NCBI-nr

Hits	No-hits
113,122 (1.3%)	1,660,481 (98.7%)

We then screened the hits for "hypothetical", and we found 93% of them being classified as "hypothetical".

Hits functional characterisation in NCBI-nr

NR hypothetical	NR not-hypothetical
105,164 (0.6%) (93% of the hits)	7,958 (7% of the hits)

Singletons categories

The GTDB singletons appeared to be mostly GUs and PGUs.

K	KWP	GU	EU
473,460 (9%)	896,127 (16%)	2,528,370 (45%)	1,660,481 (30%)

Enrichment of former metagenomic small clusters (less than 10 ORFs) and singletons

A fraction of the GTDB ORFs were found similar to metagenomic singletons or small clusters.

The majority (55,155,683) (82%) of the 67,446,376 GTDB ORFs found in the metagenomic dataset belonged to the refined set of clusters, 3,700,844 ORFs were found in singletons, 7,010,987 ORFs in small clusters (clusters with less than 10 members) and 1,578,862 ORFs in clusters discarded during the validation/refinement step.

After the integration of genomic ORFs, 52,758 singletons and 187,953 small clusters became clusters with more than 10 members.

Singletons now >=10: 52,758 (0.3%) (total singletons now with one or more GTDB orfs 492,238 (2.5%))

Small clusters now >=10: 187,953 (2%) (total small clusters with one or more GTDB orfs 731,492 (8%))

New MG+GTDB clusters with more than 10 ORFs

Clusters	ORFs
240,711	10,136,816

MG singletons	MG singl now LT10	MG singl now GE10
19,419,086	439,480	52,758

MG+GTDB singletons: 24,977,524

Currently under validation ...

References

[1] D. H. Parks, M. Chuvochina, D. W. Waite, C. Rinke, A.Skarshewski, P-A Chaumeil & P. Hugenholtz, "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nat Biotechnology, Aug. 2018.

[2] K. Mendler, H. Chen, D. H. Parks, B. Lobb, L. A. Hug and A. C. Doxey, "AnnoTree: visualization and exploration of a functionally annotated microbial tree of life" Nucleic Acids Research, Mar. 2019.

[3] M. Steinegger and J. Söding, “MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.,” Nature biotechnology, vol. 35, no. 11, pp. 1026–1028, Nov. 2017.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

12_GTDB_genome_integration.md

12_GTDB_genome_integration.md

Clusters analysis of the Genome Taxonomy Database and integration with the metagenomic database

Methods

Results

Cluster update results

2) Characterization of GTDB Clusters

Clusters (> 1 ORFs)

Cluster consensus vs Uniref90:

Cluster consensus vs NCBI-nr:

Classification of the annotated Clusters

MG + GTDB new cluster database and categories

Only-GTDB Singletons

Genomic unknowns: search against UniRef90

Environmental unknowns: search against NCBI nr database

Enrichment of former metagenomic small clusters (less than 10 ORFs) and singletons

References

Files

12_GTDB_genome_integration.md

Latest commit

History

12_GTDB_genome_integration.md

File metadata and controls

Clusters analysis of the Genome Taxonomy Database and integration with the metagenomic database

Methods

Results

Cluster update results

2) Characterization of GTDB Clusters

Clusters (> 1 ORFs)

Cluster consensus vs Uniref90:

Cluster consensus vs NCBI-nr:

Classification of the annotated Clusters

MG + GTDB new cluster database and categories

Only-GTDB Singletons

*Genomic unknowns*: search against UniRef90

*Environmental unknowns*: search against NCBI nr database

Enrichment of former metagenomic small clusters (less than 10 ORFs) and singletons

References

Genomic unknowns: search against UniRef90

Environmental unknowns: search against NCBI nr database