Skip to content

Latest commit

 

History

History
184 lines (119 loc) · 8.99 KB

r-validation.md

File metadata and controls

184 lines (119 loc) · 8.99 KB
layout title
page
Clustering and annotation results

Our de novo ORF cluster validation relies in two different approaches, from one side we test the compositional homogeneity of each cluster based in the sequence space, and from the other side, we evaluate the functional homogeneity of each cluster based on the Pfam annotations.

![](/img/pipeline_validation.png){:height="80%" width="80%"}

Note: All the results presented here are for those clusters >= 10 members

In overall, MMseqs2 does an amazing job creating very homogenous clusters.

Compositional homogeneity

As a brief reminder of our approach:

![](/img/pipeline_validation_ch.png){:height="50%" width="50%"}

Note: The most important factor of this evaluation is the number of sequences rejected in each cluster by LEON-BIS/OD-SEQ

The compositional homogeneity evaluation of the clusters confirms a good cluster quality at the sequence level. Of the ∼2.6 million clusters, 125,390 contain a rejected sequence (i.e. bad aligned sequence). About ∼9.6K clusters have more than 10% rejected sequences and were thus classified as “bad”.

![](/img/results_validation.jpg){:height="50%" width="50%"}

Those 125,390 rejected clusters represented 183,393 ORFs

Clusters ORFs Rejected clusters Rejected sequences
2,624,229 106,201,515 125,390 183,393

Note: For 13 clusters, all belonging to the not annotated set of clusters, we were not able to run mmseqs2 to retrieve the alignments,

From those 183,393 sequences the majority (~67%) were from TARA metagenomes:

Total TARA MALASPINA OSD GOS
183,393 122,938 29,541 9,318 21,596

We decided to use the 10% threshold for the number of rejected sequences after exploring the “scree plot” showing the relationship between number of clusters and rejected sequences:

We also investigated if there is any relationship between the number of rejected sequences and the size of the clusters:

Functional validation

As a brief reminder of our approach:

We are going to check the functional evaluation, but first, we will use the new representatives refined during the MSA based approach described previously. Is very interesting to see that with new representatives we have been able to increase the number of annotated cluster representatives:

Old Rep_annot New Rep_annot Difference Good/Kept new Rep_annot
929,946 952,507 22,561 948,070

We will present the results based on the different categories based on the annotation’s groups

Note: Since we are going to keep only the clusters with high sequence-level homogeneity (good multiple sequence alignment results), we decided to filter the annotated clusters based on the raw Jaccard median similarity, and not on the one scaled by the percentage/proportion of annotated members. We are showing the scaled results just for documentation purposes

Clusters with annotated representatives:

The following plots shows the Jaccard similarity distribution for the comparisons scaled by the number of annotated members in the cluster (right) and the ones not taking in account the number of annotated members (left)

Rep annot clusters Jacc. median raw == 1 Jacc. median scaled > 0.75
952,507           948,302 (99.5%) 777,703 (81.6%)

Based on the type of annotations in each cluster:

Rep_annot HA MoDA MuDA
952,507 839,120 3,866 109,521

HA: Homogeneous annotations
MoDA: Mono-domain different annotations
MuDA: Multi-domain different annotations

Cluster with at least one annotated member (representative excluded):

The following plots shows the Jaccard similarity distribution for the comparisons scaled by the number of annotated members in the cluster (right) and the ones not taking in account the number of annotated members (left)

Other_annot clusters Jacc. median raw == 1 Jacc. median scaled > 0.75
252,617          250,373 (99.1%) 15,474 (6.12%)

Based on the type of annotations in each cluster:

No rep annot HA MoDA MuDA
252,617 243,925 3,706 4,986

HA: Homogeneous annotations
MoDA: Mono-domain different annotations
MuDA: Multi-domain different annotations

All annotated clusters:

The following plots shows the Jaccard similarity distribution for the comparisons scaled by the number of annotated members in the cluster (right) and the ones not taking in account the number of annotated members (left)

Based on the type of annotations in each cluster:

Annot. clusters HA MoDA MuDA
1,205,124 1,083,045 7,572 114,507

HA: Homogeneous annotations
MoDA: Mono-domain different annotations
MuDA: Multi-domain different annotations

Integrating both cluster evaluations strategies

We combined both strategies to have a selection with the highest quality clusters. From 2,624,229 protein clusters:

Clusters with <= 10% rejected sequences (2,614,684)

Clusters Rep-annot Norep-annot No-annot
2,614,684   948,070 251,858 1,414,756
    -     HA MoDA MuDA
- 1,079,301 7,501 113,126

Clusters with > 10% rejected sequences (9,545)

Clusters Rep-annot Norep-annot No-annot
9,545  4,437 759 4,349
    -   HA MoDA MuDA
- 3,744 71 1,381

A more detailed view of the relationship between the proportion of rejected ORFs identified by LEON-BIS and the average ORF similarity in each cluster (In red rejected clusters).

In total we kept 2,614,684 protein clusters with less than 10% rejected sequences. From those, we removed those clusters that had median Jaccard similarity < 1. In total, we had 2,608,331 high quality clusters.

Good clusters Bad clusters
2,608,331 (99%) 15,898 (1%)

Summary of the clusters that have been removed for the downstream analyses:

Pfam-DUF Pfam-not-DUF Not-annotated
549 (2,6%) 11,00 (70%) 4,349 (26.4%)

Summary of the clusters that are included for the downstream analyses:

Pfam-DUF Pfam-not-DUF Not-annotated
65,456 (2.5%)  1,128,119 (43,2%) 1,414,756 (54.3%)