Options for higher protein-to-genome alignment sensitivity #82

ohdongha · 2025-02-05T04:11:04Z

Dear Osamu, @ogotoh

Hello again!

Continuing from issues #78 and #80, we have been testing spaln (v.3.0.6d) to generate protein-to-genome alignments that can provide evidence for our genome annotation pipeline.

For the test, we tried aligning ~217K (mostly) Hymenoptera proteins to a wasp genome and comparing the results with miniprot (v.0.13). Below is the code we tried:

## miniprot command used
miniprot -p 0.4 --outs 0.4 -T 1 -G 300000 -t 8 \
   GCF_034640455.1_iyDiaLong2_genomic.fna query.fa # (1)

## spaln command used
spaln -W gnm -KP GCF_034640455.1_iyDiaLong2_genomic.fna.gz
spaln -Q7 -M25 -d gnm -A0 -O2 -t8                  # (2)
spaln -Q7 -M25 -d gnm -A0 -O2 -t8 -T InsectAp      # (3)

Here is the summary of the results:

(1) miniprot aligned ~169K (77.7%) of the input proteins. Among them, ~22K were aligned to multiple locations.
(2) spaln -Q7 -M25 aligned ~146K (67.2%) of the input proteins. Among them, ~8.8K were aligned to multiple locations.
(3) spaln -Q7 -M25 -T InsectAp aligned slightly more proteins than (2) (by a couple of hundred), but still fewer than miniprot.
spaln -Q4 did not add much compared to spaln -Q7.
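
Counts like those above can be reproduced from GFF3 output with a few lines of awk. The sketch below is not the script we used. It assumes each alignment is one row of a given feature type carrying a `Target=<query> <start> <end>` attribute (as in the mRNA rows of miniprot --gff output); the function name `summarize_gff3` is made up for illustration, and the feature type to pass for spaln -O2 output should be checked against the actual file.

```shell
# Summarize a GFF3 stream: distinct aligned queries and queries with >1 locus.
# Usage: summarize_gff3 FEATURE_TYPE TOTAL_QUERIES < alignments.gff3
# Assumption: one row of FEATURE_TYPE per alignment, with a Target= attribute.
# TOTAL_QUERIES must be non-zero (it is the denominator of the percentage).
summarize_gff3() {
    awk -F '\t' -v type="$1" -v total="$2" '
        $0 !~ /^#/ && $3 == type {
            # Pull the query name out of "Target=<name> <start> <end>".
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^Target=/) {
                    split(substr(attrs[i], 8), t, " ")
                    hits[t[1]]++
                }
            }
        }
        END {
            for (q in hits) { aligned++; if (hits[q] > 1) multi++ }
            printf "aligned=%d (%.1f%%) multi=%d\n", aligned, 100 * aligned / total, multi
        }'
}
```

For example, `summarize_gff3 mRNA 217000 < miniprot.gff3` would report the aligned fraction against ~217K input proteins (the file name is a placeholder).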

...

In the miniprot command, we relaxed some parameters to allow less homologous alignments and longer introns:

 -p FLOAT     min secondary-to-primary score ratio [0.7]      # we used 0.4
 --outs=FLOAT output if score at least FLOAT*bestScore [0.99] # we used 0.4
 -G NUM       max intron size; override -I [200k]             # we used 300k

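To make the --outs rule quoted above concrete ("output if score at least FLOAT*bestScore"), here is a toy filter over made-up scores. It only illustrates the threshold arithmetic and is not miniprot's implementation; the function name `filter_by_ratio` and the input format (one "name score" pair per line) are assumptions.

```shell
# Keep hits whose score is at least RATIO * (best score among all hits).
# Input on stdin: one "name score" pair per line (hypothetical format).
filter_by_ratio() {
    awk -v ratio="$1" '
        { name[NR] = $1; score[NR] = $2; if (NR == 1 || $2 > best) best = $2 }
        END {
            for (i = 1; i <= NR; i++)
                if (score[i] >= ratio * best) print name[i], score[i]
        }'
}

# With ratio 0.4 and a best score of 1000, the cutoff is 400.
printf 'locusA 1000\nlocusB 450\nlocusC 390\n' | filter_by_ratio 0.4
```

With the default ratio of 0.99, only locusA would pass; lowering the ratio to 0.4 also keeps locusB, which is how the relaxed setting reports more secondary alignments.
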
With these modifications, miniprot produced more homologous protein alignment evidence for gene model predictions. The complete BUSCO content (hymenoptera_odb10) was 0.2 percentage points higher with miniprot alignments than with spaln alignments when all other steps in our genome annotation pipeline were identical:

C:97.3%[S:96.7%,D:0.6%],F:0.8%,M:1.9%,n:5991 # (1)
C:97.1%[S:96.4%,D:0.7%],F:0.8%,M:2.1%,n:5991 # (2) and (3)
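
The difference between the two summaries can be checked with a small helper. This is only a convenience sketch; the function names `busco_complete` and `busco_delta` are made up, and it assumes the one-line summary format shown above.

```shell
# Extract the complete (C) percentage from a BUSCO one-line summary.
busco_complete() {
    printf '%s\n' "$1" | sed -n 's/^C:\([0-9.]*\)%.*/\1/p'
}

# Difference in complete BUSCO percentage points between two summaries.
busco_delta() {
    awk -v a="$(busco_complete "$1")" -v b="$(busco_complete "$2")" \
        'BEGIN { printf "%.1f\n", a - b }'
}

busco_delta 'C:97.3%[S:96.7%,D:0.6%],F:0.8%,M:1.9%,n:5991' \
            'C:97.1%[S:96.4%,D:0.7%],F:0.8%,M:2.1%,n:5991'
```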

We are now comparing gene models annotated and protein alignments by miniprot and spaln to see differences not captured in BUSCO evaluations.

...

My questions are:

1. Are there spaln parameters that allow reporting less homologous secondary alignments and longer introns, similar to the miniprot command we used?
2. Are there any other modifications to increase spaln sensitivity (i.e., more proteins aligned) you would recommend?

Thanks a lot!
Cheers,
Dong-Ha


ohdongha · 2025-02-06T00:08:08Z

> We are now comparing gene models annotated and protein alignments by miniprot and spaln to see differences not captured in BUSCO evaluations.

Upon visual inspection of the alignments on the Genome Data Viewer, we saw a pattern consistent with Figure 2a-c of the spaln v3 paper:

Often spaln v3 alignments covered the entire exon with boundaries matching the RefSeq gene models.
In some cases, miniprot reported partial alignments while spaln did not report any. Also, sometimes spaln misses exons towards the 5' ends.

Here are some GDV examples of the wasp genome chr16 (all URLs live for ~90 days since the last visit)

Examples where spaln alignments better captured the exact coding exon boundaries than miniprot:
https://tinyurl.com/spaln-miniprot-ex1
https://tinyurl.com/spaln-miniprot-ex2
https://tinyurl.com/spaln-miniprot-ex3
https://tinyurl.com/spaln-miniprot-ex4
https://tinyurl.com/spaln-miniprot-ex5
Examples where miniprot had some alignments where spaln did not, or spaln missed the starting exons:
https://tinyurl.com/spaln-miniprot-ex6
https://tinyurl.com/spaln-miniprot-ex7

...

I will now look into what happened with the "missing" BUSCOs only in the annotation using spaln alignments (there were 9~10 of them).

In the meantime, it would be great if there were spaln options to allow more proteins to align, even if sacrificing a bit of the performance.

Thanks!
Dong-Ha

ogotoh · 2025-02-06T02:02:17Z

Dear Dong-Ha,

Thank you for your reports, which are generally consistent with my experience.

The primary purpose of Spaln is to predict the complete structure (protein-coding region only) of the gene orthologous to the query. To improve mapping sensitivity, including for paralogs, the following options might be effective:
-M m (m > 1)
-XP p (p < 1.0)
and
-yX2
p has a similar function to miniprot's --outs option. Setting p to a value smaller than the default (1.0), say 0.5, allows a larger number of paralogs, up to m, to be detected. Note, however, that a large m is not recommended, particularly for BUSCO data, because BUSCO genes are basically unique in each genome.

The -yX2 option is intended to find weaker homologs than the default (-yX1). However, I have not yet confirmed whether this intention is actually realized.

> Examples where miniprot had some alignments where spaln did not, or spaln missed the starting exons:

I have tried a few methods to improve the mapping sensitivity. Unfortunately, I have not yet found one that consistently outperforms the current method. As for the missing starting exons, there seems to be room for improvement. I would be very glad if you could send me a few such examples.

Finally, you can limit the maximum intron length with the -yMn option (e.g., -yM100K).
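
Put together, the options mentioned in this reply could be combined with the command from the original post roughly as follows. This is only a sketch: the helper name `spaln_sensitive_cmd`, the values 4, 0.5 and 300K, and the query file name are illustrative assumptions, and writing option values attached to the flag (e.g. -XP0.5) follows the style of the -M25 and -yM100K examples rather than anything confirmed here. The function prints the command instead of running it so that it can be inspected first.

```shell
# Print (not run) a spaln command using the options suggested in this reply.
# Args: max hits per query (-M), score ratio (-XP), max intron length (-yM), query file.
# All values passed below are illustrative, not recommendations.
spaln_sensitive_cmd() {
    printf 'spaln -Q7 -M%s -XP%s -yX2 -yM%s -d gnm -A0 -O2 -t8 %s\n' \
        "$1" "$2" "$3" "$4"
}

spaln_sensitive_cmd 4 0.5 300K query.fa
```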

Osamu
