Options for higher protein-to-genome alignment sensitivity #82

Open
ohdongha opened this issue Feb 5, 2025 · 2 comments

ohdongha commented Feb 5, 2025

Dear Osamu, @ogotoh

Hello again!

Continuing from issues #78 and #80, we have been testing spaln (v.3.0.6d) to generate protein-to-genome alignments that can provide evidence for our genome annotation pipeline.

For the test, we aligned ~217K (mostly) Hymenoptera proteins to a wasp genome and compared the results with those of miniprot (v.0.13). Below are the commands we used:

## miniprot command used
miniprot -p 0.4 --outs 0.4 -T 1 -G 300000 -t 8 \
   GCF_034640455.1_iyDiaLong2_genomic.fna query.fa # (1)

## spaln command used
spaln -W gnm -KP GCF_034640455.1_iyDiaLong2_genomic.fna.gz
spaln -Q7 -M25 -d gnm -A0 -O2 -t8                  # (2)
spaln -Q7 -M25 -d gnm -A0 -O2 -t8 -T InsectAp      # (3)

Here is the summary of the results:

  • (1) miniprot aligned ~169K (77.7%) of the input proteins. Among them, ~22K were aligned to multiple locations.
  • (2) spaln -Q7 -M25 aligned ~146K (67.2%) of the input proteins. Among them, ~8.8K were aligned to multiple locations.
  • (3) spaln -Q7 -M25 -T InsectAp aligned slightly more proteins than (2) (by a couple of hundred), but still far fewer than miniprot.
  • spaln -Q4 did not add much compared to spaln -Q7.
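Counts like these can be tallied from a list of query IDs, one line per reported alignment. The sketch below is only an illustration and not necessarily how the numbers above were produced; the helper name `count_aligned` is hypothetical, and extracting the query ID column from each tool's output is left to the caller.

```shell
# Hypothetical helper: reads one query ID per alignment from stdin and
# reports how many distinct queries aligned and how many aligned more than once.
count_aligned() {
    sort | uniq -c | awk '
        { total++; if ($1 > 1) multi++ }   # $1 is the per-query alignment count
        END { printf "aligned=%d multi=%d\n", total, multi }'
}

# Toy input: p1 has two alignments, p2 and p3 have one each.
printf 'p1\np2\np1\np3\n' | count_aligned
```

For PAF output, where the query name is the first column, something like `cut -f1 out.paf | count_aligned` should work (`out.paf` is a placeholder file name).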

...

In the miniprot command, we relaxed some parameters to allow less homologous alignments and longer introns:

 -p FLOAT     min secondary-to-primary score ratio [0.7]      # we used 0.4
 --outs=FLOAT output if score at least FLOAT*bestScore [0.99] # we used 0.4
 -G NUM       max intron size; override -I [200k]             # we used 300k

With these modifications, miniprot produced more homologous protein alignment evidence for gene model prediction. The complete BUSCO score (hymenoptera_odb10) was 0.2 percentage points higher with miniprot alignments than with spaln alignments, with all other steps in our genome annotation pipeline identical:

C:97.3%[S:96.7%,D:0.6%],F:0.8%,M:1.9%,n:5991 # (1)
C:97.1%[S:96.4%,D:0.7%],F:0.8%,M:2.1%,n:5991 # (2) and (3)
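To put the 0.2-point gap in terms of gene counts, the two summary strings can be compared directly. This is a minimal sketch: the helper names `busco_complete` and `busco_diff` are hypothetical, and it assumes the one-line summary format shown above.

```shell
# Hypothetical helper: prints "<complete %> <n>" from a BUSCO one-line summary.
busco_complete() {
    printf '%s\n' "$1" | sed -E 's/^C:([0-9.]+)%.*n:([0-9]+).*$/\1 \2/'
}

# Hypothetical helper: difference in complete BUSCOs between two summaries.
busco_diff() {
    printf '%s\n%s\n' "$(busco_complete "$1")" "$(busco_complete "$2")" |
    awk 'NR == 1 { c1 = $1; n = $2 }
         NR == 2 { c2 = $1 }
         END { d = c1 - c2
               printf "diff=%.1f points (~%.0f of %d BUSCOs)\n", d, d * n / 100, n }'
}

busco_diff 'C:97.3%[S:96.7%,D:0.6%],F:0.8%,M:1.9%,n:5991' \
           'C:97.1%[S:96.4%,D:0.7%],F:0.8%,M:2.1%,n:5991'
```

With n = 5991, a 0.2-point difference corresponds to roughly 12 complete BUSCOs.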

We are now comparing gene models annotated and protein alignments by miniprot and spaln to see differences not captured in BUSCO evaluations.

...

My questions are

  1. Are there spaln parameters that allow reporting less homologous secondary alignments and longer introns similar to the miniprot command we used?
  2. Are there any other modifications to increase spaln sensitivity (i.e., more proteins aligned) you would recommend?

Thanks a lot!
Cheers,
Dong-Ha

Author

ohdongha commented Feb 6, 2025

> We are now comparing gene models annotated and protein alignments by miniprot and spaln to see differences not captured in BUSCO evaluations.

Upon visual inspection of the alignments on the Genome Data Viewer, we saw a pattern consistent with Figure 2a-c of the spaln v3 paper:

  • Often spaln v3 alignments covered the entire exon with boundaries matching the RefSeq gene models.
  • In some cases, miniprot reported partial alignments while spaln did not report any. Also, sometimes spaln misses exons towards the 5' ends.

Here are some GDV examples from chr16 of the wasp genome (all URLs stay live for ~90 days after the last visit):

...

I will now look into what happened with the BUSCOs that are "missing" only in the annotation based on spaln alignments (there were 9-10 of them).

In the meantime, it would be great if there were spaln options that allow more proteins to align, even at some cost in performance.

Thanks!
Dong-Ha

Owner

ogotoh commented Feb 6, 2025

Dear Dong-Ha,

Thank you for your reports, which are generally consistent with my experience.

The primary purpose of Spaln is to predict the complete structure (protein-coding region only) of the gene orthologous to the query. To improve mapping sensitivity, including for paralogs, the following options might be effective:
-M m (m > 1)
-XP p (p < 1.0)
and
-yX2
p has a similar function to miniprot's --outs option. Setting it to a value smaller than the default (1.0), say 0.5, could allow a larger number of paralogs, up to m, to be detected. Note, however, that a large m is not recommended, particularly for BUSCO data, because BUSCO genes are basically unique in each genome.

The -yX2 option is intended to find weaker homologs than the default (-yX1). However, I have not yet confirmed whether this intention is actually realized.

> Examples where miniprot had some alignments where spaln did not, or spaln missed the starting exons:

I have tried a few methods to improve the mapping sensitivity. Unfortunately, I have not yet found one that consistently outperforms the current method. As for the missing starting exons, there seems to be room for improvement. I would be very glad if you could send me a few such examples.

Finally, you can limit the maximum intron length with the -yMn option (e.g., -yM100K).
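For illustration only, the options mentioned in this reply could be combined with the command from the original post roughly as follows. This is an untested sketch, not a recommended configuration: the helper name `spaln_sensitive_cmd` is hypothetical, the values 4, 0.5 and 300K are arbitrary examples, `query.fa` as the last argument is an assumption, and it assumes option values attach without a space (as in -M25 and -yM100K). The helper prints the command instead of running it.

```shell
# Hypothetical helper: prints a spaln command line for review rather than
# executing it. All values passed in are illustrative assumptions.
spaln_sensitive_cmd() {
    max_hits="$1"     # m for -M (upper bound on reported hits, per the reply)
    score_ratio="$2"  # p for -XP (described as similar to miniprot --outs)
    max_intron="$3"   # n for -yM (maximum intron length, e.g. 100K)
    query="$4"        # query protein FASTA (assumed to be the last argument)
    printf 'spaln -Q7 -M%s -XP%s -yX2 -yM%s -d gnm -A0 -O2 -t8 %s\n' \
        "$max_hits" "$score_ratio" "$max_intron" "$query"
}

spaln_sensitive_cmd 4 0.5 300K query.fa
```

Printing the command first makes it easy to check the option string against the spaln usage message before launching a long run.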

Osamu
