Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

short gene for spaln output #39

Open
fangdm opened this issue Apr 23, 2021 · 4 comments
Open

short gene for spaln output #39

fangdm opened this issue Apr 23, 2021 · 4 comments

Comments

@fangdm
Copy link

fangdm commented Apr 23, 2021

Hi, many thanks for making and maintaining this great program. I've run spaln for a performance test, using human genome and its own proteins set with different size from 30 to 5000 (amino acids). It's a great tool! It runs quickly and less gene error messages. However, I noticed a problem that short genes (< 80 aa) had no result except for the following report:
ENSP00000248150 < 0 67 12 11607 11606 0.00 6.59 24 30 2 2 6 5
ENSP00000480507 > 0 40 12 12387 12388 0.00 0.00 18 11 1 0 3 3
ENSP00000300873 > 0 70 13 19963 19963 10.54 6.68 32 37 2 2 6 6
ENSP00000477297 > 0 50 10 3146 3146 17.06 0.00 35 23 1 1 5 3
ENSP00000412536 > 0 30 13 21852 21852 12.02 6.68 15 5 1 0 3 1

And I had tried to modify some parameters, for example: -LS, -Xk3, -Xs3 and -yx0, but there is still no any result, so how to solve the problem. Here are my commands:
makeidx.pl -ip Hsap.mfa.gz
spaln -O0 -Q7 -pw -T homosapi -t12 -M4 -dHsap Hsap_test1.pep > Hsap_test1.spaln.gff3

Thank you.

@ogotoh
Copy link
Owner

ogotoh commented Apr 26, 2021

Dear Fangdm,

Thank you for your comment.

Because of its intrinsic algorithmic nature, Spaln is rather poor at mapping short queries; Spaln is tuned to map relatively long queries, so that short queries are often escape from detection. I consider it difficult to run Spaln at the present speed for all range of query lengths. If your queries consist of only short sequences, it might deserve to try either or both of the following options:
-XS // examine all blocks with positive block score
-Xfn // n (0.1 < n < 0.4) : a factor for controlling threshold (minimum) block score
For example, assuming that fasta files of ENSP00000248150, ENSP00000480507, … are present in the current directory:

spaln -O0 -Q7 -pw -T homosapi -t12 -M4 -dHsap -XS -Xf0.2 ENSP00000248150, ENSP00000480507, … > Out.gff3

will retry to map the failed queries.
Note that I have not yet fully tested these options. It is expected that these options will considerably slowdown the execution rate. Additionally, the -XS option has been disabled in recent releases. I uploaded today a new release (Ver.2.4.4) that restores this option (formally -S instead of -XS. -S option is now used for another purpose).

In a future release, I would like to make Spaln automatically adjust proper options depending on the query length.

Osamu,

@fangdm
Copy link
Author

fangdm commented Apr 26, 2021

Hi, thanks for your reply and updates.

I've tried as you suggested with Ver2.4.4, but I get short genes with false coordinates, whether the short genes with short exons and/or long introns caused higher false positives. I've also tried changing -Xf 0.1 to -Xf 0.4, but it actually produced same result (no result when -Xf 0.1). Were there other parameters that I can do to improve the result for short genes?

Meanwhile, I had tried to annotate my de novo genome with spaln, but I cann’t find a close relative species in table/gnm2tab for -T parameter, how to generate my own species-specific subdirectory with the published relative species.

Thank you.

@ogotoh
Copy link
Owner

ogotoh commented Apr 27, 2021

Thank you for your rapid response. I was not much surprised with your negative report. Probably more fundamental change in the Spaln’s algorithm would be necessary to properly deal with short queries. For a while, I suggest you to use Tblastn or Blat to find relevant genomic loci, then apply Spaln to the found loci in the alignment only mode (-Q0-3).
Spaln -O0 -Q3 -pw -T homosapi -dHsap ‘$chromesome left right’ query

To generate a species-specific parameter set, we need a considerable number (at least 1000) of transcript (cDNA, EST, TSA (Transcriptome Shotgun Assembly)) sequences, together with the genomic sequence, of that species. Then,

  1. Run Spaln with the default parameter set with -O12 (binary output).
  2. Run Sortgrcd with -O15 -F2 options (collect high-quality splice junction pairs).
  3. Run a set of scripts to generate several files that contain splice junction signals, intron length distribution, and if necessary, other information/signals.
  4. Store thus generated files in a subdirectory, ‘tmp’.
  5. Repeat once more 1-4) using the -T tmp option.

The third part is somewhat messy and I have not enough time to write a manual. If you tell me the public site of the genomic and transcript sequences of the relative species, I will make the parameter set of that species.

Osamu,

@fangdm
Copy link
Author

fangdm commented Apr 28, 2021

Hi, many thanks for your kind help.

I've tried as you suggested with the commands: Spaln -O0 -Q3 -pw -T homosapi -dHsap ‘$chromesome left right’ query.
It kept reporting the errors: ‘$chromesome left right’ is not found!

But I'm very sure there was corresponding chromosomes in my genome sequence. And I used another way that I extracted the sequences near the target gene, and successfully predicted the corresponding genes. For this way, in my project, it also had great difficulties if the prediction with tblastn or blat were not accurate.
So, I’m looking forward to your subsequent updates for short genes.

For a species-specific parameter set, for me, it's a complex process that involves a series of steps, and it takes some experience to correct it. I'll trouble you later, HaHa.

Anyway, thanks again for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants