How to understand the output of sortgrd (-O1) #29

Open
lanyuchunmo opened this issue Sep 26, 2020 · 2 comments

Comments

@lanyuchunmo

I am a new user of spaln. I hope to find homologous genes between different species through sequence alignment. First, I identified many new mRNAs for my species and determined the ORF amino acid sequence of each mRNA. Next, I downloaded the FASTA genome files of several species and formatted the genomes with the "spaln -W" option. Following the manual, I aligned my species' ORFs to each of the formatted genomes, choosing -O12 as the output format. Finally, I used the "sortgrd" module to filter the spaln results, this time choosing -O1 as the output format. However, I could not find a description of this output format, and I cannot understand sortgrd's output. For example, I do not know how to interpret the line beginning with “@”:
@ NW_006890073.1 + ( 1494650 1509500 ) Chipl09869 565 ( 1 565 ) S: 2824.8 =: 84.2 C: 99.8 T#: 88 T-: 1 B#: 0 B-: 0 X: 0
Can anyone give me some hints?

@lanyuchunmo
Author

In addition, I would like to ask how to determine the filter criteria. Can the criteria for closely related and distantly related species be the same? When I run sortgrd with -I, -C, -E, and -H, which thresholds should be used for these parameters?

@ogotoh
Owner

ogotoh commented Sep 28, 2020

sortgrcd does not support the -O1 option, because the *.erd, *.grd, and *.qrd files produced by spaln with the -O12 option do not contain alignment (indel) information. To see the alignment, you must run spaln with -O1 rather than -O12.

The default (-O4) output is the ‘exon-oriented’ format. Each ordinary line describes the features of one exon. An ‘@’ line carries information about one transcript and summarizes its alignment to the genome. In the example above, the alignment is between the plus strand of region 1494650 to 1509500 of contig NW_006890073.1 and transcript Chipl09869; the alignment score (S) is 2824.8, sequence identity (=) is 84.2%, coverage (C) is 99.8%, the total number of mismatches (T#) is 88, the total number of blank (gap) characters (T-) is 1, the number of mismatches around splice junctions (B#, within the 10 bp flanking regions) is 0, the number of gaps around splice junctions (B-) is 0, and the number of non-canonical splice junctions (X) is 0. A ‘!’ line delineates a gene, each of which may express multiple isoforms.
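For anyone who wants to read these ‘@’ lines programmatically, the field-by-field explanation above can be sketched as a small Python parser. This is only an illustration, not part of spaln: the function name `parse_at_line` and the dictionary keys are made up here, and the meaning of the `565` and `( 1 565 )` fields is not stated in the reply, so the sketch merely assumes they are the query length and the aligned query range.

```python
import re

# Layout of an '@' line, taken from the example in this thread.
AT_LINE = re.compile(
    r"^@\s+(?P<genome>\S+)\s+(?P<strand>[+-])\s+"
    r"\(\s*(?P<g_start>\d+)\s+(?P<g_end>\d+)\s*\)\s+"
    r"(?P<query>\S+)\s+(?P<query_len>\d+)\s+"
    r"\(\s*(?P<q_start>\d+)\s+(?P<q_end>\d+)\s*\)\s+"
    r"(?P<stats>.*)$"
)

# Readable names for the "label: value" pairs, following the reply above.
LABELS = {
    "S": "score",
    "=": "identity",
    "C": "coverage",
    "T#": "mismatches",
    "T-": "gaps",
    "B#": "boundary_mismatches",
    "B-": "boundary_gaps",
    "X": "noncanonical_junctions",
}


def parse_at_line(line):
    """Parse one '@' summary line into a dict (hypothetical helper)."""
    m = AT_LINE.match(line.strip())
    if m is None:
        raise ValueError("not an '@' summary line: %r" % line)
    rec = {
        "genome": m.group("genome"),
        "strand": m.group("strand"),
        "genome_range": (int(m.group("g_start")), int(m.group("g_end"))),
        "query": m.group("query"),
        # Assumption: query length and aligned query range.
        "query_len": int(m.group("query_len")),
        "query_range": (int(m.group("q_start")), int(m.group("q_end"))),
    }
    # Collect the trailing "label: value" pairs, e.g. "S: 2824.8".
    for label, value in re.findall(r"(\S+):\s+(\S+)", m.group("stats")):
        rec[LABELS.get(label, label)] = float(value)
    return rec


example = ("@ NW_006890073.1 + ( 1494650 1509500 ) Chipl09869 565 ( 1 565 ) "
           "S: 2824.8 =: 84.2 C: 99.8 T#: 88 T-: 1 B#: 0 B-: 0 X: 0")
record = parse_at_line(example)
print(record["identity"])  # 84.2
```

Unknown labels are kept under their original name, so the sketch should not silently drop fields that other spaln versions might emit.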

You should consider two factors: (1) the evolutionary distance between the genome and the transcripts, and (2) the quality of the sequences, especially the completeness of sequencing and assembly of the genomic sequence. For DNA/RNA queries, I prefer the sortgrcd -F2 option, which sets moderately severe filtering conditions (e.g. C=93, P=93, etc.). For protein queries, less stringent (or even no) filtering is appropriate.
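To get a feel for how strict a given pair of thresholds is, one can re-check ‘@’ lines outside sortgrcd. The Python sketch below is not sortgrcd's own filter. The function `passes_thresholds` and its parameters are hypothetical, and it assumes, purely for illustration, that the two thresholds apply to the identity (`=:`) and coverage (`C:`) values on the ‘@’ line.

```python
import re


def passes_thresholds(at_line, min_identity=93.0, min_coverage=93.0):
    """Return True if an '@' line meets both thresholds (illustrative only)."""
    m_id = re.search(r"=:\s*([\d.]+)", at_line)
    m_cov = re.search(r"\bC:\s*([\d.]+)", at_line)
    if m_id is None or m_cov is None:
        raise ValueError("identity or coverage field not found")
    identity = float(m_id.group(1))
    coverage = float(m_cov.group(1))
    return identity >= min_identity and coverage >= min_coverage


line = ("@ NW_006890073.1 + ( 1494650 1509500 ) Chipl09869 565 ( 1 565 ) "
        "S: 2824.8 =: 84.2 C: 99.8 T#: 88 T-: 1 B#: 0 B-: 0 X: 0")

strict = passes_thresholds(line)                      # 93 / 93
relaxed = passes_thresholds(line, min_identity=80.0)  # looser identity cutoff
print(strict, relaxed)  # False True
```

With the example from this thread (a protein query, 84.2% identity), the stricter setting rejects the alignment while the relaxed one keeps it, which is consistent with the advice to filter protein queries less stringently.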
