How to understand the output of sortgrd (-O1) #29

lanyuchunmo · 2020-09-26T09:50:12Z

I am a new user of spaln. I hope to find homologous genes between different species through sequence alignment. First, I identified many new mRNAs for my species and identified the ORF amino acid sequence of each mRNA. Next, I downloaded the FASTA genome files of several species and formatted the genomes with the "spaln -W" option. Following the manual, I aligned my species' ORFs to each of the formatted genomes, choosing -O12 as the output format. Lastly, I used the "sortgrd" module to filter the spaln results, this time choosing -O1 as the output format. However, I haven't found a description of this output format, and I can't understand the meaning of sortgrd's output. For example, I don't know how to parse a line beginning with "@":
@ NW_006890073.1 + ( 1494650 1509500 ) Chipl09869 565 ( 1 565 ) S: 2824.8 =: 84.2 C: 99.8 T#: 88 T-: 1 B#: 0 B-: 0 X: 0
Can anyone give me some hints?

lanyuchunmo · 2020-09-26T10:00:12Z

In addition, I would like to ask how to determine the filter criteria. Can the criteria for closely related species and distant species be the same? When I run sortgrd, which thresholds should be used for the -I, -C, -E, and -H parameters?

ogotoh · 2020-09-28T02:17:32Z

Sortgrcd does not support the -O1 option, because the *.erd, *.grd, and *.qrd files produced by spaln with the -O12 option do not contain alignment (indel) information. To see the alignment, you must run spaln with the -O1 option rather than -O12.

The default (-O4) output is the 'exon-oriented' format. Each ordinary line represents the features of one exon. An '@' line carries the information for one transcript and summarizes its alignment to the genome. The above example means that the alignment was obtained between the plus strand of region 1494650 to 1509500 of contig NW_006890073.1 and transcript Chipl09869; the alignment score (S) is 2824.8, sequence identity (=) is 84.2%, coverage (C) is 99.8%, the total number of mismatches (T#) is 88, the total number of blank (gap) characters (T-) is 1, the number of mismatches around splicing junctions (B#, within the 10 bp flanking regions) is 0, the number of gaps around splicing junctions (B-) is 0, and the number of non-canonical splicing junctions (X) is 0. A '!' line marks a gene, which may express multiple isoforms.
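For readers who want to process these summary lines in a script, here is a minimal sketch of an '@' line parser in Python. It is not part of spaln or sortgrcd. The field meanings follow the explanation above, but the dictionary key names are my own labels, and reading the number after the transcript name (565) as the query length and the following pair as the aligned query range is an assumption that the thread does not confirm.

```python
import re

# Coordinates and names that precede the statistics on an '@' line.
# Group names are illustrative labels, not spaln terminology.
AT_LINE = re.compile(
    r"^@\s+(?P<genome_seq>\S+)\s+(?P<strand>[+-])\s+"
    r"\(\s*(?P<genome_start>\d+)\s+(?P<genome_end>\d+)\s*\)\s+"
    r"(?P<query>\S+)\s+(?P<query_len>\d+)\s+"  # query_len: assumed meaning
    r"\(\s*(?P<query_start>\d+)\s+(?P<query_end>\d+)\s*\)\s+"
    r"(?P<stats>.*)$"
)

# "TAG: value" pairs such as "S: 2824.8" or "T#: 88".
STAT = re.compile(r"(\S+):\s*(-?\d+(?:\.\d+)?)")

# Tag meanings as described in the reply above.
LABELS = {
    "S": "score",
    "=": "identity_pct",
    "C": "coverage_pct",
    "T#": "mismatches",
    "T-": "gaps",
    "B#": "junction_mismatches",
    "B-": "junction_gaps",
    "X": "noncanonical_junctions",
}

INT_FIELDS = ("genome_start", "genome_end", "query_len",
              "query_start", "query_end")


def parse_at_line(line):
    """Return a dict for one '@' summary line, or None if it does not match."""
    match = AT_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    stats = record.pop("stats")
    for key in INT_FIELDS:
        record[key] = int(record[key])
    for tag, raw in STAT.findall(stats):
        name = LABELS.get(tag, tag)  # keep unknown tags under their raw name
        record[name] = float(raw) if "." in raw else int(raw)
    return record


example = ("@ NW_006890073.1 + ( 1494650 1509500 ) Chipl09869 565 ( 1 565 ) "
           "S: 2824.8 =: 84.2 C: 99.8 T#: 88 T-: 1 B#: 0 B-: 0 X: 0")
record = parse_at_line(example)
print(record["genome_seq"], record["strand"], record["score"])
```

Lines that do not start with '@' (exon lines and '!' lines) return None, so the function can be applied to every line of the output and the None results skipped.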

You should consider two factors: (1) the evolutionary distance between the genome and the transcripts, and (2) the quality of the sequences, especially the completeness of sequencing and assembly of the genomic sequence. For DNA/RNA queries, I prefer to use the sortgrcd -F2 option, which sets moderately severe filtering conditions (e.g. C=93, P=93, etc.). For protein queries, less stringent (or even no) filtering is appropriate.
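To make the effect of a threshold concrete, here is a small Python sketch that checks an '@' line against minimum coverage and identity values. It is a post-hoc illustration only and does not reproduce what sortgrcd -F2 does internally. The function name and parameters are hypothetical; the default of 93 for coverage mirrors the C=93 quoted above, and applying the same number to the identity field (=) is my assumption.

```python
import re

# "TAG: value" pairs on an '@' line, e.g. "C: 99.8" or "=: 84.2".
STAT = re.compile(r"(\S+):\s*(-?\d+(?:\.\d+)?)")


def passes_thresholds(at_line, min_coverage=93.0, min_identity=93.0):
    """Hypothetical helper: True if the line's C and = values meet the minima."""
    stats = {tag: float(raw) for tag, raw in STAT.findall(at_line)}
    if "C" not in stats or "=" not in stats:
        return False  # not a summary line, or fields missing
    return stats["C"] >= min_coverage and stats["="] >= min_identity


example = ("@ NW_006890073.1 + ( 1494650 1509500 ) Chipl09869 565 ( 1 565 ) "
           "S: 2824.8 =: 84.2 C: 99.8 T#: 88 T-: 1 B#: 0 B-: 0 X: 0")

strict = passes_thresholds(example)                      # DNA/RNA-style cutoffs
relaxed = passes_thresholds(example, min_identity=0.0)   # coverage only
print(strict, relaxed)
```

The protein alignment from the question has 84.2% identity, so it fails the strict identity cutoff but passes once that cutoff is relaxed, which is consistent with the advice to filter protein queries less stringently.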
