Thank you for your comment, and apologies for the long delay in responding.
As discussed in issue #36, this problem occurs because spaln sometimes "reads through" the boundary between the relevant entry (chromosome, supercontig, or contig) and the following (unrelated) one in the genomic sequence file, particularly for poorly assembled genomes. I think the rate of this phenomenon has been much reduced since the days of #36, but I still cannot completely eliminate the issue. As you recognize, this issue is relatively harmless, and you may simply ignore these cases. I will keep it in mind and aim to resolve the problem completely in the future.
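For anyone who wants to count the affected records before deciding to ignore them, here is a minimal Python sketch, not part of spaln. It assumes the GFF output is tab-delimited and carries `##sequence-region` header lines like the one quoted in this thread; the function name `find_out_of_bounds` is made up for illustration.

```python
def find_out_of_bounds(gff_lines):
    """Return (seqid, type, start, end, region_end) for every feature that
    falls outside the bounds declared by its ##sequence-region line."""
    regions = {}  # seqid -> (declared start, declared end)
    hits = []
    for raw in gff_lines:
        line = raw.rstrip("\r\n")
        if line.startswith("##sequence-region"):
            # Expected form: ##sequence-region <seqid> <start> <end>
            parts = line.split()
            if len(parts) >= 4:
                regions[parts[1]] = (int(parts[2]), int(parts[3]))
            continue
        if not line or line.startswith("#"):
            continue
        # Assumption: feature lines have the nine tab-separated GFF columns.
        fields = line.split("\t")
        if len(fields) < 9 or fields[0] not in regions:
            continue
        start, end = int(fields[3]), int(fields[4])
        region_start, region_end = regions[fields[0]]
        if start < region_start or end > region_end:
            hits.append((fields[0], fields[2], start, end, region_end))
    return hits
```

It can be run on an open file handle, e.g. `find_out_of_bounds(open("annotation.gff"))`, where `annotation.gff` is a placeholder for the actual output file.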
Dear Osamu,
thank you for making, maintaining and supporting this great program.
I was hoping you could help me with the following issue I ran into. I get an alignment that is longer than the scaffold on which it sits.
I am annotating a fragmented draft whole-genome assembly, using a well-annotated proteome as reference. A problem I discovered is that in one case, spaln outputs an mRNA in the gff file which is longer than the short scaffold on which it is located.
I am working with butterflies. This is my command:
spaln -Q7 -LS -T helimelp -O0,4,7,3 -S3 -M4 -t7 -d $genome $protein_file >> $logfile 2>&1
The $genome is the draft genome assembly, the $protein_file is the proteome reference (I've copied the relevant scaffold and protein below).
This is what it says in the spaln stdout / log:
Bad gene coord: scaffold174235_size896_RagTag 0 0 915 129 XP_039748902.1
And this is the snippet of output I get in the .O0 output file (gff file), showing the mRNA of 915nt on the scaffold of 896nt:
##sequence-region scaffold174235_size896_RagTag 1 896
scaffold174235_size896_RagTag ALN gene 1 915 8123 + . ID=gene06393;Name=scaffold174235_size896_RagTag_0
scaffold174235_size896_RagTag ALN mRNA 1 915 8123 + . ID=mRNA06393;Parent=gene06393;Name=scaffold174235_size896_RagTag_0
scaffold174235_size896_RagTag ALN cds 1 915 8123 + 0 ID=cds27801;Parent=mRNA06393;Name=scaffold174235_size896_RagTag_0;Target=XP_039748902.1 1 305 +
(These are 3 lines + a header, don't know why the last 2 lines wrap).
This only happens (I think!) for one gene, so I'm happy to ignore it - unless you think it points to a larger problem with my approach, for instance using scaffolds that are too short. Would you advise filtering out scaffolds e.g. <1kb?
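In case it helps to make the question concrete, this is roughly the kind of length filter I have in mind: a minimal Python sketch of my own, not something spaln provides or requires. The 1000 bp cutoff and the function names are just illustrative assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep only records whose sequence is at least min_len bases long.
    min_len=1000 is an arbitrary example threshold, not a recommendation."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq
```

With the scaffold pasted below (896 nt), `filter_by_length(read_fasta(open(genome_path)))` would drop it; `genome_path` stands for the assembly FASTA file.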
Any thoughts?
Many thanks, and all the best,
Vicencio
>scaffold174235_size896_RagTag
ATGAACATCGCCGTGGCCACGCTCATCCAGAGTTTGGCGCCGCTCAACTCGGCCGCTAACCCCCTCATATGCTGCATGTTCTCTCCGCACATATACGCCAGCCTCAAGTAAGAGTCACTGCAGGGGTGCCACCAGATCCCCATGACGCAGATGAACATCGCCGTGGCCACGCTCATCCAGAGTTTGGCGCCGCTCAACTCAGCCGCTAACCCACTCATATACTGCACGTTCTCTCCGCACCTATACGCCAGCCTCAAGTAAGAGTAATTGCTGGGGCGTCACCAGATCCCCATGACGCAATTGAACATCACCGTGGTTCTTTCCCTTTTATGGTTTCGTACCCAAAGGCTAAAACCGGTCTCCATTACTTCATTTATGTCTTCGCTGTCCGTTCATCCGACTGTCTACATGTTGTATCTCATGAACCGTGATTTCAACAAATGTTTTTTTACGGCAATAGGTAGACGCTTGAAATATTGATGGAATATTGAGTTATATTAAGACACTTTAAAAACAAATAATTAATTATATGTTAAATAAAATTAATATTGTAATGGGGCTCTCATACATTTACTGTTTATTGTTTATATAATTGACTTTTATAAGTAGGTATTAGGGTAATATAATTATCATTGACATCCTATAGTAATCGTTACACAGTGAAATATTATAATCCAACAGATACTTGTAAGTATTGGAGTAGTGTGGTGGATCTAAGCTCTATGTTTCCTCTACTATAAGGTGGGAGGCACAGCTTTTGGGCTGTTAAAATACACTAATAGACAATGATGACTTGATGAGTTTCTCATTTGAAGTTGCATTGTGATGTGTCGGTGAAAATAACACACATGTATAATTATTTCAGACGCGTGCCACCGTACCGCTGGTTCTGGTGC
>XP_039748902.1 #query_protein_from_reference_proteome
MDPELTLEQLEEVTTTLPFQTNATQNLNVYYFYDSIQFTVMWVLFVLIVVLNTSVIAALLCTNARKSRMNFFIMHLAIADLFVGLIYVFVDIVQKITIAWLAGEFLCKVIKFLQAVVMYASTYVLVALSIDRCDAITNPMNFSRSWNRARALIVSAWIISTVFSIPILILYEIKEVQGQLQCWIELGTARRWQIWMTLVSIMIFVLPALAIAACYAVIVLTIWTKSKAVVMSPPMNNKRGKMRTGQVECDPDSRRASSRGLIPRAKIKSVKMTFVIVFVFVLCWSPYIVFDLLQVYHQIPMTQMNIAIATLIQSLAPLNSAANPLICCMFSPHIYASLKRVPPYRWFWCGTQRRRHVRGTARSRSDSTGHSDLLSSTHARRSHSVAMILNRTRGNSISRSQEQRRTHMVTLPLRD