Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

Tutorial: APIs for reading SAM, GFF, etc...

Elizabeth Tseng edited this page May 30, 2018 · 8 revisions

Last Updated: 05/30/2018

You can download the demo data here to try out the following Cupcake modules.

NOTE: The tutorials use BioPython, which can you install using the command pip install biopython if you don't already have it.



Reading SAM files

The following SAM reader is designed to work with GMAP SAM output or any SAM output that has the same structure as GMAP SAM outputs.

>>> from cupcake.io.BioReaders import GMAPSAMReader
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True)
>>> r = reader.next()

        qID: i0HQ|c58287/f1p3/338
        sID: chr1
        cigar: 112M1D226M
        sStart-sEnd: 106761814-106762153
        qStart-qEnd: 0-338
        segments: [Interval(start=106761814, end=106762153)]
        flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
        
        coverage (of query): None
        coverage (of subject): None
        alignment identity: 0.991150442478

>>> from Bio import SeqIO
>>> query_len_dict = dict((r.id, len(r.seq)) for r in SeqIO.parse(open('demo.quivered_hq.fastq'),'fastq'))
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True, \
query_len_dict=query_len_dict)
>>> r = reader.next()


        qID: i0HQ|c58287/f1p3/338
        sID: chr1
        cigar: 112M1D226M
        sStart-sEnd: 106761814-106762153
        qStart-qEnd: 0-338
        segments: [Interval(start=106761814, end=106762153)]
        flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
        
        coverage (of query): 1.0
        coverage (of subject): None
        alignment identity: 0.991150442478

>>> r.qCoverage

1.0

Comparing two isoforms

Test data: test1.gff and test2.gff

This is the underlying API used for the chaining sample script. The script compares two spliced isoforms r1 and r2 (must be of the same chromosome) and compares the splice junctions of the two and categorizes the match as:

  • exact --- there is a one-to-one match for each splice junction between r1 and r2
  • super --- r1 has more junctions than r2 but for those that are shared, they agree
  • subset --- r2 has more junctions than r1 but for those that are shared, they agree
  • partial --- r1 and r2 share some identical junctions but each have at least one additional junction that the other one doesn't
  • nomatch --- no junction match between r1 and r2

Note that compare_junctions.compare_junctions does not care about the differences in 5' start or the 3' end. It only cares about the splice junctions. Also, given the above categorization, it does not matter whether the transcript is on the plus or minus strand.

You can download the test1.gff and test2.gff and test the following.

from cupcake.io import GFF
from cupcake.tofu.compare_junctions import compare_junctions

reader1 = GFF.collapseGFFReader('test1.gff')
reader2 = GFF.collapseGFFReader('test2.gff')
recs1 = dict((r.seqid,r) for r in reader1)
recs2 = dict((r.seqid,r) for r in reader2)

r1 = recs1['PB.979.1']
r2 = recs2['PB.1055.1']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106213, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57118745, 57119058)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'exact'

Here, PB.979.1 from test1 and PB.1055.1 from test2 have the same junctions, so they are categorized as exact match.

We next compare PB.979.1 with PB.1055.2, where the former is a super of the latter because it has more exons but every shared exon agrees on the junctions.

r2 = recs2['PB.1055.2']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106219, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108164)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'super'

Finally we compare PB.979.2 with PB.1055.3. Here I have intentionally modified PB.1055.3 to differ from PB.979.2 on the first two (3'-end) exons. The junction on PB.979.2 is 57106336-57106569 whereas for PB.1055.3 it is 57106339-57106568. You can see that there is an extra option in compare_junctions where you can allow fuzzy matches in the junctions.

r1 = recs1['PB.979.2']
r2 = recs2['PB.1055.3']
r1.ref_exons
#[Out]# [Interval(57106221, 57106336),
#[Out]#  Interval(57106569, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57119046, 57119093)]
r2.ref_exons
#[Out]# [Interval(57106221, 57106339),
#[Out]#  Interval(57106568, 57106692),
#[Out]#  Interval(57106845, 57106974),
#[Out]#  Interval(57107320, 57107467),
#[Out]#  Interval(57108145, 57108223),
#[Out]#  Interval(57108385, 57108471),
#[Out]#  Interval(57118235, 57118307),
#[Out]#  Interval(57119046, 57119077)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'partial'
compare_junctions.compare_junctions(r1, r2, internal_fuzzy_max_dist=5)
#[Out]# 'exact'