-
Notifications
You must be signed in to change notification settings - Fork 103
Tutorial: APIs for reading SAM, GFF, etc...
Last Updated: 05/30/2018
You can download the demo data here to try out the following Cupcake modules.
NOTE: The tutorials use BioPython
, which can you install using the command pip install biopython
if you don't already have it.
The following SAM reader is designed to work with GMAP SAM output or any SAM output that has the same structure as GMAP SAM outputs.
>>> from cupcake.io.BioReaders import GMAPSAMReader
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True)
>>> r = reader.next()
qID: i0HQ|c58287/f1p3/338
sID: chr1
cigar: 112M1D226M
sStart-sEnd: 106761814-106762153
qStart-qEnd: 0-338
segments: [Interval(start=106761814, end=106762153)]
flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
coverage (of query): None
coverage (of subject): None
alignment identity: 0.991150442478
>>> from Bio import SeqIO
>>> query_len_dict = dict((r.id, len(r.seq)) for r in SeqIO.parse(open('demo.quivered_hq.fastq'),'fastq'))
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True, \
query_len_dict=query_len_dict)
>>> r = reader.next()
qID: i0HQ|c58287/f1p3/338
sID: chr1
cigar: 112M1D226M
sStart-sEnd: 106761814-106762153
qStart-qEnd: 0-338
segments: [Interval(start=106761814, end=106762153)]
flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
coverage (of query): 1.0
coverage (of subject): None
alignment identity: 0.991150442478
>>> r.qCoverage
1.0
Test data: test1.gff and test2.gff
This is the underlying API used for the chaining sample script. The script compares two spliced isoforms r1
and r2
(must be of the same chromosome) and compares the splice junctions of the two and categorizes the match as:
- exact --- there is a one-to-one match for each splice junction between
r1
andr2
- super ---
r1
has more junctions thanr2
but for those that are shared, they agree - subset ---
r2
has more junctions thanr1
but for those that are shared, they agree - partial ---
r1
andr2
share some identical junctions but each have at least one additional junction that the other one doesn't - nomatch --- no junction match between
r1
andr2
Note that compare_junctions.compare_junctions
does not care about the differences in 5' start or the 3' end. It only cares about the splice junctions. Also, given the above categorization, it does not matter whether the transcript is on the plus or minus strand.
You can download the test1.gff and test2.gff and test the following.
from cupcake.io import GFF
from cupcake.tofu.compare_junctions import compare_junctions
reader1 = GFF.collapseGFFReader('test1.gff')
reader2 = GFF.collapseGFFReader('test2.gff')
recs1 = dict((r.seqid,r) for r in reader1)
recs2 = dict((r.seqid,r) for r in reader2)
r1 = recs1['PB.979.1']
r2 = recs2['PB.1055.1']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106213, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57118745, 57119058)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'exact'
Here, PB.979.1
from test1 and PB.1055.1
from test2 have the same junctions, so they are categorized as exact match.
We next compare PB.979.1
with PB.1055.2
, where the former is a super of the latter because it has more exons but every shared exon agrees on the junctions.
r2 = recs2['PB.1055.2']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106219, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108164)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'super'
Finally we compare PB.979.2
with PB.1055.3
. Here I have intentionally modified PB.1055.3
to differ from PB.979.2
on the first two (3'-end) exons. The junction on PB.979.2
is 57106336-57106569 whereas for PB.1055.3
it is 57106339-57106568. You can see that there is an extra option in compare_junctions
where you can allow fuzzy matches in the junctions.
r1 = recs1['PB.979.2']
r2 = recs2['PB.1055.3']
r1.ref_exons
#[Out]# [Interval(57106221, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57119046, 57119093)]
r2.ref_exons
#[Out]# [Interval(57106221, 57106339),
#[Out]# Interval(57106568, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57119046, 57119077)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'partial'
compare_junctions.compare_junctions(r1, r2, internal_fuzzy_max_dist=5)
#[Out]# 'exact'