-
Notifications
You must be signed in to change notification settings - Fork 103
Tutorial: APIs for reading SAM, GFF, etc...
Last Updated: 12/7/2020
A random collection of useful APIs or little tricks to work with SAM/BAM, GTF/GFF, etc.
You need to have installed Cupcake to use following API.
- Subsetting a PacBio BAM file based on a list of wanted ZMWs
- Reading SAM files
- Comparing two isoforms
Here is a Python code snippet you can modify to get a list of ZMWs you want.
For PacBio HiFi (CCS) BAM files, the variable r.qname
will contain the ZMW name like:
`
m64049_200801_085545/12/ccs m64049_200801_085545/13/ccs m64049_200801_085545/14/ccs ...
If your CCS BAM file only contains reads from the same run (same movie name), you can just make a wanted list that is just the ZMWs, like:
12 13 14
In this case using `r.qname.split('/')[1]` will extract just the ZMW number.
If you have multiple movie names and need to be more explicit, such as:
movie1/12 movie2/12 movie1/13
Then change `r.qname.split('/')[1]` to `r.qname[:r.qname.rfind('/')]` will extract `movie/zmw` instead.
import pysam bam_filename = 'ccs.bam' touse = [line.strip() for line in open('zmws_to_use.txt')] f = pysam.AlignmentFile('ccs.bam', 'wb', header=reader.header) reader = pysam.AlignmentFile(bam_filename, 'rb', check_sq=False) for r in reader: if r.qname.split('/')[1] in touse: print(r.qname.split('/')[1]) f.write(r) f.close()
<a name="readsam"/>
## Reading SAM files
The following SAM reader is designed to work with GMAP SAM output or any SAM output that has the same structure as GMAP SAM outputs.
```python
>>> from cupcake.io.BioReaders import GMAPSAMReader
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True)
>>> r = reader.next()
qID: i0HQ|c58287/f1p3/338
sID: chr1
cigar: 112M1D226M
sStart-sEnd: 106761814-106762153
qStart-qEnd: 0-338
segments: [Interval(start=106761814, end=106762153)]
flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
coverage (of query): None
coverage (of subject): None
alignment identity: 0.991150442478
>>> from Bio import SeqIO
>>> query_len_dict = dict((r.id, len(r.seq)) for r in SeqIO.parse(open('demo.quivered_hq.fastq'),'fastq'))
>>> reader = GMAPSAMReader('demo.quivered_hq.fastq.sam', \
has_header=True, \
query_len_dict=query_len_dict)
>>> r = reader.next()
qID: i0HQ|c58287/f1p3/338
sID: chr1
cigar: 112M1D226M
sStart-sEnd: 106761814-106762153
qStart-qEnd: 0-338
segments: [Interval(start=106761814, end=106762153)]
flag: SAMflag(is_paired=False, strand='+', PE_read_num=0)
coverage (of query): 1.0
coverage (of subject): None
alignment identity: 0.991150442478
>>> r.qCoverage
1.0
Test data: test1.gff and test2.gff
This is the underlying API used for the chaining sample script. The script compares two spliced isoforms r1
and r2
(must be of the same chromosome) and compares the splice junctions of the two and categorizes the match as:
- exact --- there is a one-to-one match for each splice junction between
r1
andr2
- super ---
r1
has more junctions thanr2
but for those that are shared, they agree - subset ---
r2
has more junctions thanr1
but for those that are shared, they agree - partial ---
r1
andr2
share some identical junctions but each have at least one additional junction that the other one doesn't - nomatch --- no junction match between
r1
andr2
Note that compare_junctions.compare_junctions
does not care about the differences in 5' start or the 3' end. It only cares about the splice junctions. Also, given the above categorization, it does not matter whether the transcript is on the plus or minus strand.
You can download the test1.gff and test2.gff and test the following.
from cupcake.io import GFF
from cupcake.tofu.compare_junctions import compare_junctions
reader1 = GFF.collapseGFFReader('test1.gff')
reader2 = GFF.collapseGFFReader('test2.gff')
recs1 = dict((r.seqid,r) for r in reader1)
recs2 = dict((r.seqid,r) for r in reader2)
r1 = recs1['PB.979.1']
r2 = recs2['PB.1055.1']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106213, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57118745, 57119058)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'exact'
Here, PB.979.1
from test1 and PB.1055.1
from test2 have the same junctions, so they are categorized as exact match.
We next compare PB.979.1
with PB.1055.2
, where the former is a super of the latter because it has more exons but every shared exon agrees on the junctions.
r2 = recs2['PB.1055.2']
r1.ref_exons
#[Out]# [Interval(57106220, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57118745, 57119054)]
r2.ref_exons
#[Out]# [Interval(57106219, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108164)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'super'
Finally we compare PB.979.2
with PB.1055.3
. Here I have intentionally modified PB.1055.3
to differ from PB.979.2
on the first two (3'-end) exons. The junction on PB.979.2
is 57106336-57106569 whereas for PB.1055.3
it is 57106339-57106568. You can see that there is an extra option in compare_junctions
where you can allow fuzzy matches in the junctions.
r1 = recs1['PB.979.2']
r2 = recs2['PB.1055.3']
r1.ref_exons
#[Out]# [Interval(57106221, 57106336),
#[Out]# Interval(57106569, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57119046, 57119093)]
r2.ref_exons
#[Out]# [Interval(57106221, 57106339),
#[Out]# Interval(57106568, 57106692),
#[Out]# Interval(57106845, 57106974),
#[Out]# Interval(57107320, 57107467),
#[Out]# Interval(57108145, 57108223),
#[Out]# Interval(57108385, 57108471),
#[Out]# Interval(57118235, 57118307),
#[Out]# Interval(57119046, 57119077)]
r1.segments = r1.ref_exons
r2.segments = r2.ref_exons
compare_junctions.compare_junctions(r1, r2)
#[Out]# 'partial'
compare_junctions.compare_junctions(r1, r2, internal_fuzzy_max_dist=5)
#[Out]# 'exact'