-
Notifications
You must be signed in to change notification settings - Fork 103
Sequence Manipulation Wiki
Last Updated: 10/02/2019
To use this repo (just the sequence/
scripts, at least), simply add the directory to your PYTHONPATH and PATH. No compilation is required.
export PYTHONPATH=$PYTHONPATH:<path_to_Cupcake>/sequence
export PATH=$PATH:<path_to_Cupcake>/sequence
get_seq_stats.py
: Summarize length distribution of a FASTA/FASTQ file.fa2fq.py
andfq2fa.py
: Convert between FASTA and FASTQ format.rev_comp.py
: Reverse complement a sequence from command line.sort_fasta_by_len.py
: sort fasta file by length (increasing or decreasing).-
get_seqs_from_list.py
: extract list of sequences given a fasta file and a list of IDs. -
err_correct_w_genome.py
: generate fasta sequences given genome and SAM file. -
calc_expected_accuracy_from_fastq.py
: Calculate expected accuracy from FASTQ file. -
filter_lq_isoforms.py
: Filter LQ isoforms using customized FL count and expected accuracy cutoff. -
sam_to_bam.py
: Convert SAM to BAM using samtools. -
sam_to_gff3.py
: Convert SAM to GFF3 format using BioPython and BCBio GFF library. -
group_ORF_sequences.py
: Group identical ORF sequences.
Usage:
get_seq_stats.py <fasta_or_fastq_filename>
Example:
$ get_seq_stats.py test.fa
file type is: fasta
50 sequences
min: 387
max: 7394
avg: 1906.34
Length Breakdown by kb range:
0-1 kb: 17
1-2 kb: 10
2-3 kb: 18
3-4 kb: 3
4-5 kb: 0
5-6 kb: 0
6-7 kb: 0
7-8 kb: 2
Usage:
fa2fq.py <fasta_filename>
fq2fa.py <fastq_filename>
Example:
fa2fq.py test.fasta
fq2fa.py test.fastq
Usage:
revcomp.py <seq1> [seq2] [seq3] [seq4]
Example:
$ revcomp.py AAACCT
AGGTTT
$ revcomp.py AAACCT GGCCGAA
AGGTTT
TTCGGCC
Usage:
sort_fasta_by_len.py [-r] <fasta_filename>
Example:
$ sort_fasta_by_len.py test.fa (sort by increasing length)
$ sort_fasta_by_len.py -r test.fa (sort by decreasing length)
Usage:
get_seqs_from_list.py <fasta_filename> <list_filename>
list_filename
should be a text file where each line is a sequence ID to be found in the fasta file.
Example:
get_seqs_from_list.py test.fasta list.fasta > extracted.fasta
NOTE: for this script to properly work (since it depends on other scripts in the same directory), you must add to your PATH variable the sequence/
subdirectory of the repo. For example:
export PATH=$PATH:<path_to_Cupcake>/sequence/
Usage:
err_correct_w_genome.py <genome_fasta> <SAM_file> <output_fasta_filename>
genome_fasta
should be the genomic fasta file (ex: hg38.fa
), but note that the program reads the entire genome so it may take a while and you must have enough memory!
SAM_file
is the SAM alignment file from which to reconstitute your genomic sequences. A common example would be the GMAP SAM output of a PacBio transcriptome dataset. The SAM file does not need to be sorted.
Example (including an example GMAP command to first generate the SAM file):
gmap -D /home/gmap_db -d hg38 -f samse -n 0 -t 12 input.fasta > input.fasta.sam 2> input.fasta.sam.log
err_correct_w_genome.py hg38.fa input.fasta.sam input.hg38_corrected.fasta
This is for calculating expected accuracy based on FASTQ QV. Useful for looking at expected accuracy in low-quality (LQ) isoforms.
usage: calc_expected_accuracy_from_fastq.py [-h] [--qv_trim_5 QV_TRIM_5]
[--qv_trim_3 QV_TRIM_3]
fastq_filename output_filename
An example usage would be:
python calc_expected_accuracy_from_fastq.py lq_isoforms.fastq lq_isoforms.with_exp_acc.fastq
This would add an extra field at the end of the sequence header, for example:
i0_LQ_sample32f89e|c21/f13p0/373 isoform=c21; \
full_length_coverage=13; \
non_full_length_coverage=0; \
isoform_length=373; \
expected_accuracy=0.365
IMPORTANT: The FASTQ sequence IDs must already have expected_accuracy=
. If not, please run calc_expected_accuracy_from_fastq.py
first to modify the sequence IDs.
Usage:
usage: filter_lq_isoforms.py [-h] [--min_fl_count MIN_FL_COUNT]
[--min_exp_acc MIN_EXP_ACC]
fastq_filename output_filename
An example usage is:
python filter_lq_isoforms.py lq_isoforms.with_exp_acc.fastq lq_filtered.fastq --min_fl_count=1 --min_exp_acc=0.95
While running, you will see messages like:
Including i10_LQ_sample32f89e|c526/f1p0/10181 into output file (FL: 1, acc: 0.968).
Including i10_LQ_sample32f89e|c545/f1p35/10190 into output file (FL: 1, acc: 0.993).
Including i10_LQ_sample32f89e|c550/f1p0/10108 into output file (FL: 1, acc: 0.97).
Including i11_LQ_sample32f89e|c204/f1p0/11961 into output file (FL: 1, acc: 0.994).
Including i11_LQ_sample32f89e|c313/f1p0/11313 into output file (FL: 1, acc: 0.989).
usage: sam_to_bam.py sam_filename
Uses samtools to convert SAM to BAM file.
Convert SAM file to GFF3 format using BCBio GFF library. Requires installing BCBio first.
usage: sam_to_gff3.py [-i INPUT_FASTA] sam_filename
You can predict ORF sequences using tools such as GMST and ANGEL. Different isoforms may encode for the same protein sequences. This script will group fully identical ORF sequences and de-duplicate them.
usage: group_ORF_sequences.py FAA_FILENAME OUTPUT_PREFIX [--is_pbid]
Use the optional parameter --is_pbid
only if your .faa file have sequence IDs that have the format PB.X.Y
.
For example, if you run:
group_ORF_sequences.py test.faa test.dedup --is_pbid
This will generate two output files, test.dedup.faa
and test.dedup.group.txt
. The .group.txt
file will tell you which sequences from the input file have identical ORFs.