-
Notifications
You must be signed in to change notification settings - Fork 6
How to choose the appropriate genome coverage for Transposome
An important consideration for annotating repeats with any method is choosing the appropriate level of genome coverage. In many cases, as with non-model species, you won't know the genome size, and that's okay as long as you have an ample number of reads to sample (i.e., a few hundred thousand).
As the graphs show (link to publication to be added soon), it isn't necessary to run transposome
with something like 10X or even 1X coverage, so it is crucial to down sample your data. Also, running transposome
with very high coverage may generate very large BLAST files (several hundred GB is some cases), so be warned. Downsampling will make transposome
run much faster and save you some major issues because very complex genomes will you eat up all your computer's memory (if your coverage is too high), no matter how much you have available. This is an issue in the Community clustering code, and I'm looking into that issue (technically, the program won't use all the memory, it just can't allocate it).
Here is a simple script to sample your data with Transposome (let's call it sample.pl):
#!/usr/bin/env perl
use strict;
use warnings;
use Transposome::SeqUtil;
my $usage = "$0 seqfile.fas num > seqfile_sampled.fas\n";
my $infile = shift or die $usage;
my $sample_size = shift or die $usage;
my $sequtil = Transposome::SeqUtil->new( file => $infile, sample_size => $sample_size , no_store => 1);
$sequtil->sample_seq;
For example, if you want 500,000 sequences from the file myseqs.fas, you could execute this script as follows:
perl sample.pl myseqs.fas 500000 > myseqs_500k.fas
The file myseqs_500k.fas would then contain 500,000 sequences randomly sampled, with equal probability, from you input file
You could also accomplish the same with a one-liner. Here is an example:
perl -MTransposome::SeqUtil -e 'Transposome::SeqUtil->new( file => shift, sample_size => 500_000, no_store => 1)->sample_seq' myseqs.fas > myseqs_500k.fas
Note that there is also a script called sample_reads.pl provided in the transposome-scripts repository for sampling reads.
Transposome works best with paired-end sequence data, so you need to sample each pair separately and then interleave them. For example,
perl -MTransposome::SeqUtil -e 'Transposome::SeqUtil->new( file => shift, sample_size => 500_000, no_store => 1)->sample_seq' myseqs_1.fas > myseqs_500k_1.fas
and
perl -MTransposome::SeqUtil -e 'Transposome::SeqUtil->new( file => shift, sample_size => 500_000, no_store => 1)->sample_seq' myseqs_2.fas > myseqs_500k_2.fas
Then, interleave the two files (as you would for assembly) and specify the name of the file in the Transposome configuration (or during object construction, if you are writing you own code).