Skip to content

How to choose the appropriate genome coverage for Transposome

Evan Staton edited this page Mar 26, 2015 · 15 revisions

An important consideration for annotating repeats with any method is choosing the appropriate level of genome coverage. In many cases, as with non-model species, you won't know the genome size, and that's okay as long as you have an ample number of reads to sample (i.e., a few hundred thousand).

As the graphs show (link to publication to be added soon), it isn't necessary to run transposome with something like 10X or even 1X coverage, so it is crucial to down sample your data. Also, running transposome with very high coverage may generate very large BLAST files (several hundred GB is some cases), so be warned. Downsampling will make transposome run much faster and save you some major issues because very complex genomes will you eat up all your computer's memory (if your coverage is too high), no matter how much you have available. This is an issue in the Community clustering code, and I'm looking into that issue (technically, the program won't use all the memory, it just can't allocate it).

Here is a simple script to sample your data with Transposome (let's call it sample.pl):

#!/usr/bin/env perl

use strict;
use warnings;
use Transposome::SeqUtil;

my $usage = "$0 seqfile.fas num > seqfile_sampled.fas\n";
my $infile = shift or die $usage;
my $sample_size = shift or die $usage;
my $sequtil = Transposome::SeqUtil->new( file => $infile, sample_size => $sample_size , no_store => 1);
$sequtil->sample_seq;

For example, if you want 500,000 sequences from the file myseqs.fas, you could execute this script as follows:

perl sample.pl myseqs.fas 500000 > myseqs_500k.fas

The file myseqs_500k.fas would then contain 500,000 sequences randomly sampled, with equal probability, from you input file

You could also accomplish the same with a one-liner. Here is an example:

perl -MTransposome::SeqUtil -e 'Transposome::SeqUtil->new( file => shift, sample_size => 500_000, no_store => 1)->sample_seq' myseqs.fas > myseqs_500k.fas

Note that there is also a script called sample_reads.pl provided in the transposome-scripts repository for sampling reads.

Working with paired-end data

Transposome works best with paired-end sequence data, so you need to sample each pair separately and then interleave them. For example,

perl -MTransposome::SeqUtil -e 'Transposome::SeqUtil->new( file => shift, sample_size => 500_000, no_store => 1)->sample_seq' myseqs_1.fas > myseqs_500k_1.fas

and

perl -MTransposome::SeqUtil -e 'Transposome::SeqUtil->new( file => shift, sample_size => 500_000, no_store => 1)->sample_seq' myseqs_2.fas > myseqs_500k_2.fas

Then, interleave the two files (as you would for assembly) and specify the name of the file in the Transposome configuration (or during object construction, if you are writing you own code).