08. prefetch and fasterq dump
The combination of prefetch + fasterq-dump is the fastest way to extract FASTQ files from SRA accessions. The prefetch tool downloads all necessary files to your computer. It can be invoked multiple times if the download did not succeed. It will not start from the beginning every time; instead, it will pick up from where the last invocation failed.
After the download, you have the option to test the downloaded data with the vdb-validate
tool. After the successful download, there is no need for network-connectivity. You can move the folder created by prefetch
to a different location to perform the conversion to the fastq-format somewhere else (for instance to a compute-cluster without internet access).
There are a couple of points here:
- The prefetch tool downloads to a directory named by accession. E.g. prefetch SRR000001 will create a directory named SRR000001 in the current directory. Make sure that if you move the SRR000001 directory, you don't rename it, as the conversion tool will need to find the original directory.
- If you don't have internet access, run vdb-config -i and make sure that Enable Remote Access is not checked.
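The download, validation, and conversion steps described above can be sketched as one small shell function. This is only an illustration: the names run, fetch_and_convert, and DRY_RUN are made up for this sketch and are not part of the toolkit, and it assumes prefetch, vdb-validate, and fasterq-dump are on your PATH.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Illustrative name, not part of the toolkit.
fetch_and_convert() {
    acc=$1
    run prefetch "$acc" || return 1       # download into ./$acc; rerun to resume
    run vdb-validate "$acc" || return 1   # optional check of the downloaded data
    run fasterq-dump "$acc"               # conversion; no network needed after prefetch
}
```

Calling fetch_and_convert SRR000001 runs the three steps in order and stops at the first failure. Setting DRY_RUN=1 beforehand only prints the commands.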
Where prefetch stores the downloaded files depends on the configuration of the toolkit. There are 3 options:
- in the current working directory
- in the user-repository
- user-defined location
You can choose between options 1 and 2 with the vdb-config tool:
$vdb-config --prefetch-to-cwd
$vdb-config --prefetch-to-user-repo
An alternative way is to use the interactive mode of the vdb-config tool:
$vdb-config -i
This will show a screen where you can make your selection on the 'TOOLS'-page.
The 3rd option is applied directly to the prefetch tool itself:
$prefetch SRR000001 -O /path/to/be/used
Make sure the last directory of /path/to/be/used is the accession itself, e.g. prefetch SRR000001 -O /path/to/be/used/SRR000001
SRA tools expect all files of run SRR000001 to be stored in a directory that has the same name as the accession: SRR000001. This is called "Accession as Directory".
The prefetch tool has a default maximum download size of 20G. If the requested accession is bigger than 20G, you need to lift that limit. You can specify a high limit no matter how big the requested accession is, or you can query the accession size using the vdb-dump tool with the --info option. For instance, vdb-dump SRR000001 --info tells you (among other information) how big this accession is. Accession SRR000001 has 932,308,473 bytes, which is below the default limit. No action is necessary here. Accession SRR1951777 has 410,112,373,995 bytes. To download this accession, you have to lift the limit above that size:
$prefetch SRR1951777 --max-size 420000000000
You can specify the limit in:
- kilobytes (default): --max-size 10 == --max-size 10k : 10 kilobytes,
- megabytes: --max-size 10m : 10 megabytes,
- gigabytes: --max-size 10g : 10 gigabytes,
- terabytes: --max-size 10t : 10 terabytes,
- unlimited: --max-size u.
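As a rough sketch of how a byte count reported by vdb-dump --info could be turned into a --max-size value, the function below rounds up to whole gigabytes and adds one gigabyte of headroom. The name max_size_for is hypothetical. It divides by 10^9, so the result should be large enough whether the toolkit counts a gigabyte as 10^9 or 2^30 bytes (which one applies is not stated here).

```shell
#!/bin/sh
# Hypothetical helper: convert an accession size in bytes into a
# value for --max-size, expressed in gigabytes (suffix "g").
max_size_for() {
    bytes=$1
    # Round up to whole decimal gigabytes, then add 1 GB of headroom.
    gb=$(( (bytes + 999999999) / 1000000000 + 1 ))
    printf '%sg\n' "$gb"
}
```

For the example above, prefetch SRR1951777 --max-size "$(max_size_for 410112373995)" would pass --max-size 412g.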
Before you perform the extraction, you should make a quick estimate of the hard-drive space required. The final fastq files will be approximately 7 times the size of the accession. fasterq-dump needs temporary space (scratch space) during the conversion of about 1.5 times the size of the fastq files. Overall, the space you need during the conversion is approximately 17 times the size of the accession. You can check how much space you have by running:
$df -h .
Under the 4th column (Avail), you see the amount of space you have available. Please take into account that there might be quotas set by your administrator, which are not always visible.
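The estimate can be written out as a small calculation. The sketch below uses the factors from the text (fastq files about 7 times the accession, scratch space about 1.5 times the fastq files), which add up to 17.5 times the accession size, in line with the rough figure of 17. The function names are illustrative only, and the check cannot see administrator quotas.

```shell
#!/bin/sh
# Illustrative helpers; the names are not part of the toolkit.

# Estimated bytes needed during conversion for an accession of $1 bytes.
estimate_total_bytes() {
    acc_bytes=$1
    fastq=$(( acc_bytes * 7 ))        # final fastq files: about 7x the accession
    scratch=$(( fastq * 3 / 2 ))      # scratch space: about 1.5x the fastq files
    echo $(( fastq + scratch ))
}

# Bytes available in the current directory (4th column of POSIX df output).
available_bytes() {
    avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')
    echo $(( avail_kb * 1024 ))
}

# Succeeds if the current directory has room for the estimated total.
has_enough_space() {
    [ "$(available_bytes)" -ge "$(estimate_total_bytes "$1")" ]
}
```

For example, has_enough_space 932308473 || echo "not enough space" checks the current directory against the estimate for SRR000001.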
The simplest way to run fasterq-dump
is:
$fasterq-dump SRR000001
This assumes that you have previously 'prefetched' the accession into the current working directory. If directory SRR000001
is not there - the tool will try to access the accession over the network. This will be much slower, and might eventually fail due to network timeouts.
Notice that you use the accession as the command-line argument. The tool will use the current directory as scratch space and will put the output files into the current working directory. When finished, the tool will delete all temporary files it created. You now have 3 files in your working directory:
SRR000001.fastq
SRR000001_1.fastq
SRR000001_2.fastq
If you want the output files created in a different directory, use the --outdir option.
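Because a missing accession directory makes the tool fall back to the much slower network access, it can help to check for the directory first. The function below is a minimal sketch under that assumption. dump_local is a made-up name, and the second argument is passed to --outdir.

```shell
#!/bin/sh
# Hypothetical guard: refuse to run unless ./<accession> exists, so the
# conversion does not quietly fall back to network access.
dump_local() {
    acc=$1
    outdir=${2:-.}                     # default: current working directory
    if [ ! -d "$acc" ]; then
        echo "directory $acc not found; run prefetch $acc first" >&2
        return 1
    fi
    mkdir -p "$outdir" && fasterq-dump "$acc" --outdir "$outdir"
}
```

Run from the directory that contains SRR000001, dump_local SRR000001 /path/to/output writes the fastq files to /path/to/output.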
fasterq-dump performs a split-3 operation by default. fasterq-dump is not identical to the former fastq-dump regarding command-line options. Here is a short comparison between fastq-dump and fasterq-dump:
split-3
$fastq-dump SRR000001 --split-3 --skip-technical
$fasterq-dump SRR000001
split-spot
$fastq-dump SRR000001 --split-spot --skip-technical
$fasterq-dump SRR000001 --split-spot
split-files
$fastq-dump SRR000001 --split-files --skip-technical
$fasterq-dump SRR000001 --split-files
concatenated
$fastq-dump SRR000001
$fasterq-dump SRR000001 --concatenate-reads --include-technical
Here are more important differences to fastq-dump:
- The -Z|--stdout option does not work for split-3 and split-files. The tool will fall back to producing files in these cases.
- There is no --gzip|--bzip2 option. You have to compress your files explicitly after they have been written.
- There is no -A option for the accession; just specify the accession or a path directly. The tool will extract the name from them.
- fasterq-dump does not take multiple accessions, just one.
- There is no -N|--minSpotId and no -X|--maxSpotId option. The tool always processes the whole accession.
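Two of these differences (no built-in compression, one accession per invocation) can be worked around with a short loop. This is a sketch, not an official recipe: compress_fastq and dump_each are hypothetical names, and gzip is used only as an example of compressing explicitly afterwards.

```shell
#!/bin/sh
# Compress whatever fastq files exist for one accession in the current directory.
compress_fastq() {
    for f in "$1".fastq "$1"_*.fastq; do
        # An unmatched glob stays literal and is skipped by the -f test.
        if [ -f "$f" ]; then
            gzip "$f" || return 1
        fi
    done
}

# Convert several accessions one at a time, compressing after each.
dump_each() {
    for acc in "$@"; do
        fasterq-dump "$acc" || return 1
        compress_fastq "$acc" || return 1
    done
}
```

For example, dump_each SRR000001 SRR000002 converts each accession in turn and leaves .fastq.gz files behind.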
By default, prefetch <accession> will download the <accession> run file and its dependencies into the <accession> directory.
E.g., prefetch SRR000001 will create the directory SRR000001 in the current directory.
If prefetch fails, run the same prefetch command again; the download will resume.
Running prefetch <accession> when the <accession> directory already exists will download any missing reference sequence files into the <accession> directory.
Currently there is no way to download a missing vdbcache file, which is needed to speed up dumping <accession> for some runs.
If vdbcache is available remotely, it will be used.
If there is no internet access and vdbcache exists for <accession>, dumping <accession> will take a very long time.
By default, run fasterq-dump [options] <accession> in the same directory where you ran prefetch <accession>.
Fastq files will be created in the current directory.
Use the --outdir option if you want them to be created in a different directory.
If you need to move the result of a prefetch <accession> download, move the entire <accession> directory. Don't rename it.
Then cd to the parent directory of the <accession> directory and run fasterq-dump there.
If you prefetched all files and don't have internet access, run vdb-config -i and turn off Remote Access.