
08. prefetch and fasterq dump

Andrew Klymenko edited this page Dec 22, 2020 · 9 revisions

How to use prefetch and fasterq-dump to extract FASTQ-files from SRA-accessions

The combination of prefetch + fasterq-dump is the fastest way to extract FASTQ-files from SRA-accessions. The prefetch tool downloads all necessary files to your computer. If a download does not succeed, you can invoke prefetch again: it does not start from the beginning each time, but picks up where the last invocation failed.

After the download, you can optionally test the data with the vdb-validate tool. Once the download has succeeded, network connectivity is no longer needed. You can move the folder created by prefetch to a different location and perform the conversion to the fastq-format there (for instance, on a compute-cluster without internet access).

There are a couple of points here:

  • prefetch downloads to a directory named after the accession. E.g. prefetch SRR000001 will create a directory named SRR000001 in the current directory. If you relocate the data, move the SRR000001 directory as a whole and do not rename it.
  • If you don't have internet access, run vdb-config -i and make sure Enable Remote Access is not checked.
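Because prefetch resumes an interrupted download, re-running it in a loop is one simple way to cope with flaky connections. The following is a minimal sketch under stated assumptions, not part of the toolkit: the function name prefetch_with_retry and the PREFETCH_CMD override are hypothetical, and only the plain call prefetch <accession> comes from the text above.

```shell
# Re-run prefetch until it succeeds or the attempt budget is used up.
# prefetch_with_retry and PREFETCH_CMD are hypothetical names for this
# sketch; PREFETCH_CMD only exists so the command can be swapped out.
prefetch_with_retry() {
    local accession="$1"
    local max_attempts="${2:-5}"
    local attempt=1

    while [ "$attempt" -le "$max_attempts" ]; do
        # Each call continues from where the previous one stopped.
        if "${PREFETCH_CMD:-prefetch}" "$accession"; then
            return 0
        fi
        echo "prefetch attempt $attempt for $accession failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example (not executed here):
#   prefetch_with_retry SRR000001 5 && vdb-validate SRR000001
```

The retry count is arbitrary; on a slow link you may prefer a larger budget or a pause between attempts.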

Into what location will prefetch save downloaded files?

That depends on the configuration of the toolkit. There are 3 options:

  1. in the current working directory
  2. in the user-repository
  3. user-defined location

You can select between options 1 and 2 with the vdb-config tool:

  • $vdb-config --prefetch-to-cwd
  • $vdb-config --prefetch-to-user-repo

An alternative way is to use the interactive mode of the vdb-config tool:

  • $vdb-config -i

This will show a screen where you can make your selection on the 'TOOLS'-page.

The 3rd option is applied directly to the prefetch tool itself:

  • $prefetch SRR000001 -O /path/to/be/used

Make sure the last directory of /path/to/be/used is the accession itself, e.g. prefetch SRR000001 -O /path/to/be/used/SRR000001. SRA tools expect all files of run SRR000001 to be stored in a directory with the same name as the accession: SRR000001. This is called "Accession as Directory".

Check the maximum-size limit of the 'prefetch' tool

The prefetch tool has a default maximum download-size of 20G. If the requested accession is bigger than 20G, you need to raise that limit. You can either specify a limit high enough for any accession, or query the accession-size with the vdb-dump tool and its --info option. For instance, vdb-dump SRR000001 --info tells you (among other information) how big this accession is. Accession SRR000001 has 932,308,473 bytes, which is below the default limit, so no action is necessary. Accession SRR1951777 has 410,112,373,995 bytes. To download this accession you have to raise the limit above that size:

  • $prefetch SRR1951777 --max-size 420000000000

You can specify the limit in:

  • kilobytes (default): --max-size 10 == --max-size 10k : 10 kilobytes,
  • megabytes: --max-size 10m : 10 megabytes,
  • gigabytes: --max-size 10g : 10 gigabytes,
  • terabytes: --max-size 10t : 10 terabytes,
  • unlimited: --max-size u.
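The size check described above can be sketched as a small helper. This is an illustration under assumptions, not toolkit behaviour: the function names are hypothetical, the default "20G" is interpreted here as 20 * 1024^3 bytes (the text does not say whether it is binary or decimal), and the suggested value is rounded up generously so that it exceeds the accession size under either interpretation of the g suffix. Going by the unit list above, a bare number is read as kilobytes, so a value with an explicit suffix may be the clearer way to express a limit.

```shell
# Assumption: the default limit "20G" means 20 * 1024^3 bytes.
DEFAULT_LIMIT_BYTES=$((20 * 1024 * 1024 * 1024))

# Succeeds if an accession of the given size (bytes) is above the default limit.
# Hypothetical helper, not part of the SRA toolkit.
exceeds_default_limit() {
    [ "$1" -gt "$DEFAULT_LIMIT_BYTES" ]
}

# Print a --max-size value in gigabytes that is safely above the given size.
# Dividing by 10^9 and adding 2 leaves headroom whether "g" is decimal or binary.
suggest_max_size() {
    local bytes="$1"
    echo "$((bytes / 1000000000 + 2))g"
}

# Example (not executed here), with the size reported by vdb-dump --info:
#   exceeds_default_limit 410112373995 && \
#       prefetch SRR1951777 --max-size "$(suggest_max_size 410112373995)"
```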

Extract fastq-file(s) from an SRA-accession

Before you perform the extraction, you should make a quick estimate of the hard-drive space required. The final fastq-files will be approximately 7 times the size of the accession. During the conversion, fasterq-dump needs temporary space (scratch space) of about 1.5 times the size of the fastq-files. Overall, the space you need during the conversion is approximately 17 times the size of the accession. You can check how much space you have by running $df -h . (the trailing dot stands for the current directory). The 4th column (Avail) shows the amount of space available. Please take into account that there might be quotas set by your administrator, which are not always visible.
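The estimate above can be turned into a quick check before starting a conversion. The sketch below is illustrative only: the function names are hypothetical, and the factors are the rough ones quoted in the text (7 times the accession for the fastq-files plus 1.5 times that for scratch space, which works out to 17.5 times, slightly above the rounded figure of 17). Quotas are not visible to df, so a positive result is not a guarantee.

```shell
# Print three numbers in bytes: fastq size, scratch size, total.
# Hypothetical helper, using the approximate factors from the text.
estimate_space() {
    local accession_bytes="$1"
    local fastq=$((accession_bytes * 7))   # final fastq-files: about 7x the accession
    local scratch=$((fastq * 3 / 2))       # scratch space: about 1.5x the fastq-files
    echo "$fastq $scratch $((fastq + scratch))"
}

# Print the space available in a directory, in bytes.
# df -Pk reports 1024-byte blocks; column 4 of the second line is "Available".
available_bytes() {
    local kb
    kb=$(df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }')
    echo "$((kb * 1024))"
}

# Succeeds if the directory (default: current) has room for the conversion.
has_enough_space() {
    local needed
    needed=$(estimate_space "$1" | awk '{ print $3 }')
    [ "$(available_bytes "${2:-.}")" -ge "$needed" ]
}

# Example (not executed here):
#   has_enough_space 932308473 . || echo "not enough space for SRR000001" >&2
```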

The simplest way to run fasterq-dump is:

  • $fasterq-dump SRR000001

This assumes that you have previously 'prefetched' the accession into the current working directory. Notice that you use the accession as the command-line argument. The tool will use the current directory as scratch-space and will put the output-files into the current working directory. When finished, the tool will delete all temporary files it created. You now have 3 files in your working directory:

  • SRR000001.fastq
  • SRR000001_1.fastq
  • SRR000001_2.fastq

fasterq-dump performs a split-3 operation by default. Its command-line options are not identical to those of the former fastq-dump. Here is a short comparison between fastq-dump and fasterq-dump:

split-3

  • $fastq-dump SRR000001 --split-3 --skip-technical
  • $fasterq-dump SRR000001

split-spot

  • $fastq-dump SRR000001 --split-spot --skip-technical
  • $fasterq-dump SRR000001 --split-spot

split-files

  • $fastq-dump SRR000001 --split-files --skip-technical
  • $fasterq-dump SRR000001 --split-files

concatenated

  • $fastq-dump SRR000001
  • $fasterq-dump SRR000001 --concatenate-reads --include-technical

Here are further important differences from fastq-dump:

  • The -Z|--stdout option does not work for split-3 and split-files. The tool will fall back to producing files in these cases.
  • There is no --gzip|--bzip2 option. You have to compress your files explicitly after they have been written.
  • There is no -A option for the accession, just specify the accession or a path directly. The tool will extract the name from them.
  • fasterq-dump does not take multiple accessions, just one.
  • There is no -N|--minSpotId and no -X|--maxSpotId option. The tool always processes the whole accession.
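Two of the differences listed above (no built-in compression, one accession per call) can be handled with a small wrapper. This is a sketch under assumptions rather than a recommended workflow: the function names and the FASTERQ_DUMP_CMD override are hypothetical, gzip is assumed to be installed, and the output file names follow the SRR000001.fastq / SRR000001_1.fastq / SRR000001_2.fastq pattern shown earlier.

```shell
# Compress every fastq-file written for one accession in a directory.
# Hypothetical helper; relies on the <accession>.fastq and
# <accession>_N.fastq naming shown in the text.
compress_fastq() {
    local accession="$1"
    local dir="${2:-.}"
    local f
    for f in "$dir/$accession".fastq "$dir/$accession"_*.fastq; do
        [ -f "$f" ] || continue   # skip patterns that matched nothing
        gzip -f "$f"
    done
}

# Convert several accessions one after another, compressing each result.
# FASTERQ_DUMP_CMD is a hypothetical override used only in this sketch.
dump_and_compress() {
    local acc
    for acc in "$@"; do
        "${FASTERQ_DUMP_CMD:-fasterq-dump}" "$acc" || return 1
        compress_fastq "$acc"
    done
}

# Example (not executed here):
#   dump_and_compress SRR000001 SRR000002
```

Stopping at the first failed conversion is a deliberate choice here, so that a full disk or a missing accession is noticed before further accessions are processed.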