-
Notifications
You must be signed in to change notification settings - Fork 251
Download On Demand
The majority of sra-tools have the ability to locate and download data from the NCBI SRA on-demand, removing the need for a separate download step, and most importantly downloading only the data that are required. This feature can reduce the bandwidth, storage, and time taken to perform tasks that use less than 100% of the data contained in a run.
sra-tools utilize VDB name resolution, enabling them to accept simple accessions as parameters instead of filesystem paths. The VDB name resolver will generate URLs into the NCBI SRA for any object not found locally, allowing the object to be opened and retrieved over https.
$ prefetch SRR000001
2016-12-01T15:51:52 prefetch.2.8.0: 1) Downloading 'SRR000001'...
2016-12-01T15:51:52 prefetch.2.8.0: Downloading via http...
2016-12-01T15:52:22 prefetch.2.8.0: 1) 'SRR000001' was downloaded successfully
This demonstrates using prefetch
to download a run, in this case over https. [NB - the tool still states that it is using http even though it may be using https. This is a cosmetic defect and will be fixed in the next release.] For higher throughput, Aspera downloads can be used if installed on your system.
The actual file has been downloaded to a cache area in your filesystem:
$ srapath SRR000001
/home/you/ncbi/public/sra/SRR000001.sra
The run file is compressed, occupying about 311M on disk:
$ ls -l /home/you/ncbi/public/sra/SRR000001.sra
-rw-rw-r-- 1 you you 325788509 2014-11-19 16:45 /home/you/ncbi/public/sra/SRR000001.sra
Now convert to fastq (NOTE - runs downloaded with prefetch are now located by accession):
$ sff-dump SRR000001
Read 470985 spots for SRR000001
Written 470985 spots for SRR000001
This run contains 454 data with signals. Here it is in SFF format (about 746M):
$ ls -l SRR000001.sff
-rw-rw-r-- 1 you you 782054672 2014-11-19 16:59 SRR000001.sff
In this example, the run was first downloaded using prefetch
and stored in the user's public cache. Next, the run was converted into SFF, passing only the simple accession as an argument, but all data were read from cache.
$ cache-mgr --report
-----------------------------------
0 cached file(s)
1 complete file(s)
325,788,509 bytes in cached files
325,788,509 bytes used in cached files
0 lock files
Here, we've checked the contents of our cache. It tells us that there are no partially cached files, 1 complete file (our SRR000001.sra from example 1), and the corresponding bytes. The file was completely downloaded by prefetch
.
Let's clear the cache entirely:
$ cache-mgr --clear
-----------------------------------
1 files removed
0 directories removed
325,788,509 bytes removed
Now, we can run fastq-dump
on the accession without prior download. To verify that the run will be found remotely, we can use srapath
to tell us where the complete object is located:
$ srapath SRR000001
https://sra-download.ncbi.nlm.nih.gov/srapub/SRR000001
We see that the path is now remote. Let's convert on-the-fly:
$ fastq-dump SRR000001
Read 470985 spots for SRR000001
Written 470985 spots for SRR000001
Looking at the fastq file, we can see it is complete:
$ ls -l SRR000001.fastq
-rw-rw-r-- 1 you you 301196578 2014-11-19 17:17 SRR000001.fastq
$ wc -l SRR000001.fastq
1883940 SRR000001.fastq
$ expr 1883940 / 4
470985
Notice that the fastq is slightly smaller than the original SRA file. This is due to the fact that this SRA file also carries 454 signal and clipping data, as well as inlined linker sequences that are not used by fastq. (This is true for all data submitted as SFF.)
Let's look again at the cache contents:
$ cache-mgr --report
-----------------------------------
1 cached file(s)
0 complete file(s)
325,788,832 bytes in cached files
121,351,760 bytes used in cached files
0 lock files
The report tells us that there is 1 partially cached file, and no complete files. This is because fastq-dump
only needs read names, read sequences, and qualities. In this case, the amount of data cached is shown as 121,351,760 bytes, instead of the full 325,788,832 contained in the original SRR.