Is there a general strategy for simultaneous random access to GZ blocks in an indexed gz file?
For large files, I know the virtual positions and buffer sizes I want to access, so when these are totally disjoint, reading them via multiple threads should give a nice speed-up. Right now, this can't be done via the Indexed Reader, since the virtual position is a mutable property of the reader. However, I notice that contiguous blocks can be read multi-threaded, so in principle, why not discontiguous blocks?
thanks!
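For readers less familiar with the term: a BGZF virtual position packs two offsets into a single 64-bit value. The upper 48 bits are the file offset of the start of a compressed block, and the lower 16 bits are the offset into that block's data after decompression. Below is a minimal sketch of that packing. The `VirtualPosition` struct and its method names here are illustrative only and are not the noodles type of the same name.

```rust
/// Illustrative stand-in for a BGZF virtual position (not the noodles type).
/// Upper 48 bits: file offset of the start of a compressed block.
/// Lower 16 bits: offset into that block's decompressed data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VirtualPosition {
    pub compressed: u64,
    pub uncompressed: u16,
}

impl VirtualPosition {
    /// Split a raw 64-bit virtual position into its two offsets.
    pub fn from_raw(raw: u64) -> Self {
        Self {
            compressed: raw >> 16,
            uncompressed: (raw & 0xffff) as u16,
        }
    }

    /// Pack the two offsets back into a raw 64-bit virtual position.
    pub fn to_raw(self) -> u64 {
        (self.compressed << 16) | u64::from(self.uncompressed)
    }
}
```

Two requests whose compressed block ranges do not overlap touch disjoint bytes of the file, which is what makes reading them on separate threads plausible in the first place.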
Using the sync_file crate, you can clone the file handle into a new reader for each thread. Each clone of the SyncFile maintains its own independent position, which allows multi-threaded reads. As I understand it, reading the bytes off the disk is still sequential, but the decompression that follows the read can then happen concurrently, and that is where most of the time is spent anyway.
use itertools::Itertools;
use noodles::bcf;
use rayon::prelude::*;
use sync_file::SyncFile;

// `path`, `index` (the index loaded for this file) and `do_stuff`
// are assumed to be defined elsewhere.
let f = SyncFile::open(path)?;

// Read the header once up front; it is shared read-only by all threads.
let header = {
    let mut bcf_r = bcf::Reader::new(f.clone());
    bcf_r.read_file_format()?;
    bcf_r.read_header()?
};

// Collect the contigs, then query each one on its own rayon task.
(0..)
    .map_while(|i| header.contigs().get_index(i))
    .collect_vec()
    .into_par_iter()
    .for_each(|chrom| {
        // Each task builds its own reader over a clone of the file handle,
        // so it has an independent read position.
        let mut bcf_r = {
            let mut bcf_r = bcf::Reader::new(f.clone());
            bcf_r.read_file_format().expect("failed to read format");
            let _r_header = bcf_r.read_header().expect("failed to read header");
            bcf_r
        };
        let region = format!("{}", chrom.0).parse().expect("failed to parse region");
        let records = bcf_r
            .query(&header, &index, &region)
            .expect("failed to query index");
        records.for_each(|record| {
            do_stuff(record);
        });
    });
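The underlying idea, reads that carry their own offset instead of moving a shared cursor, can also be sketched with the standard library alone. The example below is a rough illustration and not the sync_file or noodles API: `read_ranges` is a hypothetical helper, it relies on the Unix-only `FileExt::read_exact_at`, it spawns one thread per range for simplicity, and it leaves out the decompression step entirely.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // Unix-only: positional reads that ignore the file cursor
use std::thread;

/// Hypothetical helper: read each `(offset, len)` byte range on its own thread.
/// `read_exact_at` takes `&self` and an explicit offset, so the threads
/// share one `File` without any of them moving a common read position.
pub fn read_ranges(file: &File, ranges: &[(u64, usize)]) -> io::Result<Vec<Vec<u8>>> {
    thread::scope(|scope| {
        // Spawn one scoped thread per requested range.
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(offset, len)| {
                scope.spawn(move || {
                    let mut buf = vec![0u8; len];
                    file.read_exact_at(&mut buf, offset)?;
                    Ok::<_, io::Error>(buf)
                })
            })
            .collect();

        // Join in request order; the first I/O error, if any, is returned.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("reader thread panicked"))
            .collect()
    })
}
```

In a real pipeline each thread would go on to decompress the bytes it read, and a thread pool such as rayon would likely replace the one-thread-per-range spawning used here.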
@zaeleus I'd be grateful for your thoughts on this discussion.
I'm using your fantastic crate to teach myself Rust and try to implement a small project where I want to fetch a lot of variants in parallel from big indexed bgzipped VCFs.
The idea described by the author of this issue makes a lot of sense to me. Do you think there are any blockers if I wanted to implement multi-threaded random-access reads?