Map GTF to memory #755
I wonder how this would affect the worst-case samples I was seeing, which used 32 GB+ of memory (hence I had to switch to running only 1 sample on 1 F32...). Would it be an 8 GB decrease, or would it actually halve the usage?
In general, memory usage should be reduced by 8 GB. For the 32 GB case, did you see it before or after #739? That fix should have reduced memory usage by a large amount.
Sorry, what are the expected outputs from
The output of
Description
Instead of reading the entire GTF file into memory, we now create a map from each gene/transcript ID to its location in the file (as a pointer) and keep only that map in memory. This reduces the memory usage of callVariant significantly. Peak memory of callVariant is now a little over 8 GB. For more complex samples, 12 GB of memory should be saved. This means we can run more samples per node simultaneously (probably twice as many).
One thing to note is that the index files are now different. There are now two additional files, *_gene.idx and *_tx.idx. Also, the GTF file now has to be uncompressed.
A peak memory usage of 8 GB is still fairly large, but there is room for further optimization. The next largest part of memory usage is the genome, which is still read into memory and could be turned into a memory-mapped file, too.
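As a rough illustration of the idea (not the code in this PR), the sketch below builds a gene ID to byte-offset map for an uncompressed GTF and then reads a single record back by seeking. The function names `build_gene_offsets` and `read_record` are hypothetical, and the actual layout of the *_gene.idx and *_tx.idx files is not shown here. Seeking to a byte offset is straightforward only in a plain file, which is presumably why the GTF has to be uncompressed.

```python
import re
from typing import Dict


def build_gene_offsets(gtf_path: str) -> Dict[str, int]:
    """Map each gene_id to the byte offset of its 'gene' record (hypothetical helper)."""
    offsets: Dict[str, int] = {}
    # Binary mode keeps tell() values as plain byte offsets.
    with open(gtf_path, "rb") as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"#"):
                continue  # skip header/comment lines
            fields = line.decode().rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != "gene":
                continue
            match = re.search(r'gene_id "([^"]+)"', fields[8])
            if match:
                offsets[match.group(1)] = pos
    return offsets


def read_record(gtf_path: str, offset: int) -> str:
    """Seek to a stored offset and return only that one line."""
    with open(gtf_path, "rb") as handle:
        handle.seek(offset)
        return handle.readline().decode().rstrip("\n")
```

Only the dictionary of offsets stays in memory; each record is parsed on demand when it is looked up.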
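A minimal sketch of what memory-mapped access could look like, using Python's standard `mmap` module. It assumes a hypothetical flat file holding raw sequence bytes with no FASTA headers or line breaks; a real genome file would need an index to translate coordinates into byte offsets. The name `read_slice` is an illustration only.

```python
import mmap


def read_slice(path: str, start: int, end: int) -> bytes:
    """Return bytes [start, end) of a file without loading the whole file."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[start:end]
```

With this approach only the pages that are actually touched are brought into memory, so peak usage depends on the regions accessed rather than on the file size.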
Closes #371
Checklist
.png, .jpeg, .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
CHANGELOG.md under the next release version or unreleased, and updated the date.