Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map GTF to memory #755

Merged
merged 18 commits into from
Jun 29, 2023
Merged

Map GTF to memory #755

merged 18 commits into from
Jun 29, 2023

Conversation

zhuchcn
Copy link
Member

@zhuchcn zhuchcn commented Jun 29, 2023

Description

Instead of reading the entire GTF file into memory, we now create a map of gene/transcript ID and the location in the file (as a pointer) and only keep the map in memory. This reduces memory usage of callVariant significantly. Peak memory of callVairant is now a little over 8GB. For more complex samples, 12 GB of memory should be save. This means we can run more samples per node simultaneously (probably double it).

One thing to note is the index files are now different. There are not two additional files *_gene.idx and *_tx.idx. Also the GTF file now has to be uncompressed.

8GB peak memory usage is still kind of big but there is still room for optimization. The next big part of memroy usage is the genome, which is now still read into memory, which can be optimized into a memory-mapped file, too.

Closes #371

Checklist

  • This PR does NOT contain PHI or germline genetic data. A repo may need to be deleted if such data is uploaded. Disclosing PHI is a major problem.
  • This PR does NOT contain molecular files, compressed files, output files such as images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
  • I have read the code review guidelines and the code review best practice on GitHub check-list.
  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
  • I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
  • All test cases passed locally.

@zhuchcn zhuchcn requested a review from lydiayliu June 29, 2023 05:56
@lydiayliu
Copy link
Collaborator

lydiayliu commented Jun 29, 2023

I wonder how it would be for the worst case samples that I was seeing that was 32G+ in memory usage (hence had to switch to running only 1 sample on 1 F32...). Would it be a -8G decrease or actually halving?

@zhuchcn
Copy link
Member Author

zhuchcn commented Jun 29, 2023

In general the memory usage should reduced for 8GB. For the 32 GB case, did you see it before or after #739 ? This fix should have reduced a big amount of memory

@zhuchcn zhuchcn merged commit f107557 into main Jun 29, 2023
@zhuchcn zhuchcn deleted the czhu-fix-index branch June 29, 2023 20:20
@lydiayliu
Copy link
Collaborator

sorry what are the expect outputs from generateIndex now after this update? Is the vignette updated?

@zhuchcn
Copy link
Member Author

zhuchcn commented Jul 5, 2023

The output of generateIndex is changed but that's not something that users should worry about. As long as the correct --index-dir is given, all index files should be loaded correctly. We didn't mention what files are generated by generateIndex in vignette, but might be good to update the doc page for generateIndex itself. I'll push to #762 shortly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

generate index overwrite and keep on disk
2 participants