Map GTF to memory #755

zhuchcn · 2023-06-29T05:56:20Z

Description

Instead of reading the entire GTF file into memory, we now create a map from each gene/transcript ID to its location in the file (as a pointer) and keep only the map in memory. This reduces the memory usage of callVariant significantly. Peak memory of callVariant is now a little over 8 GB. For more complex samples, 12 GB of memory should be safe. This means we can run more samples per node simultaneously (probably twice as many).
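For reviewers who want a picture of the idea, here is a minimal sketch of an ID-to-offset map over an uncompressed GTF. It is an illustration only, not the code in this PR: `parse_attributes`, `build_offset_index`, and `fetch_record` are hypothetical names, and the toy GTF is made up.

```python
import io
from typing import BinaryIO, Dict


def parse_attributes(text: str) -> Dict[str, str]:
    # Parse the 9th GTF column, e.g. gene_id "G1"; transcript_id "T1";
    attrs = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition(" ")
        attrs[name] = value.strip().strip('"')
    return attrs


def build_offset_index(handle: BinaryIO, feature: str, key: str) -> Dict[str, int]:
    # Map each ID to the byte offset of its first matching record line.
    # The handle is read in binary mode so tell() returns plain byte offsets.
    index: Dict[str, int] = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#"):
            continue
        fields = line.rstrip(b"\n").split(b"\t")
        if len(fields) < 9 or fields[2].decode() != feature:
            continue
        attrs = parse_attributes(fields[8].decode())
        if key in attrs:
            index.setdefault(attrs[key], offset)
    return index


def fetch_record(handle: BinaryIO, offset: int) -> str:
    # Jump straight to a record; nothing else in the file is read.
    handle.seek(offset)
    return handle.readline().decode().rstrip("\n")


# Toy annotation standing in for an uncompressed GTF file on disk.
TOY_GTF = (
    b"##description: toy annotation\n"
    b'chr1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n'
    b'chr1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    b'chr1\ttoy\texon\t1\t50\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    b'chr1\ttoy\ttranscript\t20\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
)

handle = io.BytesIO(TOY_GTF)
gene_index = build_offset_index(handle, feature="gene", key="gene_id")
tx_index = build_offset_index(handle, feature="transcript", key="transcript_id")
record = fetch_record(handle, tx_index["T2"])
```

Only the two dictionaries stay in memory; each record is read from disk on demand. Plain byte offsets like these are only meaningful for an uncompressed file, since seeking into a gzip stream means decompressing from the start.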

One thing to note is that the index files are now different. There are now two additional files, *_gene.idx and *_tx.idx. Also, the GTF file now has to be uncompressed.

A peak memory usage of 8 GB is still fairly large, but there is room for further optimization. The next biggest contributor to memory usage is the genome, which is still read into memory and could be turned into a memory-mapped file, too.
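As a rough sketch of what a memory-mapped genome could look like (not implemented in this PR), the snippet below uses Python's standard `mmap` module to slice a FASTA sequence without loading the file. The function names are hypothetical. It assumes fixed-width sequence lines with `\n` line endings, and that the sequence offset and line lengths are known in advance, for example from a `.fai`-style index.

```python
import mmap


def open_readonly_mmap(path: str) -> mmap.mmap:
    # Map the whole file read-only; pages are loaded lazily by the OS.
    with open(path, "rb") as handle:
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)


def fetch_subsequence(
    mm: mmap.mmap,
    seq_offset: int,   # byte offset of the first base of the sequence
    line_bases: int,   # number of bases per line
    line_width: int,   # number of bytes per line, including the newline
    start: int,        # 0-based, inclusive
    end: int,          # 0-based, exclusive
) -> str:
    # Translate a base position into a byte position in the file.
    def to_byte(pos: int) -> int:
        return seq_offset + (pos // line_bases) * line_width + pos % line_bases

    raw = mm[to_byte(start):to_byte(end)]
    # Drop the newlines that fall inside the slice.
    return raw.replace(b"\n", b"").decode()
```

With this layout, only the pages that hold the requested region need to be resident, so memory use would follow the access pattern rather than the genome size.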

Closes #371

Checklist

This PR does NOT contain PHI or germline genetic data. A repo may need to be deleted if such data is uploaded. Disclosing PHI is a major problem.
This PR does NOT contain molecular files, compressed files, output files such as images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
I have read the code review guidelines and the code review best practice on GitHub check-list.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
All test cases passed locally.

lydiayliu · 2023-06-29T16:28:09Z

I wonder how it would be for the worst-case samples I was seeing, which used 32 GB+ of memory (hence I had to switch to running only 1 sample on 1 F32...). Would it be an 8 GB decrease, or actually halving?

zhuchcn · 2023-06-29T16:34:03Z

In general, memory usage should be reduced by 8 GB. For the 32 GB case, did you see it before or after #739? This fix should have reduced memory usage by a large amount.

lydiayliu · 2023-07-04T18:36:10Z

Sorry, what are the expected outputs from generateIndex after this update? Is the vignette updated?

zhuchcn · 2023-07-05T16:16:16Z

The output of generateIndex has changed, but that's not something users should worry about. As long as the correct --index-dir is given, all index files should be loaded correctly. We didn't mention which files are generated by generateIndex in the vignette, but it might be good to update the doc page for generateIndex itself. I'll push to #762 shortly.

zhuchcn and others added 12 commits June 23, 2023 18:36

fix (GenomicAnnotationOnDisk): class created to keep genomic annotation on disk

8d24b62

fix (GenomicAnnotationOnDisk): added index_gtf function

53ea8dd

fix (GenomicAnnotationOnDisk): added loader function from gtf idx

39b9c6b

fix (GenomicAnnotationOnDisk): check protein coding when generating gtf index

dc149ca

fix (GenerateIndex): moving to use gene and tx index

d435bca

fix (moPepGen): test cases passed

81765b5

fix (moPepGen): docstrings added

9c75b60

fix (moPepGen): update version to 1.1.0

79771ac

fix (generateIndex): trying to set compression to none

ff4ba9f

fix (generateIndex): remove code to compress the copied gtf file

b1ac30c

style (moPepGen): fix pylint, added docstrings

19c682e

doc (CHANGELOG): changelog added

80de41c

zhuchcn requested a review from lydiayliu June 29, 2023 05:56

zhuchcn added 2 commits June 29, 2023 09:08

style (test)): docstrings added

f1554dd

fix (GenomicAnnotationOnDisk): gtf file handle should be uncompressed GTF file only

8ea2e51

fix (test): asserts added to GenomicAnnotationOnDisk

60822b4

zhuchcn added 2 commits June 29, 2023 09:49

fix (test): versions of gtf idx files dont match with the github action

9a92ecf

fix (GenomicAnnotationOnDisk): remove unused import

8685d08

lydiayliu approved these changes Jun 29, 2023

moPepGen/cli/generate_index.py

test/files/fusion/star_fusion.gvf (Outdated)

moPepGen/gtf/GtfIO.py

moPepGen/gtf/GTFPointer.py (Outdated)

fix (gtf): created GTFSourceInferrer class

2945a19

zhuchcn merged commit f107557 into main Jun 29, 2023

zhuchcn deleted the czhu-fix-index branch June 29, 2023 20:20
