Use larger blocksize with bwa index on >100MB files #13
customBlockSize :: FilePath -> IO [String]
customBlockSize path = sizeAsParam <$> getFileSize path

sizeAsParam :: FileOffset -> [String]
Please add a comment here with the same information as you have in the PR (from my POV, feel free to copy & paste):
BWA's default indexing parameters are quite conservative. This leads to
a small memory footprint at the cost of more CPU hours.
With large databases (~100GB) default settings require over 2 weeks of
CPU time. Increasing the default blocksize will increase the memory
footprint but will reduce indexing time 3 to 6 fold.
This patch increases the blocksize to roughly 1/10th of the filesize.
The memory footprint should be about the size of the database.
As per lh3/bwa#104 this patch may become
obsolete once this functionality is built into bwa.
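
For readers following along, the rule described above (no extra flags for small files, a block size of roughly one tenth of the file size otherwise) could be sketched as below. This is only an illustration, not the code in this patch: the names `minimalSize`, `blockSizeArgs` and `blockSizeArgsFor` are hypothetical, the 100 MB threshold is read from the PR title and assumed to mean 100 * 10^6 bytes, and `Integer` is used in place of `FileOffset` so the snippet depends only on `base`. `-b` is the block size option of `bwa index`.

```haskell
import System.IO (IOMode (ReadMode), hFileSize, withFile)

-- Assumed threshold: files up to this size keep bwa's default block size.
-- (Taken from the PR title; interpreting "100MB" as 100 * 10^6 bytes is a guess.)
minimalSize :: Integer
minimalSize = 100 * 1000 * 1000

-- Extra arguments for `bwa index`, given the database size in bytes.
-- Small files get no extra arguments; larger ones get "-b <size / 10>".
blockSizeArgs :: Integer -> [String]
blockSizeArgs size
  | size <= minimalSize = []
  | otherwise = ["-b", show (size `div` 10)]

-- Same as above, but looks the size up on disk first.
blockSizeArgsFor :: FilePath -> IO [String]
blockSizeArgsFor path = blockSizeArgs <$> withFile path ReadMode hFileSize
```

Under these assumptions, a 1 GB (10^9 byte) FASTA file would yield `["-b", "100000000"]`, and a 50 MB file would yield `[]`, leaving bwa's defaults untouched.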
Done but tests still failing. Will check what's wrong with the tests later.
Tests fixed. Also included a few other commits that made it easier to debug what was going on with TravisCI.
Thanks. Merging!
This reduces CPU time required for indexing drastically at the expense of higher memory usage.
The same machine that will run bwa mem should be able to accommodate this without problems.