Reading a BAM file uses up all available CPU cores #1195
Comments
Hi David, thanks for reporting back to us so much! It really helps to get rid of a lot of bugs when actual use cases arise. The heavy CPU load is expected, you are not doing anything wrong. Since our files behave as ranges, they have to buffer the first record as soon as they are initialized. For SAM/BAM this is only the header and the first line. For a genome file this probably reads in chromosome 1, which I guess is reasonable to take roughly 30s.
First, 800% is actually great :D thanks for reporting. You are probably reading in a BAM file? If so, that is because decompression is automatically multithreaded. In my (probably outdated) benchmarks, I only reached about 600% CPU load on 8 cores. I am at a loss, though, as to why initializing both takes more resources than the individual commands; I can only speculate that this might be because of concurrent IO behaviour of the system.
Hi Svenja, thanks for your quick reply. Now it makes sense that the initialization of the sequence file takes a while. Thanks for the explanation. The 800% CPU load is still weird in my opinion. When I print a debug message after each initialization, I see that the alignment file (yes, it's a BAM) is initialized in no time (as would be expected because header + first record are short). The initialization of the sequence file (a simple FASTA) takes ~45s and puts full load on all 8 cores. This cannot be explained by decompression because the FASTA is not compressed. Even weirder is that if I change the order of the two initializations (i.e. the sequence file gets initialized first), the sequence file uses only 1 core and takes ~30s. So it looks as if something strange happens when the alignment file is initialized before the sequence file. Can you reproduce this and do you have any idea what might be going on?
Hi David, you are right, this is unexpected behaviour.
int main(int /**/, char ** argv)
{
    sequence_file_input sin{argv[1]};
    alignment_file_input ain{argv[2]};
}
int main(int /**/, char ** argv)
{
    alignment_file_input ain{argv[2]};
    sequence_file_input sin{argv[1]};
}
Scoping solves this:
int main(int /**/, char ** argv)
{
    {
        alignment_file_input ain{argv[2]};
    }
    sequence_file_input sin{argv[1]};
}
GCC9
This needs to be investigated further, although I don't know if I'll have time for this.
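To make explicit what the scoped variant changes, here is a minimal sketch of the object lifetimes involved. `resource_owner`, `without_scope` and `with_scope` are hypothetical names and not SeqAn3 types; the sketch only records construction and destruction order. Whether resources still held by the alignment file are what actually slows down the sequence file is an assumption that this thread does not confirm.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a file object that holds resources (for example
// worker threads) from construction until destruction. Not a SeqAn3 type.
class resource_owner
{
public:
    resource_owner(std::string name, std::vector<std::string> & log) :
        name_(std::move(name)), log_(log)
    {
        log_.push_back(name_ + " constructed");
    }

    ~resource_owner()
    {
        log_.push_back(name_ + " destroyed");
    }

    resource_owner(resource_owner const &) = delete;
    resource_owner & operator=(resource_owner const &) = delete;

private:
    std::string name_;
    std::vector<std::string> & log_;
};

// Both objects are alive at the same time: "ain" still holds its resources
// while "sin" is being constructed.
inline std::vector<std::string> without_scope()
{
    std::vector<std::string> log;
    {
        resource_owner ain{"ain", log};
        resource_owner sin{"sin", log};
    }
    return log;
}

// The inner braces end the lifetime of "ain" before "sin" is constructed.
inline std::vector<std::string> with_scope()
{
    std::vector<std::string> log;
    {
        {
            resource_owner ain{"ain", log};
        }
        resource_owner sin{"sin", log};
    }
    return log;
}
```

In `with_scope()` the log shows "ain destroyed" before "sin constructed", which is the only difference between the two variants.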
I have difficulties reproducing this. For me, the program always exits immediately (as expected because only one record is buffered). Can you provide the files that trigger this behaviour?
Hi Hannes, I generated a tiny BAM file which triggers the behaviour on my Linux machine: https://ws.molgen.mpg.de/ws/635147/test.bam As reference FASTA, I used
I observe a difference between Release and Debug mode, though. In Debug mode ( |
@h-2 I said I would look into the issues we have with the queue and see if it might be related. But there were other things that needed to be done before.
Maybe related to #1081
@eldariont can you give me the BAM file again? I am finally looking into it. Sorry for the long delay.
Ok, I cannot validate the issue anymore. Note that in the meantime we already fixed the issue of loading the first record on initialisation of the file. Thus, the first record is now fetched with the first call of begin and not on construction of the file. Can you verify that this issue is also fixed for your case?
That means if you change the program to:
int main(int /**/, char ** argv)
{
    sequence_file_input sin{argv[1]};
    alignment_file_input ain{argv[2]};
    std::ranges::begin(sin);
    std::ranges::begin(ain);
}
This might still be a problem?
@rrahn Sorry for the long delay. In case you still need the BAM file: https://ws.molgen.mpg.de/ws/921352/test.bam
@marehr Yes, I can still run into exactly the same behavior if I call
|
Ok thanks. I am looking into it then. With another bam file I couldn’t reproduce this. Sent with GitHawk |
Mhmm, ok I really can't reproduce this. Neither on macOS nor on linux. Here are some timings in release mode fasta then bam
bam then fasta
only fasta
only bam
|
I do not know what you are testing. But I can still reproduce this. 👀
With this snippet:
#include <seqan3/argument_parser/all.hpp>   // includes all necessary headers
#include <seqan3/core/debug_stream.hpp>     // our custom output stream
#include <seqan3/std/filesystem>            // use std::filesystem::path
#include <seqan3/io/sequence_file/all.hpp>  // FASTA support
#include <seqan3/io/alignment_file/all.hpp> // SAM/BAM support
#include <chrono>
#include <iostream>                         // std::cout
int main(int /**/, char ** argv)
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();
    seqan3::sequence_file_input sin{argv[1]};
    seqan3::alignment_file_input ain{argv[2]};
    std::ranges::begin(sin); // fetches the first record of the FASTA file
    std::ranges::begin(ain); // fetches the first record of the BAM file
    end = std::chrono::system_clock::now();
    // elapsed wall-clock time in milliseconds
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << std::endl;
}
the bam file from David and the fasta file
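The snippet above reports wall-clock time only. To also capture the CPU load discussed in this thread (for example the 800% figure), a minimal sketch could compare the CPU time of the process against the elapsed wall time. The names `load_sample`, `measure_load` and `cpu_load_percent` are hypothetical and not part of SeqAn3. The sketch assumes a POSIX-like platform, where `std::clock` reports CPU time summed over all threads of the process; on other platforms it may report something else.

```cpp
#include <chrono>
#include <ctime>
#include <functional>

// Hypothetical helper type (not part of SeqAn3): timings of one call.
struct load_sample
{
    double wall_seconds; // elapsed real time
    double cpu_seconds;  // processor time as reported by std::clock
};

// Runs `work` once and records wall-clock and CPU time around it.
inline load_sample measure_load(std::function<void()> const & work)
{
    auto const wall_start = std::chrono::steady_clock::now();
    std::clock_t const cpu_start = std::clock();

    work();

    std::clock_t const cpu_end = std::clock();
    auto const wall_end = std::chrono::steady_clock::now();

    return {std::chrono::duration<double>(wall_end - wall_start).count(),
            static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC};
}

// CPU load in percent; a value near 100 means one fully busy core.
inline double cpu_load_percent(load_sample const & sample)
{
    return sample.wall_seconds > 0.0 ? 100.0 * sample.cpu_seconds / sample.wall_seconds : 0.0;
}
```

Passing a lambda that contains the two file constructions and the two `std::ranges::begin` calls to `measure_load` should give a number comparable to what htop or `/usr/bin/time -v` shows. `steady_clock` is used here instead of `system_clock` because it is not affected by system clock adjustments during the measurement.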
@marehr @eseiler : Unfortunately, this problem still seems to persist. I didn't retry with FASTA files but observed the extreme CPU load in iGenVar when it is reading a BAM file. iGenVar actually consumes all available cores (>80) and I have no idea what it is doing with that compute power 😄 I updated the description of this issue at the top now with a new minimal example and a new test file. |
BAM will use all threads by default. You should be able to change that via https://docs.seqan.de/seqan/3-master-user/cookbook.html#setting_compression_threads So this is expected. The weird behaviour with also having a FASTA file was not expected. |
Thanks, for some reason I forgot that using all threads is the default 🤦 Sorry, I should not have changed the issue title and description then. However, I agree with you that using all cores should not be the default (particularly in a research setting where the machines have lots of cores). I think it would be more intuitive to use only one core by default. But in any case we need to set the number of threads in iGenVar so that I can run it on one of our servers without annoying other users 😄
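As an illustration of the "one core unless asked otherwise" policy suggested here, a tool could derive its thread count as in the sketch below. `choose_thread_count` is a hypothetical helper and not part of SeqAn3 or iGenVar. The setting that finally receives the value is the one described in the cookbook section linked above; from memory it is called `seqan3::contrib::bgzf_thread_count`, but treat that name as an assumption and check the cookbook.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper (not part of SeqAn3): decide how many threads to use.
// `requested` is what the user asked for, 0 meaning "not specified".
// `available` is the number of hardware threads, 0 meaning "unknown".
inline unsigned choose_thread_count(unsigned requested, unsigned available)
{
    if (available == 0u)
        available = 1u; // std::thread::hardware_concurrency() may return 0
    if (requested == 0u)
        return 1u;      // conservative default: a single thread
    return std::min(requested, available); // never exceed the hardware
}

// Convenience overload that queries the hardware itself.
inline unsigned choose_thread_count(unsigned requested)
{
    return choose_thread_count(requested, std::thread::hardware_concurrency());
}

// Assumed usage, not compiled here; the variable name is an assumption:
//     seqan3::contrib::bgzf_thread_count = choose_thread_count(user_threads);
```

With this policy a run on an 80-core server stays on one core unless the user explicitly asks for more.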
This is fixed in #2911 the new default is |
EDIT: Updated problem description on May 21
Hi,
while testing seqan/iGenVar on a real-world bam file I observed extreme CPU load caused by reading the BAM file.
Platform
Description
How to repeat the problem
Test input: download
The following minimal example should just open the given test BAM file and print all read names in the file. This is the code:
Expected behaviour
When running this code, I would expect it to print the read names and produce approximately 100% CPU load.
Actual behaviour
The program prints the read names correctly but completely uses up all available CPUs (visible in htop or below).
/usr/bin/time -v ./build/test_bam_load test.bam
Cheers
David