{AH} break infinite loop in certain files #7
Hi Andreas, Thanks for the bug report and fix. I am on holidays at the moment so won't be able to examine the pull request in detail for ~2 weeks. I am a little worried that this problem is triggered by a bug somewhere upstream in cheers, Jared
Many thanks, enjoy your holidays!
Taking a look at this now. I am having difficulty recreating the issue. I know you probably can't share data, but could you sketch out what the files look like? I think what is happening is you have some chunk files like:
and this indeed causes an infinite loop. I think this is actually a bug upstream, so I would like to fix it during the ingestion process (since it might be causing silent errors elsewhere) rather than in the genotyping procedure.
Thanks, I have been on the road. These are the commands I ran on the unpatched version, with the DEBUG level set to 5:
This is where it hangs in strace:
Is this sufficient? When I debugged the issue I did find the situation you describe above. Unfortunately I have already deleted intermediate files and have only the blocks. Interestingly, not all blocks or chromosomes have this issue, for example the following works:
Thanks for the detailed info. Were the input files for this whole-genome GVCFs, or had they been pre-processed in some way? We have just been discussing this in #9, where a user has encountered the same bug. In that case, it was triggered by the input GVCFs being sliced into smaller regions beforehand.
Hi @jaredo, the gVCFs were not sliced beforehand; each block of 500 contained the full genome. The genotyping step was done per chromosome, but the data sets themselves were not sliced. Best wishes,
Hmm... is it possible that any of the input GVCFs had rows with POS>1 for the first line of a chromosome? My concern is that if the .dpt file did not flank the variants at the start of the file, then you will not get correct DP/GQ information for the non-ALT samples. So I think this is a bug in I think my interim solution will be to check that the .dpt file entirely flanks the variants in the .bcf file; if this is not the case, I will throw an informative error.
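For illustration only, the flanking check described above might look roughly like the sketch below. It is not the tool's code. It assumes the variant positions from the .bcf file and the depth intervals from the .dpt file have already been read into plain Python lists for a single chromosome; the function name, the `(start, end)` interval layout, and the error wording are all hypothetical.

```python
from typing import Sequence, Tuple


def check_depth_flanks_variants(
    variant_positions: Sequence[int],
    depth_intervals: Sequence[Tuple[int, int]],
    chrom: str = "?",
) -> None:
    """Raise ValueError unless the depth intervals span every variant position.

    Assumptions (hypothetical layout, not the real .dpt format):
    positions are 1-based, intervals are inclusive (start, end) pairs,
    and both inputs belong to the same chromosome.
    """
    if not variant_positions:
        return  # no variants, so there is nothing to flank
    if not depth_intervals:
        raise ValueError(f"{chrom}: variants present but no depth records")

    first_variant = min(variant_positions)
    last_variant = max(variant_positions)
    depth_start = min(start for start, _ in depth_intervals)
    depth_end = max(end for _, end in depth_intervals)

    if first_variant < depth_start:
        raise ValueError(
            f"{chrom}: first variant at {first_variant} precedes "
            f"first depth record at {depth_start}"
        )
    if last_variant > depth_end:
        raise ValueError(
            f"{chrom}: last variant at {last_variant} follows "
            f"last depth record ending at {depth_end}"
        )
```

Failing early like this turns a silent hang into an error message that names the chromosome and the offending position.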
Thanks, I will check if that is the case.
Yes, indeed, the gVCFs did not start at 1; however, none of them do. Interestingly, the chromosomes with the issue (13, 14, 21) all have a late start, but not exclusively so (15, 21); please see the attached plot. However, they are not exact: block0.bcf (a working bcf file)
block2500.bcf (a problematic bcf file)
In fact, all the problematic chromosomes have multiple start locations:
Ah. Are these Illumina GVCFs? This tool only works on files generated by Illumina's variant calling pipeline. Unfortunately, there is quite a variety of GVCFs produced by different tools with different formatting, and I cannot handle them all. We just discussed this in #10 as well. I am going to add some sanity checks to try to catch these problems earlier on. I have also tried to make this clearer in the README.
Thanks! Alas, they are not Illumina's gVCFs. I apologize, I should have said so at the start. We did get quite far with the agg tool, but of course I understand that you can't support all gVCFs. Many thanks for your time!
Sorry about that! FYI, GATK also has its own GVCF merging/genotyping pipeline specific to its format.
Hi, many thanks for this tool. I noticed, when merging files per chromosome, that for 3 out of 22 chromosomes the tool would enter an infinite loop. This seemed to be the case when, at the start of a file, a record was present only in the .bcf file of a particular block, but not in the .dpt file.
The attached fix exits the loop, but I am uncertain whether it fixes the problem properly and/or in the most efficient manner.
Best wishes
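
To make the failure mode concrete, here is a rough sketch of a two-cursor walk over sorted variant positions and sorted depth intervals that always terminates, even when a variant at the start of the file has no covering depth record. It is not the patch attached to this pull request and does not reflect the tool's internals; the function name, the `(start, end, depth)` tuple layout, and the use of `None` for an uncovered variant are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def depth_at_variants(
    variant_positions: Sequence[int],
    depth_intervals: Sequence[Tuple[int, int, int]],
) -> List[Tuple[int, Optional[int]]]:
    """Pair each variant position with the depth of its covering interval.

    Both inputs are assumed sorted by position; intervals are inclusive
    (start, end, depth) tuples. An uncovered variant yields None instead
    of waiting for a depth record that will never arrive.
    """
    results: List[Tuple[int, Optional[int]]] = []
    i = 0  # cursor into depth_intervals; it only ever moves forward
    for pos in variant_positions:
        # Skip depth intervals that end before this variant.
        while i < len(depth_intervals) and depth_intervals[i][1] < pos:
            i += 1
        if i < len(depth_intervals) and depth_intervals[i][0] <= pos:
            results.append((pos, depth_intervals[i][2]))
        else:
            # The variant lies before the next interval or after the last one.
            results.append((pos, None))
    return results
```

Every iteration consumes either a variant or a depth interval, so the walk cannot spin on a variant that precedes the first depth record. Whether reporting a missing depth is acceptable, or whether the run should instead stop with an error, is a separate design decision.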