Skip to content
This repository has been archived by the owner on Feb 7, 2018. It is now read-only.

{AH} break infinite loop in certain files #7

Closed

Conversation

AndreasHegerGenomics
Copy link

Hi, many thanks for this tool. I noticed when merging files per chromosome, that for 3 out of 22 chromosomes the tool would enter an infinite loop. This seemed to be the case when at the start of a file a record was only present in the .bcf file of a particular block, but not in the .dpt file.

The attached fix exists the loop but I am uncertain if it fixes the problem properly and/or in the most efficient manner.

Best wishes

@jaredo
Copy link
Contributor

jaredo commented May 12, 2016

Hi Andreas,

Thanks for the bug report and fix. I am on holidays at the moment so won't be able to examine the pull request in detail for ~2 weeks.

I am a little worried that this problem is triggered by a bug somewhere upstream in agg ingest. The .dpt file should always start at a position <= the first variant record in the corresponding bcf.

cheers,

Jared

@AndreasHegerGenomics
Copy link
Author

Many thanks, enjoy your holidays!

@jaredo
Copy link
Contributor

jaredo commented May 26, 2016

Taking a look at this now. I am having difficulty recreating the issue. I know you probably can't share data, but could you sketch out what the files look like?

I think what is happening is you have some chunk files like:

#dpt:
chr1    10100   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0
chr1    10101   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0
chr1    10102   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0
chr1    10103   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0

#bcf:
chr1    10086   .   A   T   1   .   .   GT:GQ:GQX:DP:DPF:AD:PF  ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. 0/1:26:2:3:1:2,1:0
chr1    10145   .   A   T   3   .   .   GT:GQ:GQX:DP:DPF:AD:PF  ./.:.:.:.:.:.:. 0/1:31:7:3:8:2,1:0  0/1:27:4:2:4:1,1:0
chr1    10147   .   C   A   3   .   .   GT:GQ:GQX:DP:DPF:AD:PF  ./.:.:.:.:.:.:. 0/1:31:7:3:7:2,1:0  0/1:27:4:2:4:1,1:0
chr1    10161   .   C   A   2   .   .   GT:GQ:GQX:DP:DPF:AD:PF  0/1:28:8:2:1:1,1:0  ./.:.:.:.:.:.:. ./.:.:.:.:

and this indeed causes an infinite loop. I think this is actually a bug upstream so I would like to fix it during the ingestion process (since it might be causing silent errors elsewhere) rather than the genotyping procedure.

@AndreasHegerGenomics
Copy link
Author

Thanks, I have been on the road. These are the commands I run on the unpatched version, but setting the DEBUG level to 5:

rm -rf test.bcf; agg genotype --thread 4 --output-file test.bcf --output-type b -r 13 block_2500.bcf
block_2500.bcf 
n_threads = 4
500 samples
Writing output to test.bcf
rid = 12 (13)
dp_chr = 12    dp_pos = 19020017
(l1,l2) = 1 9
(var_start,var_stop) = (19020000,19020001)
var_pos = (19020000,19020001)  var_type=1

This is where it hangs in strace:

...
lseek(3, 17463342372, SEEK_SET)         = 17463342372
read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\275\6\355\235=\253$E\24\206\337\252\256\356\271\313"..., 32768) = 32768

Is this sufficient? When I debugged the issue I did find the situation you describe above. Unfortunately I have already deleted intermediate files and have only the blocks.

Interestingly, not all blocks or chromosomes have this issue, for example the following works:

rm -rf test.bcf; agg genotype --thread 4 --output-file test.bcf --output-type b -r 10 block_2500.bcf
rm -rf test.bcf; agg genotype --thread 4 --output-file test.bcf --output-type b -r 13 block_500.bcf

@jaredo jaredo mentioned this pull request Jun 5, 2016
@jaredo
Copy link
Contributor

jaredo commented Jun 6, 2016

Thanks for the detailed info. Were the input files for this whole genome GVCFs or had they been pre-processed in someway?

Just been discussing this in #9 and a user has encountered the same bug. In this case, it was triggered by the input GVCFs being sliced into smaller regions beforehand.

@AndreasHegerGenomics
Copy link
Author

Hi @jaredo , the gVCFs were not sliced beforhand, but each block of 500 contained the full genome. The genotyping step was done per chromosome, but the data sets themselves were not sliced.

Best wishes,
Andreas

@jaredo
Copy link
Contributor

jaredo commented Jun 7, 2016

hmm...is it possible that any of the input GVCFs had rows with POS>1 for the first line of a chromosome?

My concern is that if the dpt file did not flank the variants at the start of the file, then you will not get correct DP/GQ information at the non-ALT samples. So I think this is a bug in agg ingest...but I can't find it!

I think my interim solution will be to check that the .dpt file entirely flanks the variants in the .bcf file, if this is not the case, I will throw an informative error.

@AndreasHegerGenomics
Copy link
Author

Thanks, I will check if that is the case.

@AndreasHegerGenomics
Copy link
Author

AndreasHegerGenomics commented Jun 7, 2016

Yes, indeed, the gVCFs did not start at 1, however none of them do.Interestingly, the chromosomes with issue (13, 14, 21) all have a late start, but not exclusively so (15, 21), please see the attached plot.
While not being 1, the start points are broadly consistent across all samples:

chromosamal_starts

However, they are not exact:

block0.bcf (a working bcf file)

13      19020018        .       T       <NON_REF>       3       .       .       GT:DP:MIN_DP:GQX:PF     0/0:2:0:.:1     0/0:2:0:.:1
13      19020145        .       G       T       73      .       .       GT:PL:GQX:AD:DP:VAF:PF  ./.:.:.:.:.:.:. ./.:.:.:.:.:.:.

block2500.bcf (a problematic bcf file)

13      19020001        .       G       <NON_REF>       .       .       .       GT:DP:MIN_DP:GQX:PF     ./.:.:.:.:.     ./.:.:.:.:.
13      19020018        .       T       <NON_REF>       .       .       .       GT:DP:MIN_DP:GQX:PF     0/0:0:0:.:1     0/0:2:0:.:1

In fact, all the problematic chromosomes have multiple start locations:

1 [10001]
2 [10001]
3 [60001]
4 [10001]
5 [10001]
6 [60001]
7 [10001]
8 [10001]
9 [10001]
10 [60001]
11 [60001]
12 [60001]
13 [19020018 19020001]
14 [19000564 19000060 19000422 19000368 19000224]
15 [20000001]
16 [60001]
17 [1]
18 [10001]
19 [60001]
20 [60001 60998]
21 [9412179 9411327 9411500 9411410 9411785 9411602 9411318 9412126 9411645
 9412077 9411609 9411194 9411928 9412105 9411303 9411942 9412099 9411998
 9411907 9411732]
22 [16050017]

@jaredo
Copy link
Contributor

jaredo commented Jun 7, 2016

Ah. Are these Illumina GVCFs? This tool only works on files generated by Illumina's variant calling pipeline. Unfortunately there are quite a variety of GVCFs produced by different tools with different formatting and I cannot handle them all.

We just discussed this in #10 as well.

I am going to add some sanity checks to try and prevent these problems earlier on. I have also tried to make this clearer the README.

@AndreasHegerGenomics
Copy link
Author

Thanks! Alas, they are not Illumina's gVCFs. I apologize, I should have said so at the start. We do got quite far with the agg tool, but of course I understand that you can't support all gVCFs. Many thanks for your time!

@jaredo
Copy link
Contributor

jaredo commented Jun 7, 2016

Sorry about that!

FYI GATK also has its own GVCF merging/genotyping pipeline specific for its format.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants