{AH} break infinite loop in certain files #7

AndreasHegerGenomics · 2016-05-12T13:48:11Z

Hi, many thanks for this tool. I noticed when merging files per chromosome, that for 3 out of 22 chromosomes the tool would enter an infinite loop. This seemed to be the case when at the start of a file a record was only present in the .bcf file of a particular block, but not in the .dpt file.

The attached fix exists the loop but I am uncertain if it fixes the problem properly and/or in the most efficient manner.

Best wishes

jaredo · 2016-05-12T16:38:39Z

Hi Andreas,

Thanks for the bug report and fix. I am on holidays at the moment so won't be able to examine the pull request in detail for ~2 weeks.

I am a little worried that this problem is triggered by a bug somewhere upstream in agg ingest. The .dpt file should always start at a position <= the first variant record in the corresponding bcf.

cheers,

Jared

AndreasHegerGenomics · 2016-05-13T08:10:27Z

Many thanks, enjoy your holidays!

jaredo · 2016-05-26T15:55:17Z

Taking a look at this now. I am having difficulty recreating the issue. I know you probably can't share data, but could you sketch out what the files look like?

I think what is happening is you have some chunk files like:

#dpt:
chr1    10100   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0
chr1    10101   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0
chr1    10102   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0
chr1    10103   .   N   .   .   .   .   DP:GQ   1:0 1:0 3:0

#bcf:
chr1    10086   .   A   T   1   .   .   GT:GQ:GQX:DP:DPF:AD:PF  ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. 0/1:26:2:3:1:2,1:0
chr1    10145   .   A   T   3   .   .   GT:GQ:GQX:DP:DPF:AD:PF  ./.:.:.:.:.:.:. 0/1:31:7:3:8:2,1:0  0/1:27:4:2:4:1,1:0
chr1    10147   .   C   A   3   .   .   GT:GQ:GQX:DP:DPF:AD:PF  ./.:.:.:.:.:.:. 0/1:31:7:3:7:2,1:0  0/1:27:4:2:4:1,1:0
chr1    10161   .   C   A   2   .   .   GT:GQ:GQX:DP:DPF:AD:PF  0/1:28:8:2:1:1,1:0  ./.:.:.:.:.:.:. ./.:.:.:.:

and this indeed causes an infinite loop. I think this is actually a bug upstream so I would like to fix it during the ingestion process (since it might be causing silent errors elsewhere) rather than the genotyping procedure.

AndreasHegerGenomics · 2016-06-01T11:01:45Z

Thanks, I have been on the road. These are the commands I run on the unpatched version, but setting the DEBUG level to 5:

rm -rf test.bcf; agg genotype --thread 4 --output-file test.bcf --output-type b -r 13 block_2500.bcf
block_2500.bcf 
n_threads = 4
500 samples
Writing output to test.bcf
rid = 12 (13)
dp_chr = 12    dp_pos = 19020017
(l1,l2) = 1 9
(var_start,var_stop) = (19020000,19020001)
var_pos = (19020000,19020001)  var_type=1

This is where it hangs in strace:

...
lseek(3, 17463342372, SEEK_SET)         = 17463342372
read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\275\6\355\235=\253$E\24\206\337\252\256\356\271\313"..., 32768) = 32768

Is this sufficient? When I debugged the issue I did find the situation you describe above. Unfortunately I have already deleted intermediate files and have only the blocks.

Interestingly, not all blocks or chromosomes have this issue, for example the following works:

rm -rf test.bcf; agg genotype --thread 4 --output-file test.bcf --output-type b -r 10 block_2500.bcf
rm -rf test.bcf; agg genotype --thread 4 --output-file test.bcf --output-type b -r 13 block_500.bcf

jaredo · 2016-06-06T18:00:51Z

Thanks for the detailed info. Were the input files for this whole genome GVCFs or had they been pre-processed in someway?

Just been discussing this in #9 and a user has encountered the same bug. In this case, it was triggered by the input GVCFs being sliced into smaller regions beforehand.

AndreasHegerGenomics · 2016-06-07T08:07:58Z

Hi @jaredo , the gVCFs were not sliced beforhand, but each block of 500 contained the full genome. The genotyping step was done per chromosome, but the data sets themselves were not sliced.

Best wishes,
Andreas

jaredo · 2016-06-07T09:50:22Z

hmm...is it possible that any of the input GVCFs had rows with POS>1 for the first line of a chromosome?

My concern is that if the dpt file did not flank the variants at the start of the file, then you will not get correct DP/GQ information at the non-ALT samples. So I think this is a bug in agg ingest...but I can't find it!

I think my interim solution will be to check that the .dpt file entirely flanks the variants in the .bcf file, if this is not the case, I will throw an informative error.

AndreasHegerGenomics · 2016-06-07T10:54:52Z

Thanks, I will check if that is the case.

AndreasHegerGenomics · 2016-06-07T15:17:58Z

Yes, indeed, the gVCFs did not start at 1, however none of them do.Interestingly, the chromosomes with issue (13, 14, 21) all have a late start, but not exclusively so (15, 21), please see the attached plot.
While not being 1, the start points are broadly consistent across all samples:

However, they are not exact:

block0.bcf (a working bcf file)

13      19020018        .       T       <NON_REF>       3       .       .       GT:DP:MIN_DP:GQX:PF     0/0:2:0:.:1     0/0:2:0:.:1
13      19020145        .       G       T       73      .       .       GT:PL:GQX:AD:DP:VAF:PF  ./.:.:.:.:.:.:. ./.:.:.:.:.:.:.

block2500.bcf (a problematic bcf file)

13      19020001        .       G       <NON_REF>       .       .       .       GT:DP:MIN_DP:GQX:PF     ./.:.:.:.:.     ./.:.:.:.:.
13      19020018        .       T       <NON_REF>       .       .       .       GT:DP:MIN_DP:GQX:PF     0/0:0:0:.:1     0/0:2:0:.:1

In fact, all the problematic chromosomes have multiple start locations:

1 [10001]
2 [10001]
3 [60001]
4 [10001]
5 [10001]
6 [60001]
7 [10001]
8 [10001]
9 [10001]
10 [60001]
11 [60001]
12 [60001]
13 [19020018 19020001]
14 [19000564 19000060 19000422 19000368 19000224]
15 [20000001]
16 [60001]
17 [1]
18 [10001]
19 [60001]
20 [60001 60998]
21 [9412179 9411327 9411500 9411410 9411785 9411602 9411318 9412126 9411645
 9412077 9411609 9411194 9411928 9412105 9411303 9411942 9412099 9411998
 9411907 9411732]
22 [16050017]

jaredo · 2016-06-07T15:42:07Z

Ah. Are these Illumina GVCFs? This tool only works on files generated by Illumina's variant calling pipeline. Unfortunately there are quite a variety of GVCFs produced by different tools with different formatting and I cannot handle them all.

We just discussed this in #10 as well.

I am going to add some sanity checks to try and prevent these problems earlier on. I have also tried to make this clearer the README.

AndreasHegerGenomics · 2016-06-07T15:58:26Z

Thanks! Alas, they are not Illumina's gVCFs. I apologize, I should have said so at the start. We do got quite far with the agg tool, but of course I understand that you can't support all gVCFs. Many thanks for your time!

jaredo · 2016-06-07T15:59:21Z

Sorry about that!

FYI GATK also has its own GVCF merging/genotyping pipeline specific for its format.

{AH} break infinite loop in certain files

beeaa0e

jaredo mentioned this pull request Jun 5, 2016

agg runs forever #9

Closed

AndreasHegerGenomics closed this Jun 7, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

{AH} break infinite loop in certain files #7

{AH} break infinite loop in certain files #7

AndreasHegerGenomics commented May 12, 2016

jaredo commented May 12, 2016

AndreasHegerGenomics commented May 13, 2016

jaredo commented May 26, 2016

AndreasHegerGenomics commented Jun 1, 2016

jaredo commented Jun 6, 2016

AndreasHegerGenomics commented Jun 7, 2016

jaredo commented Jun 7, 2016

AndreasHegerGenomics commented Jun 7, 2016

AndreasHegerGenomics commented Jun 7, 2016 •

edited

Loading

jaredo commented Jun 7, 2016

AndreasHegerGenomics commented Jun 7, 2016

jaredo commented Jun 7, 2016

{AH} break infinite loop in certain files #7

{AH} break infinite loop in certain files #7

Conversation

AndreasHegerGenomics commented May 12, 2016

jaredo commented May 12, 2016

AndreasHegerGenomics commented May 13, 2016

jaredo commented May 26, 2016

AndreasHegerGenomics commented Jun 1, 2016

jaredo commented Jun 6, 2016

AndreasHegerGenomics commented Jun 7, 2016

jaredo commented Jun 7, 2016

AndreasHegerGenomics commented Jun 7, 2016

AndreasHegerGenomics commented Jun 7, 2016 • edited Loading

jaredo commented Jun 7, 2016

AndreasHegerGenomics commented Jun 7, 2016

jaredo commented Jun 7, 2016

AndreasHegerGenomics commented Jun 7, 2016 •

edited

Loading