Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option -r / --regions off-by-1: returns indels one past the end of the region #1420

Closed
jmarshall opened this issue Feb 18, 2021 · 7 comments
Closed

Comments

@jmarshall
Copy link
Member

jmarshall commented Feb 18, 2021

Consider events.vcf, which ends with the following VCF records:

… …
chr1	199	s3	A	T	.	PASS	.
chr1	199	d5f	AT	A	.	PASS	.
chr1	199	i4f	A	ATTTT	.	PASS	.
chr1	200	s4f	A	T	.	PASS	.
chr1	200	outd2	AT	A	.	AFTER	.
chr1	200	outi4	A	AT	.	AFTER	.
chr1	201	outs2	A	T	.	AFTER	.
chr1	201	outd3	AT	A	.	AFTER	.
chr1	201	outi5	A	ATTTT	.	AFTER	.

If we query this for chr1:100-200, we would expect to receive the PASS records but not the AFTER records. In particular for the records with POS=200, we would expect get s4f back as the base at position 200 has been changed, but IMNSHO not outd2 or outi4 as they only affect bases to the right of position 200 so are outwith the specified region.

However bcftools's index-based -r/--regions/-R/--regions-file query (both as released and on develop) returns all three of these records:

$ bgzip events.vcf && bcftools index events.vcf.gz
$ bcftools view -r chr1:100-200 events.vcf.gz
##fileformat=VCF4.3
… …
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
… …
chr1	199	s3	A	T	.	PASS	.
chr1	199	d5f	AT	A	.	PASS	.
chr1	199	i4f	A	ATTTT	.	PASS	.
chr1	200	s4f	A	T	.	PASS	.
chr1	200	outd2	AT	A	.	AFTER	.
chr1	200	outi4	A	AT	.	AFTER	.

@jkbonfield
Copy link
Contributor

It depends on the question being asked - is it to return records whose POS field is within this range or records whose base changes are within this range?

I'm curious to know how other tools handle this scenario. Tabix would clearly return those two records, and making it do anything else would be almost impossible as it's a general purpose query that doesn't understand things like VCF indel syntax. I can't find anything in Picard that can subset VCFs, and GATK SelectVariants just tells me "no suitable codecs found". I've never quite got to grips with the appropriate hoops to jump there. :(

@jmarshall
Copy link
Member Author

That is true. There is a purely computational question about overlapping with the coordinates subtended by the VCF record, and there is the scientifically meaningful question about finding events that overlap with the specified region. Personally I expect bcftools view to facilitate the scientific question first…

@jkbonfield
Copy link
Contributor

Yes, but it's not helpful to users if different tools use different methods of assessing overlaps, hence why I say we need to look at other programs. It helps no one to have yet more variation across bioinformatics tools.

Alas my arcane knowledge is insufficient. If you know the magic runes required to get GATK to do anything then give it a whirl. I got nowhere. I also couldn't get Picard to swallow it, regardless what version I put in the header. It simply kept saying "VCF4."whatever isn't a valid version, but it wouldn't tell me what versions it does support.

@pd3
Copy link
Member

pd3 commented Sep 3, 2021

One note, records such as

1	200	.	AT	A

where the true variation starts after the end of the region (e.g. 1:100-200) should be reported as well. Otherwise, if we insisted on the program reporting only real variation, then records with empty ALT column such as

1	200	.	A	.

should be never printed, which is obviously not what we want.

@jmarshall
Copy link
Member Author

The raison d'être of this bug report is that 1 200 . AT A does not touch the base at 200 therefore should NOT be reported by a query for 1:100-200.

The definition of what bases 1 200 . A . should be considered to touch is a separate question. Normally in VCF both REF and ALT are required to be non-empty; ALT=. is a special case. The spec says ALT=. indicates “no variant [at that position, presumably]” so I think programs should consider it to touch the single base in the range 200–200. Hence it would be reported by a query for 1:100-200.

(This bug report is that -r returns slightly too many hits due to the “touched bases” for indels being slightly too generous. It is a minor bug compared to #1421, in which -t returns too few hits — compared to the expectations of users who didn't read the documentation extremely carefully.)

@pd3
Copy link
Member

pd3 commented Sep 6, 2021

Handling indels is difficult and it is not always clear what is the right thing to do. For example, if the deleted T is part of a homopolymer run, we cannot distinguish which of the T's was actually deleted. In other situations, one just wants to split the data into reasonably sized chunks without caring about what is happening on the sequence level. So, some would argue, as you do, that "it should NOT be reported", others say that it "MUST" be reported. There is no single universally correct approach, the only solution is to make the behavior optional. I have a half-finished code that I put aside a while ago, I may revisit it.

pd3 added a commit to pd3/htslib that referenced this issue Sep 7, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
@pd3 pd3 closed this as completed in 0d04159 Sep 7, 2021
@jmarshall jmarshall reopened this Sep 7, 2021
pd3 added a commit to pd3/htslib that referenced this issue Sep 8, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
pd3 added a commit to pd3/htslib that referenced this issue Sep 8, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
pd3 added a commit to pd3/htslib that referenced this issue Sep 8, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
whitwham pushed a commit to samtools/htslib that referenced this issue Sep 9, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
@jkbonfield
Copy link
Contributor

Given samtools/htslib#1327 is now merged and the corresponding bcftools commits are in (0d04159) we believe this to be fixed.

@samtools samtools deleted a comment from jmarshall Sep 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants