Option -r / --regions off-by-1: returns indels one past the end of the region #1420
Comments
It depends on the question being asked: is it to return records whose POS field is within this range, or records whose base changes are within this range? I'm curious to know how other tools handle this scenario. Tabix would clearly return those two records, and making it do anything else would be almost impossible as it's a general purpose query that doesn't understand things like VCF indel syntax. I can't find anything in Picard that can subset VCFs, and GATK SelectVariants just tells me "no suitable codecs found". I've never quite got to grips with the appropriate hoops to jump there. :(
That is true. There is a purely computational question about overlapping with the coordinates subtended by the VCF record, and there is the scientifically meaningful question about finding events that overlap with the specified region. Personally I expect |
Yes, but it's not helpful to users if different tools use different methods of assessing overlaps, which is why I say we need to look at other programs. It helps no one to have yet more variation across bioinformatics tools. Alas my arcane knowledge is insufficient. If you know the magic runes required to get GATK to do anything then give it a whirl. I got nowhere. I also couldn't get Picard to swallow it, regardless of what version I put in the header. It simply kept saying "VCF4."whatever isn't a valid version, but it wouldn't tell me what versions it does support.
One note, records such as
where the true variation starts after the end of the region (e.g.
should be never printed, which is obviously not what we want. |
The raison d'être of this bug report is that The definition of what bases (This bug report is that |
Handling indels is difficult and it is not always clear what the right thing to do is. For example, if the deleted T is part of a homopolymer run, we cannot distinguish which of the T's was actually deleted. In other situations, one just wants to split the data into reasonably sized chunks without caring about what is happening on the sequence level. So some would argue, as you do, that "it should NOT be reported", while others say that it "MUST" be reported. There is no single universally correct approach; the only solution is to make the behavior optional. I have some half-finished code that I put aside a while ago, and I may revisit it.
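To make the homopolymer point concrete, here is a small Python sketch. The reference fragment and offsets are made up for illustration and do not come from the VCF in this issue. It deletes each T of a run in turn and shows that every choice yields the same sequence, which is why the position of such a deletion is only a convention.

```python
# Hypothetical reference fragment with a run of four Ts at 0-based offsets 2..5.
reference = "ACTTTTG"
run_offsets = range(2, 6)

# Delete each T of the run in turn and collect the resulting sequences.
haplotypes = {reference[:i] + reference[i + 1:] for i in run_offsets}

print(haplotypes)  # {'ACTTTG'}: all four deletions are indistinguishable
```

Because the outcomes are identical, whether such a deletion "falls inside" a region can depend on which equivalent representation the caller happened to emit.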
This is to address a long-standing design flaw in handling regions and targets, as described in these BCFtools issues: samtools/bcftools#1420 samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for restricting VCF/BCF files by region: one is for streaming (`-t/-T`) and one is for index-jumping (`-r/-R`). They behave differently: the first includes only records with the POS coordinate within the regions, the other includes records that overlap the regions.

This allows the default behavior to be modified and provides three options:
- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the difference between `TC>T-` and `C>-`

Most importantly, this allows the regions and targets to be made to behave the same way. Note that the default behavior remains unchanged.
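The three options described above can be sketched in Python as follows. This is an illustration only, not the HTSlib implementation: the function names and the mode labels `"pos"`, `"record"` and `"variant"` are chosen here for the example, coordinates are 1-based inclusive, and a pure insertion is assumed to sit on the base immediately after its anchor base.

```python
def record_span(pos, ref):
    """Interval covered by the REF allele as written in the VCF record."""
    return pos, pos + len(ref) - 1


def variant_span(pos, ref, alt):
    """Interval of the bases actually changed, after dropping the shared prefix."""
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    start = pos + n
    end = pos + len(ref) - 1
    if end < start:
        # Pure insertion (no REF bases left): assumed to sit at the next base.
        end = start
    return start, end


def overlaps(mode, region_start, region_end, pos, ref, alt):
    """Return True if the record is selected for the region under the given mode."""
    if mode == "pos":
        start = end = pos
    elif mode == "record":
        start, end = record_span(pos, ref)
    elif mode == "variant":
        start, end = variant_span(pos, ref, alt)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return start <= region_end and end >= region_start


# A deletion written as TC>T at POS=200, queried against 100-200:
for mode in ("pos", "record", "variant"):
    print(mode, overlaps(mode, 100, 200, 200, "TC", "T"))
```

Under these assumptions the deletion is selected in the `"pos"` and `"record"` modes but not in the `"variant"` mode, since the only deleted base is at position 201. The same deletion at POS=99 would be skipped by `"pos"` but selected by the other two.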
Given samtools/htslib#1327 is now merged and the corresponding bcftools commits are in (0d04159), we believe this to be fixed.
Consider events.vcf, which ends with the following VCF records:
If we query this for `chr1:100-200`, we would expect to receive the `PASS` records but not the `AFTER` records. In particular for the records with POS=200, we would expect to get `s4f` back as the base at position 200 has been changed, but IMNSHO not `outd2` or `outi4` as they only affect bases to the right of position 200 so are outwith the specified region.

However bcftools's index-based `-r`/`--regions`/`-R`/`--regions-file` query (both as released and on develop) returns all three of these records: