Push validation checks down to INFO/FORMAT fields #1676
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion level, which keeps a variant/genotype record around even if a bad INFO/FORMAT field was attached to that record.
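For concreteness, a minimal sketch of what field-level validation can look like; the object and method names here are illustrative, not the code in this PR:

```scala
import htsjdk.samtools.ValidationStringency
import org.slf4j.LoggerFactory

// Illustrative only (not ADAM's actual converter code): convert a single
// INFO/FORMAT field, and on failure drop just that field per the
// stringency, instead of discarding the whole variant/genotype record.
object FieldLevelValidation {
  private val log = LoggerFactory.getLogger(getClass)

  def convertField[T](fieldName: String,
                      stringency: ValidationStringency)(convert: => T): Option[T] = {
    try {
      Some(convert)
    } catch {
      case e: IllegalArgumentException => stringency match {
        case ValidationStringency.STRICT => throw e
        case ValidationStringency.LENIENT =>
          log.warn(s"Dropping bad field $fieldName: ${e.getMessage}")
          None
        case _ => None // SILENT: drop the field without logging
      }
    }
  }
}
```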
I might need some convincing that this is the right thing to do. For example, in #1512, if the …
The way I think about it is that if you have a … As an aside, …
I think #1566 is different: all the values are bad, and validation chokes on the header line.
With this patch, you can load all of the variants in the GIAB VCFs by setting the validation stringency to lenient. You lose the PS fields, but you get all the variants.
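Concretely, that load presumably looks something like the following; the GIAB path is a placeholder, and the `stringency` parameter on `loadVcf` is assumed from ADAM's API of this era:

```scala
import htsjdk.samtools.ValidationStringency
import org.bdgenomics.adam.rdd.ADAMContext._

// With lenient stringency, bad PS fields are dropped with a warning
// instead of failing the whole load. Path is a placeholder.
val variantContexts = sc.loadVcf(
  "giab/HG001_GRCh37_GIAB_highconf.vcf.gz",
  stringency = ValidationStringency.LENIENT)
```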
What does that look like?
Shows as: …
If we've kept the genotypes as phased but dropped PS, note the spec caveat: "All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set."
Is this from the input VCF? I would imagine attempting to write as VCF or ADAM format will throw the same exception.
For what it is worth, on the PS Type=String issue specifically, this works ok, and allows ADAM to read and write after remapping: …
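The original snippet did not survive extraction; a plausible reconstruction of that kind of header remap, using htsjdk header-line types (the target type and reuse of the description are assumptions):

```scala
import htsjdk.variant.vcf.{ VCFFormatHeaderLine, VCFHeaderLine, VCFHeaderLineType }

// Sketch only: replace the PS FORMAT header line with one whose declared
// type matches what the reader expects (Integer here is an assumption;
// pick whichever type resolves the mismatch for your file).
def remapPs(headerLines: Seq[VCFHeaderLine]): Seq[VCFHeaderLine] = {
  headerLines.map {
    case fl: VCFFormatHeaderLine if fl.getID == "PS" =>
      new VCFFormatHeaderLine("PS", 1, VCFHeaderLineType.Integer,
        fl.getDescription)
    case line => line
  }
}
```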
Yeah, that approach is fine for the GIAB files, which have a single, well-defined validation error and are reasonably small. That said, it's pretty easy to imagine someone putting together a VCF where many lines fail validation due to bad tags, or where multiple tags fail validation. As long as our default is strict validation, I don't think there's any harm in adding this, since the user will get an exception the first time they load the VCF and will have to choose to override the stringency.
I would think of it this way: if we validate at the tag level, users can load data that they know is bad and then write code in ADAM (similar to what you did in dsh) to fix the bad fields. If we don't validate at the tag level, they just lose the records. As a compromise, I offer this: how about we go from one stringency field to two? We could have a first validation flag for tags (throw/warn/ignore for bad tags) and a second for records. This way, we can allow people to be really permissive (load records that they know have bad tags, because either they don't care about the tags or they have code that can fix them), while also keeping the system very conservative by default. See the sketch below.
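A minimal sketch of what that two-flag API might look like; the parameter names and shape are assumptions, not a merged design:

```scala
import htsjdk.samtools.ValidationStringency

// Hypothetical two-knob configuration: tagStringency governs bad
// INFO/FORMAT tags (throw / warn-and-drop / silently drop), while
// recordStringency governs whole variant/genotype records.
case class ConverterStringency(
  tagStringency: ValidationStringency = ValidationStringency.STRICT,
  recordStringency: ValidationStringency = ValidationStringency.STRICT)
```

With both knobs defaulting to STRICT, the conservative behavior is preserved unless a user explicitly opts out at one level or both.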
Throwing out either the offending record or the values for an offending tag across all records seems ok to me. Is that what you are suggesting?
#1695 is an instance of this.
Currently, we toss any genotype/variant that fails any part of validation. We should just drop the failing INFO/FORMAT tag, if possible.