
Push validation checks down to INFO/FORMAT fields #1676

Closed
fnothaft opened this issue Aug 22, 2017 · 12 comments
@fnothaft
Member

Currently, we toss any genotype/variant that fails any part of validation. We should just drop the failing INFO/FORMAT tag, if possible.
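Concretely, something like the following (a sketch only, assuming htsjdk's ValidationStringency; convertField is a hypothetical helper, not the actual converter code): a failed conversion of a single tag should drop just that tag under lenient stringency, rather than throwing away the whole variant context.

import htsjdk.samtools.ValidationStringency
import scala.util.{Failure, Success, Try}

// Hypothetical per-field converter. On failure, STRICT rethrows (losing the
// record, as today), while LENIENT and SILENT drop only the offending tag and
// keep the rest of the variant/genotype record intact.
def convertField[T](tag: String,
                    convert: () => T,
                    stringency: ValidationStringency): Option[T] = {
  Try(convert()) match {
    case Success(v) => Some(v)
    case Failure(t) => stringency match {
      case ValidationStringency.STRICT => throw t
      case ValidationStringency.LENIENT =>
        System.err.println(s"Dropping bad field ${tag}: ${t.getMessage}")
        None
      case _ => None // SILENT: drop the tag without logging
    }
  }
}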

@fnothaft fnothaft added the bug label Aug 22, 2017
@fnothaft fnothaft self-assigned this Aug 22, 2017
fnothaft added a commit to fnothaft/adam that referenced this issue Aug 22, 2017
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
@heuermh
Member

heuermh commented Aug 22, 2017

I might need some convincing this is the right thing to do.

For example, in #1512, if the +Inf value is for a mapped field (say, Variant.alleleFrequency), we'd be left with a variant with null for that field, which would later be written out as the missing value (.).
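(For reference, assuming the value goes through Java float parsing as in htsjdk, that's why +Inf trips conversion at all: Java accepts "Infinity" but not "Inf".)

scala> java.lang.Float.parseFloat("+Infinity")
res0: Float = Infinity

scala> java.lang.Float.parseFloat("+Inf")
java.lang.NumberFormatException: For input string: "+Inf"
...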

@fnothaft
Member Author

The way I think about it is that if you have a +Inf value that you can't parse and then drop the whole record, you've lost all of the data at the variant site for all samples (because we drop the whole offending variant context) just because a single tag (which may or may not have been useful) was bad. E.g., in the context of #1566, this would allow us to keep the variants with the bad phase set IDs; we would just drop the phase set tag on those records.

As an aside, null shouldn't be written out as a . unless there are multiple genotypes at the variant site that have that tag.

@heuermh
Member

heuermh commented Aug 22, 2017

I think #1566 is different: all the values are bad, and validation chokes on the header line.

@fnothaft
Member Author

With this patch, you can load all of the variants in the GIAB VCFs by setting validation stringency to lenient. You lose the PS fields, but you get all the variants.

@heuermh
Member

heuermh commented Aug 22, 2017

You lose the PS fields, but you get all the variants.

What does that look like? Genotype.phased set to false or unset (null)? Genotype.phaseSetId set to null? Is the incorrect header ##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT"> still present? Can you write back out to VCF or to ADAM format?

@fnothaft
Member Author

fnothaft commented Aug 22, 2017

Shows as phased=true:

zIRISz:adam fnothaft$ ./bin/adam-submit print HG002.gt/ | grep 817185
2017-08-22 10:11:36 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
{"variant": {"contigName": "chr1", "start": 817185, "end": 817186, "names": [], "referenceAllele": "G", "alternateAllele": "A", "filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "annotation": {"ancestralAllele": null, "alleleCount": null, "readDepth": null, "forwardReadDepth": null, "reverseReadDepth": null, "referenceReadDepth": null, "referenceForwardReadDepth": null, "referenceReverseReadDepth": null, "alleleFrequency": null, "cigar": null, "dbSnp": null, "hapMap2": null, "hapMap3": null, "validated": null, "thousandGenomes": null, "somatic": false, "transcriptEffects": [], "attributes": {"callable": "CS_HiSeqPE300xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable", "filt": "CS_HiSeqMatePairSentieon_filt", "datasetsmissingcall": "10XChromium,IonExome,SolidSE75bp", "platformnames": "Illumina,CG", "callsetnames": "HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes", "datasets": "4", "callsets": "7", "datasetnames": "HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair", "platforms": "2"}}}, "contigName": "chr1", "start": 817185, "end": 817186, "variantCallingAnnotations": {"filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "downsampled": null, "baseQRankSum": null, "fisherStrandBiasPValue": null, "rmsMapQ": null, "mapq0Reads": null, "mqRankSum": null, "readPositionRankSum": null, "genotypePriors": [], "genotypePosteriors": [], "vqslod": null, "culprit": null, "attributes": {"IGT": "1/1", "ADALL": "0,362"}}, "sampleId": "HG002", "sampleDescription": null, "processingDescription": null, "alleles": ["ALT", "ALT"], "expectedAlleleDosage": null, "referenceReadDepth": 75, "alternateReadDepth": 603, "readDepth": 1131, "minReadDepth": null, "genotypeQuality": 312, "genotypeLikelihoods": [], "nonReferenceLikelihoods": [], "strandBiasComponents": [], "splitFromMultiAllelic": false, "phased": true, "phaseSetId": null, "phaseQuality": null}
zIRISz:adam fnothaft$ cat ~/IdeaProjects/adam/HG002.small.vcf | grep 817186
chr1	817186	.	G	A	50	PASS	platforms=2;platformnames=Illumina,CG;datasets=4;datasetnames=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=7;callsetnames=HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=10XChromium,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable;filt=CS_HiSeqMatePairSentieon_filt	GT:DP:ADALL:AD:GQ:IGT:IPS:PS	1|1:1131:0,362:75,603:312:1/1:.:PATMAT

@heuermh
Member

heuermh commented Aug 22, 2017

..."phased": true, "phaseSetId": null

If we've kept phased as true and removed the phaseSetId, then according to the VCF spec all genotypes in that file are considered phased and in the same phase set.

"All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set."
https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L473

zIRISz:adam fnothaft$ cat ~/IdeaProjects/adam/HG002.small.vcf | grep 817186
chr1 817186 . G A 50 PASS platforms=2;platfo...

Is this from the input VCF? I would imagine attempting to write as VCF or ADAM format will throw the same htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS as in #1566.

@heuermh
Member

heuermh commented Aug 22, 2017

For what it's worth, on the PS Type=String issue specifically, this works OK and allows ADAM to read and write after remapping:

$ dsh-bio remap-phase-set \
  -i truth_small_variants.vcf.gz \
  -o truth_small_variants.remapped_phase_sets.vcf.gz

$ adam-shell

scala> val genotypes = sc.loadGenotypes("truth_small_variants.vcf.gz")
java.lang.IllegalArgumentException: Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
  at org.bdgenomics.adam.converters.VariantContextConverter$.org$bdgenomics$adam$converters$VariantContextConverter$$auditLine$1(VariantContextConverter.scala:175)
...

scala> val genotypes = sc.loadGenotypes("truth_small_variants.remapped_phase_sets.vcf.gz")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD =
RDDBoundGenotypeRDD(MapPartitionsRDD[7] at flatMap at VariantContextRDD.scala:183,SequenceDictionary{...

scala> val g = genotypes.rdd.first()
g: org.bdgenomics.formats.avro.Genotype = {"variant": {"contigName": "chr1", "start": 817185, "end": 817186, "names": [], "referenceAllele": "G", "alternateAllele": "A", "filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "annotation": {"ancestralAllele": null, "alleleCount": null, "readDepth": null, "forwardReadDepth": null, "reverseReadDepth": null, "referenceReadDepth": null, "referenceForwardReadDepth": null, "referenceReverseReadDepth": null, "alleleFrequency": null, "cigar": null, "dbSnp": null, "hapMap2": null, "hapMap3": null, "validated": null, "thousandGenomes": null, "somatic": false, "transcriptEffects": [], "attributes": {"callable": "CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable", "filt": "CS_SolidSE75GATKHC_filt", "datas...

scala> g.getPhaseSetId
res1: Integer = 0

scala> g.getPhased
res2: Boolean = true

@fnothaft
Member Author

Yeah, that approach is fine for the GIAB files, which have a single, well-defined validation error and are reasonably small. That said, it's pretty easy to imagine someone putting together a VCF where many lines fail validation due to bad tags, or where multiple tags fail validation. As long as our default is strict validation, I don't think there's any harm in adding this, since the user will get an exception the first time they load the VCF and will have to choose to override the stringency.

@fnothaft
Member Author

I would think of it this way: if we validate at the tag level, users can load data that they know is bad and then write code in ADAM (similar to what you did in dsh) to fix the bad fields. If we don't validate at the tag level, they just lose the records.
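For instance, something like this (a sketch only: it assumes loadGenotypes takes a stringency argument and that a dropped PS tag leaves phaseSetId null; the replacement ID is a placeholder):

import htsjdk.samtools.ValidationStringency
import org.bdgenomics.formats.avro.Genotype

// Load leniently: records with bad PS tags are kept, the tags are dropped.
val genotypes = sc.loadGenotypes(
  "truth_small_variants.vcf.gz",
  stringency = ValidationStringency.LENIENT)

// Repair the dropped field downstream, e.g. by assigning a placeholder phase
// set ID to phased genotypes that lost theirs during conversion.
val fixed = genotypes.transform(rdd => rdd.map { g: Genotype =>
  if (g.getPhased && g.getPhaseSetId == null) {
    Genotype.newBuilder(g).setPhaseSetId(0).build()
  } else {
    g
  }
})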

As a compromise, I offer this: how about we go from one stringency field to two? We could have a first validation flag for tags (throw/warn/ignore for bad tags) and a second for records. This way, we can allow people to be really permissive (load records that they know have bad tags, either because they don't care about the tags or because they have code that can fix them) while keeping the system very conservative by default.
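Roughly (a hypothetical API sketch, just to make the shape concrete; neither field exists today):

import htsjdk.samtools.ValidationStringency

// Two independent knobs: one for bad INFO/FORMAT tags, one for records that
// fail validation outright.
case class VcfValidationStringency(
  tagStringency: ValidationStringency,    // throw / warn-and-drop / silently drop bad tags
  recordStringency: ValidationStringency) // the same choices for whole records

// Conservative default: fail fast on anything bad.
val default = VcfValidationStringency(
  ValidationStringency.STRICT,
  ValidationStringency.STRICT)

// Very permissive: keep the records, silently dropping bad tags.
val permissive = VcfValidationStringency(
  ValidationStringency.SILENT,
  ValidationStringency.LENIENT)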

@heuermh
Member

heuermh commented Aug 23, 2017

Throwing out either the offending record or the values for an offending tag across all records seems OK to me. Is that what you are suggesting?

@heuermh
Member

heuermh commented Aug 29, 2017

#1695 is an instance of this.

fnothaft added a commit to fnothaft/adam that referenced this issue Sep 12, 2017
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
fnothaft added a commit to fnothaft/adam that referenced this issue Sep 13, 2017
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
heuermh pushed a commit that referenced this issue Sep 14, 2017
Resolves #1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017