
Push validation checks down to INFO/FORMAT fields #1676

Closed
fnothaft opened this issue Aug 22, 2017 · 12 comments
@fnothaft
Member

Currently, we toss any genotype/variant that fails any part of validation. We should just drop the failing INFO/FORMAT tag, if possible.
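Concretely, something like the following (a sketch only, assuming htsjdk's ValidationStringency; convertField is a hypothetical helper, not the actual converter code): a failed conversion of a single tag should drop just that tag under lenient stringency, rather than throwing away the whole variant context.

import htsjdk.samtools.ValidationStringency
import scala.util.{Failure, Success, Try}

// Hypothetical per-field converter. On failure, STRICT rethrows (losing the
// record, as today), while LENIENT and SILENT drop only the offending tag and
// keep the rest of the variant/genotype record intact.
def convertField[T](tag: String,
                    convert: () => T,
                    stringency: ValidationStringency): Option[T] = {
  Try(convert()) match {
    case Success(v) => Some(v)
    case Failure(t) => stringency match {
      case ValidationStringency.STRICT => throw t
      case ValidationStringency.LENIENT =>
        System.err.println(s"Dropping bad field ${tag}: ${t.getMessage}")
        None
      case _ => None // SILENT: drop the tag without logging
    }
  }
}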

@fnothaft fnothaft added the bug label Aug 22, 2017
@fnothaft fnothaft self-assigned this Aug 22, 2017
fnothaft added a commit to fnothaft/adam that referenced this issue Aug 22, 2017
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
@heuermh
Member

heuermh commented Aug 22, 2017

I might need some convincing this is the right thing to do.

For example, in #1512, if the +Inf value is for a mapped field (say, Variant.alleleFrequency), we'd be left with a variant with null for that field, which would later be written out as the missing value (.).
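(For reference, assuming the value goes through Java float parsing as in htsjdk, that's why +Inf trips conversion at all: Java accepts "Infinity" but not "Inf".)

scala> java.lang.Float.parseFloat("+Infinity")
res0: Float = Infinity

scala> java.lang.Float.parseFloat("+Inf")
java.lang.NumberFormatException: For input string: "+Inf"
...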

@fnothaft
Member Author

The way I think about it is that if you have a +Inf value that you can't parse and then drop the whole record, you've lost all of the data at the variant site for all samples (because we drop the whole offending variant context) just because a single tag (which may or may not have been useful) was bad. E.g., in the context of #1566, this would allow us to keep the variants with the bad phase set IDs; we would just drop the phase set tag on those records.

As an aside, null shouldn't be written out as a . unless there are multiple genotypes at the variant site that have that tag.

@heuermh
Member

heuermh commented Aug 22, 2017

I think #1566 is different: all the values are bad, and validation chokes on the header line.

@fnothaft
Member Author

With this patch, you can load all of the variants in the GIAB VCFs by setting validation stringency to lenient. You lose the PS fields, but you get all the variants.

@heuermh
Member

heuermh commented Aug 22, 2017

You lose the PS fields, but you get all the variants.

What does that look like? Genotype.phased set to false or unset (null)? Genotype.phaseSetId set to null? Is the incorrect header ##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT"> still present? Can you write back out to VCF or to ADAM format?

@fnothaft
Member Author

fnothaft commented Aug 22, 2017

Shows as phased=true:

zIRISz:adam fnothaft$ ./bin/adam-submit print HG002.gt/ | grep 817185
2017-08-22 10:11:36 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
{"variant": {"contigName": "chr1", "start": 817185, "end": 817186, "names": [], "referenceAllele": "G", "alternateAllele": "A", "filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "annotation": {"ancestralAllele": null, "alleleCount": null, "readDepth": null, "forwardReadDepth": null, "reverseReadDepth": null, "referenceReadDepth": null, "referenceForwardReadDepth": null, "referenceReverseReadDepth": null, "alleleFrequency": null, "cigar": null, "dbSnp": null, "hapMap2": null, "hapMap3": null, "validated": null, "thousandGenomes": null, "somatic": false, "transcriptEffects": [], "attributes": {"callable": "CS_HiSeqPE300xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable", "filt": "CS_HiSeqMatePairSentieon_filt", "datasetsmissingcall": "10XChromium,IonExome,SolidSE75bp", "platformnames": "Illumina,CG", "callsetnames": "HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes", "datasets": "4", "callsets": "7", "datasetnames": "HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair", "platforms": "2"}}}, "contigName": "chr1", "start": 817185, "end": 817186, "variantCallingAnnotations": {"filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "downsampled": null, "baseQRankSum": null, "fisherStrandBiasPValue": null, "rmsMapQ": null, "mapq0Reads": null, "mqRankSum": null, "readPositionRankSum": null, "genotypePriors": [], "genotypePosteriors": [], "vqslod": null, "culprit": null, "attributes": {"IGT": "1/1", "ADALL": "0,362"}}, "sampleId": "HG002", "sampleDescription": null, "processingDescription": null, "alleles": ["ALT", "ALT"], "expectedAlleleDosage": null, "referenceReadDepth": 75, "alternateReadDepth": 603, "readDepth": 1131, "minReadDepth": null, "genotypeQuality": 312, "genotypeLikelihoods": [], "nonReferenceLikelihoods": [], "strandBiasComponents": [], "splitFromMultiAllelic": false, "phased": true, "phaseSetId": null, "phaseQuality": null}
zIRISz:adam fnothaft$ cat ~/IdeaProjects/adam/HG002.small.vcf | grep 817186
chr1	817186	.	G	A	50	PASS	platforms=2;platformnames=Illumina,CG;datasets=4;datasetnames=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=7;callsetnames=HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=10XChromium,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable;filt=CS_HiSeqMatePairSentieon_filt	GT:DP:ADALL:AD:GQ:IGT:IPS:PS	1|1:1131:0,362:75,603:312:1/1:.:PATMAT

@heuermh
Member

heuermh commented Aug 22, 2017

..."phased": true, "phaseSetId": null

If we've kept phased as true and removed the phaseSetId, then according to the VCF spec all genotypes in that file are considered phased and in the same phase set.

"All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set."
https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L473

zIRISz:adam fnothaft$ cat ~/IdeaProjects/adam/HG002.small.vcf | grep 817186
chr1 817186 . G A 50 PASS platforms=2;platfo...

Is this from the input VCF? I would imagine attempting to write as VCF or ADAM format will throw the same htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS as in #1566.

@heuermh
Member

heuermh commented Aug 22, 2017

For what it's worth, on the PS Type=String issue specifically, this works OK and allows ADAM to read and write after remapping:

$ dsh-bio remap-phase-set \
  -i truth_small_variants.vcf.gz \
  -o truth_small_variants.remapped_phase_sets.vcf.gz

$ adam-shell

scala> val genotypes = sc.loadGenotypes("truth_small_variants.vcf.gz")
java.lang.IllegalArgumentException: Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
  at org.bdgenomics.adam.converters.VariantContextConverter$.org$bdgenomics$adam$converters$VariantContextConverter$$auditLine$1(VariantContextConverter.scala:175)
...

scala> val genotypes = sc.loadGenotypes("truth_small_variants.remapped_phase_sets.vcf.gz")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD =
RDDBoundGenotypeRDD(MapPartitionsRDD[7] at flatMap at VariantContextRDD.scala:183,SequenceDictionary{...

scala> val g = genotypes.rdd.first()
g: org.bdgenomics.formats.avro.Genotype = {"variant": {"contigName": "chr1", "start": 817185, "end": 817186, "names": [], "referenceAllele": "G", "alternateAllele": "A", "filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "annotation": {"ancestralAllele": null, "alleleCount": null, "readDepth": null, "forwardReadDepth": null, "reverseReadDepth": null, "referenceReadDepth": null, "referenceForwardReadDepth": null, "referenceReverseReadDepth": null, "alleleFrequency": null, "cigar": null, "dbSnp": null, "hapMap2": null, "hapMap3": null, "validated": null, "thousandGenomes": null, "somatic": false, "transcriptEffects": [], "attributes": {"callable": "CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable", "filt": "CS_SolidSE75GATKHC_filt", "datas...

scala> g.getPhaseSetId
res1: Integer = 0

scala> g.getPhased
res2: Boolean = true

@fnothaft
Member Author

Yeah, that approach is fine for the GIAB files, which have a single, well-defined validation error and are reasonably small. That said, it's pretty easy to imagine someone putting together a VCF where many lines fail validation due to bad tags, or where multiple tags fail validation. As long as our default is strict validation, I don't think there's any harm in adding this, since the user will get an exception the first time they load the VCF and will have to choose to override the stringency.

@fnothaft
Member Author

I would think of it this way: if we validate at the tag level, users can load data that they know is bad and then write code in ADAM (similar to what you did in dsh) to fix the bad fields. If we don't validate at the tag level, they just lose the records.
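For instance, something like this (a sketch only: it assumes loadGenotypes takes a stringency argument and that a dropped PS tag leaves phaseSetId null; the replacement ID is a placeholder):

import htsjdk.samtools.ValidationStringency
import org.bdgenomics.formats.avro.Genotype

// Load leniently: records with bad PS tags are kept, the tags are dropped.
val genotypes = sc.loadGenotypes(
  "truth_small_variants.vcf.gz",
  stringency = ValidationStringency.LENIENT)

// Repair the dropped field downstream, e.g. by assigning a placeholder phase
// set ID to phased genotypes that lost theirs during conversion.
val fixed = genotypes.transform(rdd => rdd.map { g: Genotype =>
  if (g.getPhased && g.getPhaseSetId == null) {
    Genotype.newBuilder(g).setPhaseSetId(0).build()
  } else {
    g
  }
})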

As a compromise, I offer this: how about we go from one stringency field to two? We could have a first validation flag for tags (throw/warn/ignore for bad tags) and a second for records. This way, we can allow people to be really permissive (load records that they know have bad tags, either because they don't care about the tags or because they have code that can fix them) while keeping the system very conservative by default.
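Roughly (a hypothetical API sketch, just to make the shape concrete; neither field exists today):

import htsjdk.samtools.ValidationStringency

// Two independent knobs: one for bad INFO/FORMAT tags, one for records that
// fail validation outright.
case class VcfValidationStringency(
  tagStringency: ValidationStringency,    // throw / warn-and-drop / silently drop bad tags
  recordStringency: ValidationStringency) // the same choices for whole records

// Conservative default: fail fast on anything bad.
val default = VcfValidationStringency(
  ValidationStringency.STRICT,
  ValidationStringency.STRICT)

// Very permissive: keep the records, silently dropping bad tags.
val permissive = VcfValidationStringency(
  ValidationStringency.SILENT,
  ValidationStringency.LENIENT)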

@heuermh
Member

heuermh commented Aug 23, 2017

Throwing out either the offending record or the values for an offending tag across all records seems OK to me. Is that what you are suggesting?

@heuermh
Member

heuermh commented Aug 29, 2017

#1695 is an instance of this.

fnothaft added a commit to fnothaft/adam that referenced this issue Sep 12, 2017
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
fnothaft added a commit to fnothaft/adam that referenced this issue Sep 13, 2017
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
heuermh pushed a commit that referenced this issue Sep 14, 2017
Resolves #1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017