Merge VariantAnnotation and DatabaseVariantAnnotation records #1250

    @@ -267,7 +267,7 @@ class ADAMContext private (@transient val sc: SparkContext) extends Serializable
      * @tparam T The type of records to return
      * @return An RDD with records of the specified type
      */
    - private[rdd] def loadParquet[T](
    + def loadParquet[T](

I had to make this public again for unit tests in package `o.b.a.projections`. It also allows for loading user-defined schema (e.g. extensions to bdg-formats) from Avro-in-Parquet files.

Perhaps let's make it `private[adam]`?

It seems like this would be useful outside of ADAM. I haven't fully thought through the use case though: someone wants to add a new schema record `Foo`, they extend `ADAMKryoRegistrator` to register, then extend `ADAMContext` to add their new `loadFoo` method, which presumably would call `loadParquet` for `foo.adam` paths. If `Foo` has a sequence dictionary or samples, those Avro metadata methods would also be useful, and extending from `GenomicRDD` and friends needs to be possible.

I don't disagree, but I'd rather keep these private until someone knocks on our door asking to make them public. My philosophy here is simply that it is easier to make private interfaces public than it is to make public interfaces private. That being said, this is a weak preference: if you feel strongly about it, I'm OK with making it public, esp. since `loadParquet` has been public previously.

Test PASSed.

This looks awesome. I've dropped a variety of suggestions and nits inline. Do we have a VCF with proper ANN fields that we could pull in and load in `org.bdgenomics.adam.rdd.ADAMContextSuite` and then save back out? I think that's a good round trip test that we should add.

Also, I think we can punt the next thing to a later PR, but I think we could probably autogen the test data (and more tests) for the `*FieldSuite`s. Adding them is a massive step forward though. Thanks for pushing those in as well!

    @@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

      val te = TranscriptEffect.newBuilder()
      setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
    + // note: annotationImpact is not mapped
I don't get this comment; can you flesh it out more?

The `annotationImpact` field (and variable above) is output by SnpEff version 4.2 but is not part of the VCF ANN specification, so I did not include it in our `TranscriptEffect` schema.
That makes sense, can you add that inline?

    @@ -110,6 +147,7 @@ object VariantAnnotations extends Serializable with Logging {

      val te = TranscriptEffect.newBuilder()
      setIfNotEmpty(alternateAllele, te.setAlternateAllele(_))
    + // note: annotationImpact is not mapped
      if (!effects.isEmpty) te.setEffects(effects.asJava)
effects.nonEmpty

    @@ -132,26 +170,98 @@ object VariantAnnotations extends Serializable with Logging {
      Seq(te.build())

Unrelated to this PR, as this line is unchanged, but whenever possible, I prefer `Iterable` to `Seq` unless you need random lookup by index.

      variant: Variant,
      vc: VariantContext,
    - stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
    + stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {

Instead of returning `Option[List[TranscriptEffect]]` I would just return `List[TranscriptEffect]`. If you would return a `None`, I would just return a `List.empty` instead.

That would make my brain hurt less. The thought is that elsewhere it matters whether this field has been set, so checking `Option` seemed more correct than checking for an empty list.
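
To make the trade-off in this thread concrete, here is a minimal sketch of the two return types. `Effect`, `AnnSemantics`, and the toy `allele|effect` parser are hypothetical stand-ins, not the PR's `TranscriptEffect` or `parseAnn`; `Missing` mirrors htsjdk's `VCFConstants.MISSING_VALUE_v4`, which is `"."`.

```scala
// Hypothetical stand-in for the Avro TranscriptEffect record.
final case class Effect(alternateAllele: String, effect: String)

object AnnSemantics {
  // Mirrors VCFConstants.MISSING_VALUE_v4 in htsjdk.
  val Missing = "."

  // Toy parser (assumption): comma-separated "allele|effect" entries.
  def parse(attr: String): List[Effect] =
    attr.split(",").toList.map { entry =>
      val fields = entry.split("\\|")
      Effect(fields(0), fields(1))
    }

  // Option-returning form: None means the attribute was not set at all.
  def effectsOpt(attr: String, alt: String): Option[List[Effect]] =
    if (attr == Missing) None
    else Some(parse(attr).filter(_.alternateAllele == alt))

  // List-returning form: "not set" and "no matching effect" both become Nil.
  def effectsList(attr: String, alt: String): List[Effect] =
    if (attr == Missing) List.empty
    else parse(attr).filter(_.alternateAllele == alt)
}
```

With the `Option` form a caller can tell an unset ANN attribute (`None`) from one that was present but had no effect for this allele (`Some(Nil)`); the `List` form returns `Nil` in both cases, which is simpler when that distinction is not needed.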

    if (attr == VCFConstants.MISSING_VALUE_v4) {
      None
    } else {
      val filtered = parseAnn(attr, stringency).filter(_.getAlternateAllele == variant.getAlternateAllele)

If you make the above change, then the if-else clause here just becomes:

    if (attr == VCFConstants.MISSING_VALUE_v4) {
      List.empty
    } else {
      parseAnn(attr, stringency)
        .filter(_.getAlternateAllele == variant.getAlternateAllele)
    }

Also, I would break at the `.filter`, because that line is a bit long.
Will be adding try catch with validation stringency here shortly...

    val numOpt = Option(numerator)
    val denomOpt = Option(denominator)

    val sb = StringBuilder.newBuilder

I think this code would be a bit cleaner with a match:

    (numOpt, denomOpt) match {
      case (Some(n), Some(d)) => {
        "%d/%d".format(n, d)
      }
      case (None, None) => {
        ""
      }
      case _ => {
        // validate/throw?
        if (validationStringency == ValidationStringency.STRICT) {
          throw new IllegalArgumentException("Incorrect fractional value in %s.".format(te))
        } else if (validationStringency == ValidationStringency.LENIENT) {
          log.warn("Incorrect fractional value in %s.".format(te))
        }
        ""
      }
    }

Also, I would either make this package private/private, or move it inside of `toAnn`, which I think is the only place it is used.

I thought it was already private since it is nested in `convertToVcfInfoAnnValue`? Still have some to learn about visibility in Scala. The tuple of options is cleaner. (I can't believe I just said that)
Ah yes, you are right RE: protection; I had missed the nesting.
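
A self-contained version of the match suggested earlier in this thread might look like the following. It is only a sketch: `Stringency` and its three cases are local stand-ins for htsjdk's `ValidationStringency`, `Fractions` is a hypothetical holder object, and the warning is written to stderr rather than through the project's logger.

```scala
// Local stand-in for htsjdk.samtools.ValidationStringency (assumption).
sealed trait Stringency
case object Strict extends Stringency
case object Lenient extends Stringency
case object Silent extends Stringency

object Fractions {
  def toFraction(
    numerator: java.lang.Integer,
    denominator: java.lang.Integer,
    stringency: Stringency = Strict): String = {
    // Option(null) is None, so the nullable boxed integers are never unboxed.
    (Option(numerator), Option(denominator)) match {
      case (Some(n), Some(d)) => "%d/%d".format(n, d)
      case (None, None)       => ""
      case _ =>
        // Exactly one side is missing: handle according to stringency.
        val msg = "Incorrect fractional value: %s/%s.".format(numerator, denominator)
        stringency match {
          case Strict =>
            throw new IllegalArgumentException(msg)
          case Lenient =>
            Console.err.println(msg)
            ""
          case Silent =>
            ""
        }
    }
  }
}
```

For example, `Fractions.toFraction(3, 10)` yields `"3/10"`, while `Fractions.toFraction(3, null, Lenient)` logs a warning and yields an empty string.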

    @@ -17,18 +17,12 @@
     */
    package org.bdgenomics.adam.projections

    import org.bdgenomics.formats.avro.DatabaseVariantAnnotation
    import org.bdgenomics.formats.avro.Contig

OOC, why does this show up as a file move? Any thoughts? Maybe it's just GitHub being funky.

    @@ -60,20 +60,19 @@ case class VariantContextRDD(rdd: RDD[VariantContext],
      * @param ann Annotation RDD to join against.
      * @return Returns a VariantContextRDD where annotations have been filled in.
      */
    - def joinDatabaseVariantAnnotation(ann: DatabaseVariantAnnotationRDD): VariantContextRDD = {
    + def joinVariantAnnotations(ann: VariantAnnotationRDD): VariantContextRDD = {
      replaceRdd(rdd.keyBy(_.variant)

We might want to open a ticket for this, but after #1216 this should probably be implemented using a region join instead of a Spark core `leftOuterJoin`.
Created new issue #1259
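
For readers unfamiliar with the join being discussed, the key-equality semantics of a left outer join can be sketched with plain Scala collections standing in for RDDs. Everything here is hypothetical (`Variant`, `Annotation`, `Context`, `JoinSketch`), and the sketch assumes at most one annotation per variant; it is not the PR's implementation.

```scala
// Hypothetical minimal records; the real ones are Avro classes from bdg-formats.
final case class Variant(contig: String, start: Long, ref: String, alt: String)
final case class Annotation(variant: Variant, geneName: String)
final case class Context(variant: Variant, annotation: Option[Annotation] = None)

object JoinSketch {
  // Left outer join by exact variant key: every context is kept, and the
  // annotation is filled in only when one with an identical key exists.
  def joinAnnotations(contexts: Seq[Context], anns: Seq[Annotation]): Seq[Context] = {
    val byVariant: Map[Variant, Annotation] = anns.map(a => a.variant -> a).toMap
    contexts.map(c => c.copy(annotation = byVariant.get(c.variant)))
  }
}
```

A region join would instead match records whose genomic coordinates overlap, rather than requiring the keys to be equal.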

    import com.google.common.collect.ImmutableList
    import htsjdk.samtools.ValidationStringency
    import htsjdk.variant.vcf.VCFConstants
    import htsjdk.variant.variantcontext.VariantContext

Nit: `htsjdk.variant.vcf` after `htsjdk.variant.variantcontext`

Test PASSed.

Test PASSed.

Test FAILed.

    Build result: FAILURE
    [...truncated 3 lines...]
    Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
    Wiping out workspace first.
    Cloning the remote Git repository
    Cloning repository https://github.com/bigdatagenomics/adam.git
     > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
    Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
     > /home/jenkins/git2/bin/git --version # timeout=10
     > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15
     > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
     > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
     > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
    Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
     > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
     > /home/jenkins/git2/bin/git rev-parse origin/pr/1250/merge^{commit} # timeout=10
     > /home/jenkins/git2/bin/git branch -a --contains 7eff061 # timeout=10
     > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1250/merge^{commit} # timeout=10
    Checking out Revision 7eff061 (origin/pr/1250/merge)
     > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
     > /home/jenkins/git2/bin/git checkout -f 7eff06161dcee656f3c48996818a95cb92e96267
    First time build. Skipping changelog.
    Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
    Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
    Touchstone configurations resulted in FAILURE, so aborting...
    Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

Test FAILed.

@heuermh - is it reasonable / useful for me to try to build the code in this PR locally to test it out at this point? I tried to compile it, but can't seem to find a version of bdg-formats that works with it. I tried both

@jpdna As is, this branch will not compile due to the filter-related changes in bdg-formats. I've made the code changes locally but they need more unit tests. I'll push these in a commit tomorrow morning.

    @@ -143,6 +143,5 @@ class VariantContext(
      val position: ReferencePosition,
      val variant: RichVariant,
      val genotypes: Iterable[Genotype],
    - val databases: Option[DatabaseVariantAnnotation] = None) {
    + val databases: Option[VariantAnnotation] = None) {
"databases" seems kind of a strange name for this field now to me, I might prefer "annotations".
+1, databases was always a kinda strange name, but it's definitely weird now!
nice catch! fixed

Pushed new commits that fix the separate variant and genotype filters issue and update bdg-formats to the release version 0.10.0. I implemented the filter stuff to the best that htsjdk makes available to us; I could either continue to hack on it so that

Test PASSed.

Fixes #194
1 small nit on the filters, otherwise LGTM

    val copy = VariantCallingAnnotations.newBuilder(annotations)
    // htsjdk does not provide a field filtersWereApplied for genotype as it does in VariantContext
    // we might be able to calculate it by querying the FT FORMAT field value directly
    copy.setFiltersApplied(true)

I think this would work:

    g.getAnyAttribute("FT") != null

Unfortunately it does not:
https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/variantcontext/Genotype.java#L560

And careful, `FT` is considered a forbidden key :)
https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/variantcontext/Genotype.java#L660
Can we create an issue to track the upstream htsjdk issue?
Created new issue #1269

I tried to load this VCF: It seems to fail, but without an error message

This VCF does seem to load into a VariantRDD with loadVCF fine. I suspect there is something unexpected about the format of my VCF file ANN field, but if this is current SnpEff output then it could be a problem for some users. Can you point me to a test VCF with an ANN field that is working properly that I can compare to?

@jpdna do you get any error/warning messages in the logs? If you have

Where do I set `ValidationStringency.LENIENT`?

@jpdna It might be hard to follow, since things are spread over several issues, but this pull request does not yet support populating

Sure, but even then @jpdna should be getting one VariantAnnotation record per Variant, no?

Maybe, I don't know how well that part of the code works. Based on this and recent conversations on gitter (same issue apparently), not too well?

Ah, thanks for clarifying @heuermh - I'll plan to watch this PR then for the further commits and try my test again when you ping that reading ANN field into transcriptEffects is ready. Perhaps some rows of the VCF I linked to above can be useful in the test suite - both a VEP and SnpEff derived example annotated VCF would be good.

Just two small changes: `parseAndFilter` should be private and there's still an Option related NPE issue in `convertToVcfInfoAnnValue`. Can you clean these two up and I will merge this PR manually?

     */
    def convertToVcfInfoAnnValue(effects: Seq[TranscriptEffect]): String = {
      def toFraction(numerator: java.lang.Integer, denominator: java.lang.Integer): String = {
        val numOpt = Option(numerator)
This NPE with Option types still needs to be fixed.

    - stringency: ValidationStringency = ValidationStringency.STRICT): VariantAnnotation = {
    + stringency: ValidationStringency = ValidationStringency.STRICT): Option[List[TranscriptEffect]] = {

      def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {
This method should be private.

    error: illegal start of statement (no modifiers allowed here)
    [ERROR] private def parseAndFilter(attr: String): Option[List[TranscriptEffect]] = {
Ah, sorry, I misread this and didn't notice that it is nested inside another function.
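
The compiler error quoted in this thread follows from how Scala scopes local definitions. A minimal sketch, with hypothetical names (`LocalHelpers`, `describe`, `render`) unrelated to the PR's code:

```scala
object LocalHelpers {
  def describe(values: Seq[Int]): String = {
    // A local def is scoped to the enclosing method body, so callers of
    // describe cannot reach it; writing `private def` here does not compile.
    def render(v: Int): String = "<" + v + ">"
    values.map(render).mkString(",")
  }
}
```

Because `render` cannot be referenced outside `describe`, no access modifier is needed, which is why nesting `parseAndFilter` already gives the protection the reviewer asked for.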

    loadVcf(filePath).toDatabaseVariantAnnotationRDD
    def loadVcfAnnotations(
      filePath: String): VariantAnnotationRDD = {
      loadVcf(filePath).toVariantAnnotationRDD

Just for tracking RE @jpdna's comment about not getting any annotations from a VCF, this line is the culprit. Specifically, `loadVcf` right now just parses the `Genotype`s. We should make the `VariantContextConverter` parse out the annotations by default in the follow on PR.
No changes necessary in this PR, just an FYI.
Also, @jpdna it'd be great to add some unit tests that use that file and try to load a few ANN fields. That should be an acceptance test for the release. Would you be able to do that?
Running SnpEff on the VCF files we are already using for unit tests ends up being not too interesting, with all intragenic variants. It might take a little thinking to generate a more useful VCF file, say with variants right at intron/exon boundaries of a gene with a lot of splice variants, for example.

Pushed commit with some additional unit tests. Let me know if I've addressed all the review comments, and thank you for volunteering to merge this manually.

LGTM now! I will merge this manually shortly.

Test PASSed.

12f245c to c06143b

Woot! Thank you, @fnothaft!

Supersedes #1144