[ADAM-651] Hive-style partitioning of parquet files by genomic position #1878
Conversation
Test PASSed.
Thanks for cleaning up the rebase/merge stuff, @jpdna!
* @param pathName The path name to load alignment records from.
*   Globs/directories are supported.
* @param regions Optional list of genomic regions to load.
* @param addChrPrefix Flag to add "chr" prefix to contigs
I don't think this should be part of the API, and in fact simply adding or removing "chr" is not sufficient for converting between the different styles. See e.g. https://github.com/heuermh/dishevelled-bio/blob/master/tools/src/main/java/org/dishevelled/bio/tools/RenameReferences.java#L125 and below
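The kind of conversion referred to here can be sketched as follows. This is a minimal illustration with hypothetical names, not the dishevelled-bio implementation, and it covers only the mitochondrial special case beyond the plain prefix:

```scala
// Hypothetical sketch: converting between reference naming styles needs a
// real mapping, not just adding/removing "chr" (e.g. GRCh-style "MT" maps
// to UCSC-style "chrM", not "chrMT").
def renameReference(name: String, toChrStyle: Boolean): String = {
  if (toChrStyle) {
    name match {
      case "MT" => "chrM"
      case n if n.startsWith("chr") => n
      case n => "chr" + n
    }
  } else {
    name match {
      case "chrM" => "MT"
      case n if n.startsWith("chr") => n.stripPrefix("chr")
      case n => n
    }
  }
}
```

A full solution would also handle other accession-style names, which is why the linked RenameReferences code uses an explicit mapping table.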
Agreed, I don't like it in the API either.
I'll try to push any needed conversion into the application code (Mango): it can look at the sequence dictionary and decide whether a conversion is needed, so that by the time a ReferenceRegion makes it into ADAM code it already uses the correct contig name convention for the underlying source dataset.
@heuermh - I'll plan to use the replacement logic you pointed to - thanks!
+1 towards not including it and pushing it to user level. FYI, you can't link against dishevelled-bio as it is LGPL.
You can't link against dishevelled-bio as it is LGPL.
I'm the only copyright holder in dishevelled-bio, so I could relicense stuff in there if necessary. I don't think this bit is interesting enough to do so, and it only covered the one use case I was interested in. That's why I haven't submitted something identical as a solution for #1757.
* @param pathName The path name to load alignment records from.
*   Globs/directories are supported.
* @param regions Optional list of genomic regions to load.
* @param addChrPrefix Flag to add "chr" prefix to contigs
Remove as above
* @param pathName The path name to load alignment records from.
*   Globs/directories are supported.
* @param regions Optional list of genomic regions to load.
* @param addChrPrefix Flag to add "chr" prefix to contigs
Remove as above
* @param addChrPrefix Flag to add "chr" prefix to contigs
* @return Returns a FeatureRDD.
*/
Remove extra whitespace
}

datasetBoundFeatureRDD
Remove extra whitespace
.option("spark.sql.parquet.compression.codec", compressCodec.toString.toLowerCase())
.save(filePath)
writePartitionedParquetFlag(filePath)
//rdd.context.writePartitionedParquetFlag(filePath)
Remove commented out code
@@ -925,6 +925,33 @@ class FeatureRDDSuite extends ADAMFunSuite {
assert(rdd3.dataset.count === 4)
}
sparkTest("load paritioned parquet to sql, save, re-read from avro") { |
paritioned → partitioned
@@ -638,6 +638,41 @@ class AlignmentRecordRDDSuite extends ADAMFunSuite {
assert(rdd3.dataset.count === 20)
}
sparkTest("load from sam, save as partitioend parquet, and and re-read from partitioned parquet") { |
partitioend → partitioned
@@ -128,6 +128,15 @@ class GenotypeRDDSuite extends ADAMFunSuite {
assert(starts(752790L))
}
sparkTest("round trip to paritioned parquet") { |
paritioned → partitioned
"Options other than compression codec are ignored.")
val df = toDF()

df.withColumn("posBin", floor(df("start") / partitionSize))
"posBin" → "position" or "positionBin" or "bin"
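The binning scheme in the quoted diff comes down to plain arithmetic; this sketch assumes non-negative start coordinates and the default 1 Mbp partition size seen elsewhere in the PR:

```scala
// Sketch of the partition-bin computation: floor(start / partitionSize).
// For non-negative genomic coordinates, Long integer division is the floor.
def positionBin(start: Long, partitionSize: Int = 1000000): Long =
  start / partitionSize
```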
Test PASSed.
Thanks @jpdna! It looks like this is close to ready!
* @param pathName The path name to load alignment records from.
*   Globs/directories are supported.
* @param regions Optional list of genomic regions to load.
* @param addChrPrefix Flag to add "chr" prefix to contigs
val reads: AlignmentRecordRDD = ParquetUnboundAlignmentRecordRDD(sc, pathName, sd, rgd, pgs)

val datasetBoundAlignmentRecordRDD: AlignmentRecordRDD = regions match {
  case Some(x) => DatasetBoundAlignmentRecordRDD(reads.dataset.filter(referenceRegionsToDatasetQueryString(x)), reads.sequences, reads.recordGroups, reads.processingSteps)
Nit: break long lines.
* @param addChrPrefix Flag to add "chr" prefix to contigs
* @return Returns a VariantRDD
*/
Extra whitespace.
* @param addChrPrefix Flag to add "chr" prefix to contigs
* @return Returns an AlignmentRecordRDD.
*/
def loadPartitionedParquetAlignments(pathName: String, regions: Option[Iterable[ReferenceRegion]] = None): AlignmentRecordRDD = {
Anywhere you have Option[Iterable[ReferenceRegion]] = None, it should be Iterable[ReferenceRegion] = Iterable.empty.
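The suggested signature change can be illustrated side by side. These are simplified hypothetical signatures for illustration only, not the actual ADAM methods:

```scala
// Simplified stand-in for ADAM's ReferenceRegion (illustration only).
case class ReferenceRegion(referenceName: String, start: Long, end: Long)

// Option-wrapped style: every caller and implementation must unwrap.
def loadOptionStyle(regions: Option[Iterable[ReferenceRegion]] = None): Int =
  regions.map(_.size).getOrElse(0)

// Suggested style: an empty-collection default needs no unwrapping,
// and "no regions" is just the natural empty case.
def loadIterableStyle(regions: Iterable[ReferenceRegion] = Iterable.empty): Int =
  regions.size
```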
* @param addChrPrefix Flag to add "chr" prefix to contigs
* @return Returns a GenotypeRDD.
*/
def loadPartitionedParquetGenotypes(pathName: String, regions: Option[Iterable[ReferenceRegion]] = None): GenotypeRDD = {
See above comment RE: Option[Iterable[ReferenceRegion]] = None.
* @param addChrPrefix Flag to add "chr" prefix to contigs
* @return Returns a NucleotideContigFragmentRDD
*/
def loadPartitionedParquetFragments(pathName: String, regions: Option[Iterable[ReferenceRegion]] = None): NucleotideContigFragmentRDD = {
See above comment RE: Option[Iterable[ReferenceRegion]] = None.
* @return Return True if partitioned flag found, False otherwise.
*/

def checkPartitionedParquetFlag(filePath: String): Boolean = {
+1
*/

def checkPartitionedParquetFlag(filePath: String): Boolean = {
  val path = new Path(filePath, "_isPartitionedByStartPos")
Yeah, I'd suggest using the getFsAndFilesWithFilter function above. Behavior should be undefined if you have a glob but not all the paths are partitioned.
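The flag-file pattern under discussion can be sketched with plain java.io for illustration (hypothetical helper names); the actual code uses Hadoop's Path/FileSystem so it also works on HDFS and with globs:

```scala
import java.io.File

// Illustration of the marker-file convention: an empty file named
// "_isPartitionedByStartPos" in the output directory records that the
// Parquet data was written with the Hive-style partitioning scheme.
def writePartitionFlag(dir: File): Boolean =
  new File(dir, "_isPartitionedByStartPos").createNewFile()

def isPartitioned(dir: File): Boolean =
  new File(dir, "_isPartitionedByStartPos").exists()
```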
def referenceRegionsToDatasetQueryString(regions: Iterable[ReferenceRegion], partitionSize: Int = 1000000): String = {

  var regionQueryString = "(contigName=" + "\'" + regions.head.referenceName + "\' and posBin >= \'" +
This will throw if regions.isEmpty; suggest:

regions.map(r => {
  // logic for a single reference region goes here
}).mkString(" or ")
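Applied to the query-string builder, the map/mkString shape might look like this. The exact predicate (bin bounds, overlap semantics) is an assumption for illustration, not the code in this PR:

```scala
// Simplified stand-in for ADAM's ReferenceRegion (illustration only).
case class ReferenceRegion(referenceName: String, start: Long, end: Long)

// Build one disjunct per region and join with "or"; an empty regions
// list now yields an empty string instead of throwing on regions.head.
def regionsToQuery(regions: Iterable[ReferenceRegion],
                   partitionSize: Int = 1000000): String =
  regions.map { r =>
    val startBin = r.start / partitionSize
    val endBin = r.end / partitionSize
    s"(contigName = '${r.referenceName}' and posBin >= $startBin and " +
      s"posBin <= $endBin and start < ${r.end} and end > ${r.start})"
  }.mkString(" or ")
```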
def writePartitionedParquetFlag(filePath: String): Boolean = {
  val path = new Path(filePath, "_isPartitionedByStartPos")
  val fs = path.getFileSystem(toDF().sqlContext.sparkContext.hadoopConfiguration)
+1, should just be rdd.context.hadoopConfiguration
Test PASSed.
Force-pushed 02abc77 to 2a4c022.
Test PASSed.
I believe I have addressed the reviewer requests above, except for the following discussed below:
I'll go ahead and start making those changes, and testing Mango with them, but I'd like to get the rest of this PR through a second pass in parallel. OR - as we do need to get this merged sooner rather than later for Mango - what if we leave the
Test PASSed.
@@ -857,4 +888,5 @@ class NucleotideContigFragmentRDDSuite extends ADAMFunSuite {

checkSave(variantContexts)
}
Remove line
done
assert(sequenceRdd.sequences.containsRefName("aSequence"))
}

val inputPath = testFile("small.1.bed")
There are a lot of asserts here. Can you comment their purpose or break them into separate tests?
Done; removed intermediate-step asserts, which were redundant.
genotypes.saveAsPartitionedParquet(outputPath)
val unfilteredGenotypes = sc.loadPartitionedParquetGenotypes(outputPath)
assert(unfilteredGenotypes.rdd.count === 18)
remove line
done
assert(unfilteredVariants.rdd.count === 6)
assert(unfilteredVariants.dataset.count === 6)

val regionsVariants = sc.loadPartitionedParquetVariants(outputPath, List(ReferenceRegion("2", 19000L, 21000L), ReferenceRegion("13", 752700L, 752750L)))
line break
done
@jpdna it may be good to update the Mango PR bigdatagenomics/mango#344 and make sure we have all the functionality we need in Mango included in this PR.
Test PASSed.
Jenkins, retest this please.
Test FAILed. Build result: FAILURE. (All four ADAM-prb matrix builds, Hadoop 2.6.2/2.7.3 with Scala 2.10/2.11, failed; truncated Jenkins log omitted.)
Force-pushed: removed addChrPrefix parameter; addressed PR comments (parts 1-3); fixed nits; rebased; cleaned up whitespace and redundant asserts; fixed isPartitioned.
Force-pushed 309a49f to 23a3bcc.
Test PASSed.
Ping for further review.
*
* @param filePath Path to save the file at.
*/
def writePartitionedParquetFlag(filePath: String): Boolean = {
should this be private?
Agreed, done.
Test PASSed.
As discussed earlier, here are two alternatives for style changes for the load methods:

Always return dataset bound RDD (always-return-dataset-bound.patch.txt):

def loadPartitionedParquetAlignments(
    pathName: String,
    regions: Iterable[ReferenceRegion] = Iterable.empty): AlignmentRecordRDD = {
  require(isPartitioned(pathName), s"Input Parquet files ($pathName) are not partitioned.")
  val reads = loadParquetAlignments(pathName, optPredicate = None, optProjection = None)
  val dataset = if (regions.nonEmpty) {
    reads.dataset.filter(referenceRegionsToDatasetQueryString(regions))
  } else {
    reads.dataset
  }
  DatasetBoundAlignmentRecordRDD(dataset, reads.sequences, reads.recordGroups, reads.processingSteps)
}

Return unbound or dataset bound RDD (return-unbound-or-dataset-bound.patch.txt):

def loadPartitionedParquetAlignments(
    pathName: String,
    regions: Iterable[ReferenceRegion] = Iterable.empty): AlignmentRecordRDD = {
  val reads = loadParquetAlignments(pathName, optPredicate = None, optProjection = None)
  val filteredReads = if (regions.nonEmpty) {
    require(isPartitioned(pathName), s"Input Parquet files ($pathName) are not partitioned.")
    DatasetBoundAlignmentRecordRDD(
      reads.dataset.filter(referenceRegionsToDatasetQueryString(regions)),
      reads.sequences,
      reads.recordGroups,
      reads.processingSteps
    )
  } else {
    reads
  }
  filteredReads
}
Sorry, GitHub isn't allowing me to upload the referred-to patches; I'll send them via email.
Thanks @heuermh!
…Parquet return DatasetBound
Test PASSed.
We must not have good enough unit test coverage then, because both patches passed all unit tests. :)
Test FAILed. Build result: FAILURE. (All four ADAM-prb matrix builds, Hadoop 2.6.2/2.7.3 with Scala 2.10/2.11, failed; truncated Jenkins log omitted.)
FYI - I have a working branch where the filtering has been moved into a
Replaced by #1911
Fixes #651
Manually merged changes for the "hive-style" partitioning branch as a single commit on top of master.