Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1011] Refactor to add GenomicRDDs for all Avro types #1051

Merged
merged 13 commits into from
Jul 18, 2016

Conversation

fnothaft
Copy link
Member

A bit of work towards #1011. I'll be dropping the rest of #1011 onto here. Not ready for merge yet, but ready for first review.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1275/
Test PASSed.

replaceRdd(tFn(rdd))
}

protected def replaceRdd(newRdd: RDD[T]): U
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain (again, sorry) what transform and replaceRdd are for?

In the context of #1040, I was thinking the pattern would be

val foos: FooRDD = sc.loadFoo("...")
val rdd: RDD[Foo] = foos.rdd
rdd.map(//... general Spark RDD methods
foos.adamSpecificMethods() // ADAM-specific methods from FooRDD
foos.saveAsBar()

I.e. that the caller wouldn't mix RDD and FooRDD method calls. If I understand correctly, transform allows the caller to

foos.adamSpecificMethod().transform(_.map(/*... general Spark RDD methods */))

Is that a common enough pattern in scala? Will this be obvious to clients of our APIs? Will this work when adapted for Java APIs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This riffs off the spark.ml pipelines work: http://spark.apache.org/docs/latest/ml-guide.html#transformers

In:

val foos: FooRDD = sc.loadFoo("...")
val rdd: RDD[Foo] = foos.rdd
rdd.map(//... general Spark RDD methods
foos.adamSpecificMethods() // ADAM-specific methods from FooRDD
foos.saveAsBar()

You'd need to create a new FooRDD with the product of rdd.map(...).

@fnothaft
Copy link
Member Author

Ping @jpdna @akmorrow13 @ryan-williams for review

@akmorrow13
Copy link
Contributor

+1

fnothaft added 5 commits July 7, 2016 17:38
* Add NucleotideContigFragmentRDD to replace RDD[NucleotideContigFragment] and
  updated various ADAMContext functions.
* Added test suite for Fasta2ADAM to validate round trip loading.
* Added Feature and Gene RDDs and updated loading functions.
* Modified Features2ADAMSuite to use sc.loadFeatures to test end-to-end
  conversion.
* Added FragmentRDD to replace RDD[Fragment]; refactored load/save methods
* Moved toFragments method from implicit on RDD[AlignmentRecord] to AlignmentRecordRDD
* Added `-single` option to Fragments2Reads.
* Added test suites for Fragments2Reads and Reads2Fragments.
@fnothaft fnothaft force-pushed the all-genomic-rdds branch from 5af2659 to a4a94ff Compare July 7, 2016 21:38
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1330/
Test PASSed.

* Made region join implementations package private to `org.bdgenomics.adam.rdd`.
* Added support for `broadcastRegionJoin`, `shuffleRegionJoin`, and
  `filterByOverlappingRegion` to `GenomicRDD`.
* Added private `GenericGenomicRDD` class.
* Cleaned up `CalculateDepth`, which used the old `broadcastRegionJoin`
  implementation.
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1331/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1332/

Build result: FAILURE

GitHub pull request #1051 of commit 5fc859b automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1051/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 89c7196aa36e4344beceac2be6d127015b394ede # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1051/merge^{commit} # timeout=10Checking out Revision 89c7196aa36e4344beceac2be6d127015b394ede (origin/pr/1051/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 89c7196aa36e4344beceac2be6d127015b394edeFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the all-genomic-rdds branch from 5fc859b to 6ba6686 Compare July 8, 2016 19:19
@fnothaft
Copy link
Member Author

fnothaft commented Jul 8, 2016

Poof! AlignmentRecordRDDFunctions is gone!

@heuermh your hunch about this making the Java API much easier to implement seems to be accurate.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1333/
Test PASSed.

@fnothaft
Copy link
Member Author

fnothaft commented Jul 8, 2016

@heuermh see bd56167 for the compound genomic RDD bit we were talking about.

@fnothaft
Copy link
Member Author

fnothaft commented Jul 9, 2016

Just org.bdgenomics.adam.rdd.variation to go... So. Close.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1334/
Test PASSed.

@fnothaft
Copy link
Member Author

fnothaft commented Jul 9, 2016

Almost done! Have all the *RDDFunctions classes factored out, I'm going to finish #1040 tomorrow morning.

@fnothaft
Copy link
Member Author

fnothaft commented Jul 9, 2016

I would say that this is good for final review. The work for #1040 is going to be a single commit and will not be very intrusive. It will touch a lot of files, but not in any sort of important way, and it mostly touches test suites.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1335/
Test PASSed.

@fnothaft fnothaft force-pushed the all-genomic-rdds branch from 1218d3f to 5cd09f5 Compare July 9, 2016 13:38
@fnothaft
Copy link
Member Author

fnothaft commented Jul 9, 2016

All done! Resolves #1011 and #1040. Please review.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1336/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Jul 9, 2016

Nice work! This is a big one, will take a day or two to review.

@tdanford @jpdna @akmorrow13 @ryan-williams @laserson @massie Calling in for backup!

@fnothaft
Copy link
Member Author

fnothaft commented Jul 9, 2016

This is a big one, will take a day or two to review.

Yeah, with a change this big, I definitely want to get it right. Perhaps we can target having it reviewed by Wednesday in time for the next BDG standup?

@jpdna
Copy link
Member

jpdna commented Jul 14, 2016

+1 on this based on my review so far.
Are there any specific aspects to this PR we feel need further attention/review?

@fnothaft fnothaft force-pushed the all-genomic-rdds branch 2 times, most recently from 5aae6a8 to 9eee9ca Compare July 18, 2016 21:33
@fnothaft
Copy link
Member Author

@heuermh all review comments should be addressed. Thanks for making a last pass!

fnothaft added 7 commits July 18, 2016 14:33
* Removed `AlignmentRecordRDDFunctions` and moved all functions into
  `AlignmentRecordRDD`. Moved test suite along with the main class move.
* Removed `countTag` functions and `PrintTags` CLI.
* Removed `JavaAlignmentRecordRDD`. The Java API just uses the
  `AlignmentRecordRDD` now. To do this, I moved the Java friendly save methods
  and JavaRDD to `GenomicRDD`.
* Removed `NucleotideContigFragmentRDDFunctions` and moved all functions to
  `NucleotideContigFragmentRDD`. Moved test suite to accompany refactor.
* Added Java helper functions, and added functions for loading sequence data
  from the `JavaADAMContext`.
* Refactored JavaADAMContext test code to make it easier to test Java API.
* Fixed `ADAMContext.loadSequences` so that it correctly identifies `.fa.gz`
  and `.fasta.gz` file extensions.
* Fixed `ADAMFunSuite.copyResource` so that it correctly copies resource files
  with compound file extensions.
* Adding documentation to FragmentRDD.
* Added Java-friendly save method to FragmentRDD.
* Added method for loading Fragments from the `JavaADAMContext`.
* Removed `FeatureRDDFunctions` class and moved all methods into `FeatureRDD`.
  Renamed test suite to match, and added Java helper functions.
* Moved all `adam-*/src/test/resources/features/*` files into their respective
  `resources` directory. This is necessary due to an issue when using
  `copyResources` with subdirectories.
* Removed `ignore` on test in `org.bdgenomics.adam.cli.Features2ADAMSuite`.
* Removed `VariationRDDFunctions` and moved remaining methods to
  `VariantContextRDD`. Moved test suite code as well.
* Removed abstract class `ADAMSequenceDictionaryRDDAggregator`.
* Added Java helper methods to various variation RDDs and accompanying loader
  methods in `JavaADAMContext`.
* Miscellaneous documentation cleanup in `org.bdgenomics.adam.rdd.variation`.

Resolves bigdatagenomics#1011.
@fnothaft fnothaft mentioned this pull request Jul 18, 2016
6 tasks
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1343/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1344/
Test PASSed.

@heuermh heuermh changed the title Refactor to add GenomicRDDs for all Avro types [ADAM-1011] Refactor to add GenomicRDDs for all Avro types Jul 18, 2016
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1346/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.3.1,centosADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

Jenkins, retest this please.

@heuermh
Copy link
Member

heuermh commented Jul 18, 2016

Did the pull request rename retrigger J enkins? sorry if so . . .

@fnothaft
Copy link
Member Author

Seems so! Alas.

@fnothaft
Copy link
Member Author

Can't run too many builds, though, right?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1347/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Jul 18, 2016

LGTM. Shall I merge now or did you want to squash some commits first?

@fnothaft
Copy link
Member Author

@heuermh I think I'd like to keep the history as is here. I'm cool with this one being merged via the good ol' green merge button.

@heuermh
Copy link
Member

heuermh commented Jul 18, 2016

<click>

@heuermh heuermh merged commit 810806c into bigdatagenomics:master Jul 18, 2016
@heuermh
Copy link
Member

heuermh commented Jul 18, 2016

Thank you, @fnothaft!

@fnothaft
Copy link
Member Author

w00t! Thank you, as well @heuermh!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants