upgrade to Spark 1.3.1 #690

Closed

Conversation

ryan-williams
Member

seems smooth, but a couple old jackson deps needed excluding

fixes #665

@ryan-williams
Member Author

[ERROR] Non-resolvable parent POM: Could not find artifact org.bdgenomics.adam:adam-parent:pom:0.16.1-SNAPSHOT and 'parent.relativePath' points at wrong local POM @ line 4, column 13 -> [Help 2]

some kind of POM error? It doesn't seem like it's my fault, from what I can tell…

@fnothaft fnothaft added this to the 0.17.0 milestone Jun 1, 2015
@@ -473,6 +473,16 @@
<artifactId>aws-java-sdk</artifactId>
Member

Are we actually using this dependency right now? I don't think we are.

Member

Might be preferable to just remove the dependency (as opposed to excluding things).

Member Author

side note: GitHub is doing a really unfortunate thing and showing your old comments next to my new diffs

@ryan-williams
Member Author

rebased

@ryan-williams
Member Author

let me try removing that dep

@ryan-williams
Member Author

removed the dep, and excluded it from being brought in by utils-io.

on second thought, that might not be the correct thing to do? does utils-io actually need that dep? maybe I should just more narrowly exclude the jackson.core that comes with utils-io's AWS dep?

@ryan-williams
Member Author

I'm going to do what I said there and switch to just excluding the jackson things, not the aws-java-sdk, since utils-io does seem to use it.
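
For reference, the narrower exclusion being described would look roughly like the sketch below. The exact coordinates (the utils-io artifact, and the Jackson pieces its aws-java-sdk drags in) are assumed from this discussion and the commit message further down, not copied from the merged diff:

    <!-- sketch only: keep utils-io's aws-java-sdk, but exclude the stale
         Jackson it pulls in transitively so Spark 1.3.1's Jackson version
         wins. Coordinates are assumed, not taken from the actual diff. -->
    <dependency>
      <groupId>org.bdgenomics.utils</groupId>
      <artifactId>utils-io_2.10</artifactId>
      <version>${utils.version}</version>
      <exclusions>
        <exclusion>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-annotations</artifactId>
        </exclusion>
      </exclusions>
    </dependency>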

@ryan-williams
Member Author

In other news, I can't get a break with Jenkins today!

The only thing that looks like an error is:

[ERROR] Error formatting /home/jenkins/workspace/ADAM-prb/SCALAVER/2.10/hadoop.version/1.0.4/label/centos/adam-cli/src/main/scala/org/bdgenomics/adam/cli/PrintADAM.scala: java.nio.charset.MalformedInputException: Input length = 1

Link.

I've seen that a bunch of times today on different PRs, no idea what it means.

@ryan-williams
Member Author

OK, I think this is a better version!

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams

I'm going to do what I said there and switch to just excluding the jackson things, not the aws-java-sdk, since utils-io does seem to use it.

That sounds reasonable here. I've opened https://github.com/bigdatagenomics/utils/issues/41 to remove the AWS SDK from utils-io. We do need the dependency right now, but I believe the piece of code that we need it for is dead.

@fnothaft
Member

fnothaft commented Jun 2, 2015

OK, I think this is a better version!

+1, agreed!

@ryan-williams
Member Author

If the code that uses it in utils-io is never exercised (which is what I interpreted "dead code" to mean), then the coarser exclusion might have "worked" here, but yeah, this seems safer/better.

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams yeah, that's what I was meaning. The utils-io code that depends on the AWS SDK was part of the old S3 Parquet loader, which doesn't exist anymore. I agree though that this approach is safer.

@ryan-williams
Member Author

Jenkins!

What does it mean?

@fnothaft
Member

fnothaft commented Jun 2, 2015

Jenkins, retest this please.

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams I don't think the error you're seeing in:

Jenkins!

What does it mean?

is the error. I'm seeing:

2015-06-01 17:59:57 WARN  LoadSnappy:46 - Snappy native library not loaded
2015-06-01 17:59:57 ERROR Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
    at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-06-01 17:59:57 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
    at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-06-01 17:59:57 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
    at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2015-06-01 17:59:57 ERROR TaskSetManager:75 - Task 0 in stage 0.0 failed 1 times; aborting job

This is only for the Hadoop 1.0.4 builds. I'm wondering if this is an issue in Spark?

@fnothaft
Member

fnothaft commented Jun 2, 2015

I'm going to check this out locally and run with 1.0.4 and see if this failure reproduces.

@fnothaft
Member

fnothaft commented Jun 2, 2015

Ahhhh... this fails locally (with the error Jenkins is reporting) for me when running:

mvn package -Dhadoop.version=1.0.4

Surely, they didn't break Hadoop 1.0.4 support in Spark 1.3.1, did they?

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams I'm hoping to cut a 0.17.0 release tomorrow. Do you want to check and see if you can resolve the Spark 1.3.1/Hadoop 1.0.4 issue by tomorrow? If you can't, shall we slip this to 0.18.0?

@ryan-williams
Member Author

good call, thanks for catching. I'll see what I can do, though not sure how much time I'll have :-\

@fnothaft
Member

fnothaft commented Jun 2, 2015

No worries. We can take more of a hack at this later.

@ryan-williams
Member Author

So, I am tentatively thinking that depending on Spark 1.3.1 rules out building against Hadoop 1.0.4. This user@spark thread is relevant, and I just responded, cc'ing @fnothaft, explaining the failure we're seeing and linking to this PR.

Spark 1.3.1 is "provided" and we are not bringing in the Hadoop 2.2 dependency that is hard-coded into it, which means that any Spark code paths we exercise run the risk of using bits of Hadoop 2 that might not have existed in a compatible form in Hadoop 1.0.4.

I suspect there's a way to shade Hadoop 2.2 into Spark so that this would work for us, but I assume that would have to be done in Spark's POM; I've not seen Maven functionality for shading an arbitrary version of a dep's dep into said dep, if that makes sense.

Curious to hear others' thoughts! How bad is the prospect of dropping Hadoop 1.0.4 support?

  1. ADAM-1.0 bad?
  2. or just ADAM-0.1[78].0 bad?
  3. or: [fork Spark, shade Hadoop2 into it, and publish and link ADAM against that] bad?
  4. or: [publish two different versions of ADAM, for Spark <1.3 and Spark >=1.3] bad?

@fnothaft
Member

fnothaft commented Jun 2, 2015

So, I am tentatively thinking that depending on Spark 1.3.1 rules out building against Hadoop 1.0.4. This user@spark thread is relevant, and I just responded, cc'ing @fnothaft, explaining the failure we're seeing and linking to this PR.

Yeah... I am moderately frustrated here. Either we're doing something wrong with how we're building against Spark 1.3.1, or their testing infrastructure is really weak. While it would be a lie to say that running on Hadoop 1.0.4 has always been seamless, it's generally worked OK.

What is possible is that we're running into a "build-only" problem. E.g., it is possible that the Maven artifact for Spark org.apache.spark/spark-core:1.3.1 is only compatible with Hadoop 2.x, but the Apache release artifacts are valid (e.g., http://d3kbcqa49mib13.cloudfront.net/spark-1.3.1-bin-hadoop1.tgz will not have errors). If this is the case, perhaps we could resolve it by moving Spark and Hadoop from provided to system dependencies? We'd then need something in our build to check that SPARK_HOME is defined and is valid (or something like that), but that should be straightforward.
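
For concreteness, the "system dependency" idea would look roughly like the sketch below; the assembly jar name and the layout under SPARK_HOME are guesses, since they vary by Spark release and Hadoop build:

    <!-- sketch of the system-scope idea: resolve Spark from a downloaded
         release under SPARK_HOME rather than from Maven Central. The
         systemPath is a guess; the assembly jar name varies by release. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>1.3.1</version>
      <scope>system</scope>
      <systemPath>${env.SPARK_HOME}/lib/spark-assembly-1.3.1-hadoop1.0.4.jar</systemPath>
    </dependency>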

Spark 1.3.1 is "provided" and not bringing in the Hadoop2.2 dependency that it is hard-coded into it, which means that any Spark code-paths we exercise run the risk of using bits of Hadoop2 that might not have existed in a compatible way in Hadoop1.0.4.

We run a matrix build in Jenkins that sweeps the Hadoop version across 1.0.4, 2.2.0, 2.3.0. I believe this is sufficient for testing; it did catch the issue we would've run into here, and has caught similar past issues. Perhaps I'm misunderstanding what you're getting at?

I suspect there's a way to shade Hadoop 2.2 into Spark so that this would work for us, but I assume that would have to be done in Spark's POM; I've not seen Maven functionality for shading an arbitrary version of a dep's dep into said dep, if that makes sense.

I don't think there's a straightforward way to do that. I think the best solution would be a fix to how Spark distributes resources. E.g., Spark provides pre-built artifacts for different Hadoop versions. While they're building these artifacts, they could also deploy those same artifacts to Maven Central, with a classifier that corresponds to the Hadoop version.
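
If Spark did publish per-Hadoop artifacts that way, a downstream pom would select one with a classifier, roughly as below; the hadoop1 classifier here is hypothetical and does not actually exist for spark-core 1.3.1:

    <!-- hypothetical: no "hadoop1" classifier is actually published for
         spark-core 1.3.1; this only shows how a consumer would pick a
         Hadoop-specific artifact if one existed. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
      <classifier>hadoop1</classifier>
      <scope>provided</scope>
    </dependency>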

@fnothaft fnothaft modified the milestones: 0.18.0, 0.17.0 Jun 2, 2015
@ryan-williams
Member Author

I think we have the same understanding though we may have talked past each other a bit.

it is possible that the Maven artifact for Spark org.apache.spark/spark-core:1.3.1 is only compatible with Hadoop 2.x

Exactly. @srowen basically asserts this in his first email on the thread I linked previously (I can't link directly to it on that site; here it is on apache.org):

Yes, the published artifacts can only refer to one version of anything (OK, modulo publishing a large number of variants under classifiers).

So Spark is deliberately only publishing artifacts built for one version of Hadoop at a time, though they test against a matrix similar to ADAM's, I think. OTOH, Spark and ADAM both follow the common convention of actually shipping separate artifacts for Scala 2.10 vs. 2.11.

Finally, to your implication that Spark doesn't run tests for multiple versions of Hadoop: I also didn't think that was the case, but I can't find evidence to the contrary poking around Spark PRs and the Jenkins setup; that could be another reasonable question to ask if we ever get a response on the thread linked above.

we could resolve it by moving Spark and Hadoop from provided to system dependencies

From the little I know of Maven and provided vs. system deps, I think I would prefer that we just built the Spark we want and linked against that.

The best case is that there is an already-published spark-core 1.3.1 JAR that works for Hadoop1 that we can link against, and that we believe that whatever published that JAR will continue to do so for future Spark versions.

On that note, I'm curious about where you found the http://d3kbcqa49mib13.cloudfront.net/spark-1.3.1-bin-hadoop1.tgz you linked to?

@fnothaft
Member

fnothaft commented Jun 3, 2015

@ryan-williams
I do believe (but am not 100% sure) that I've run avocado against Spark 1.3.1/Hadoop 1.0.4 and saved data. I think there are two possibilities:

  1. The Spark 1.3.1/Hadoop 1.0.4 release artifacts are OK, but Spark 1.3.1/Hadoop 2.2.0 Maven artifacts cannot be used with Hadoop 1.0.4 (exclude Hadoop from Spark, provide explicit dependency on Hadoop 1.0.4). Since Spark and Hadoop are a provided dependency, the unit tests fail (since they rely on the artifacts that Maven pulls down) but the end-to-end pipeline is OK (because it depends on the provided Spark 1.3.1/Hadoop 1.0.4 release artifacts).
  2. Spark 1.3.1 cannot support Hadoop 1.0.4 because of the issue fixed by [SPARK-8057][Core]Call TaskAttemptContext.getTaskAttemptID using Reflection apache/spark#6599.

@srowen

In the first instance, you're talking about ADAM itself right? yes in general applications that use Hadoop need a recompile to work on 1.x versus 2.x. Spark also has to be compiled separately to run on Hadoop 1.x vs 2.x. But applications that run on Spark should not since they just interface with Spark, and the bits of the Hadoop API that Spark exposes don't vary from 1.x to 2.x. Of course, if an app separately used Hadoop APIs that did change from 1.x to 2.x they'd face the same problem.

Yes, agreed. I think my point was unclear in my original comment, but what I meant this comment to illustrate is why we needed an explicit dependency on Hadoop in our pom.

The error you show indicates that the deployed Spark is expecting 2.x but is getting 1.x at runtime. What are you deploying? it should be a Spark build for the 1.x you run in this case. The Maven artifacts aren't relevant.

Our unit tests are running using the Maven artifacts for Spark 1.3.1, with Hadoop excluded. We then have an explicit dependency on Hadoop version 1.0.4, so that we can build the correct input/output formats. The Maven artifacts are relevant because this error is coming up during a unit test run. Although the provided designation in Maven means that the final produced artifacts will not contain Hadoop or Spark, the unit tests will rely on the Maven artifacts for Hadoop and Spark.
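
The arrangement being described boils down to something like this simplified sketch (not the actual adam-parent pom; artifact and property names are assumed):

    <!-- simplified sketch of the build arrangement described above, not the
         actual ADAM pom: exclude Hadoop from the Spark Maven artifact, then
         depend on Hadoop explicitly at the version chosen with
         -Dhadoop.version=... -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
      <scope>provided</scope>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>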

This is an internal API call in Spark so how can it affect your app? I think the disconnect is either that you are trying to package Spark, or, that ADAM is more than an app (like it has direct Hadoop dependencies) and that is causing similar but separate issues. Yes you'd have to cross-compile too, yes you'd have to match up with the right Spark cross-compilation at runtime.

Historically, excluding Hadoop from Spark and then explicitly setting the Hadoop version was OK and would build and run unit tests fine. I think the problem we're running into here isn't an actual problem when we run ADAM as an app. I think it is solely a build problem that causes our unit tests to fail. I'm guessing that we could fix the issue by moving Spark from a provided to a system dependency, and then requiring the Spark published release to exist on the system. This would make us fully independent of the Spark Maven artifacts, which is kind of ugly.

I don't agree with apache/spark#6599 for the same reasons. There is no such thing as using the Maven artifact on Hadoop 1. Deployable assemblies are what gets run on a cluster. I guess I am not getting the problem yet, at least given the deployment model Spark requires.

I'm in agreement here. I don't think we're seeing a problem that would manifest during deployment (since the dependencies are provided); rather, I think we're seeing a build issue (that causes our unit tests to fail on Hadoop 1.0.4). That being said, I think that the way we'd need to fix it downstream (in ADAM, that is) is really ugly (the above solution with system dependencies), and I think the best way to handle this would be for Spark to publish Maven artifacts (under a classifier) for the spectrum of Hadoop versions.

@srowen

srowen commented Jun 3, 2015

Yeah, so you're making an ad-hoc "deployment" from Maven artifacts for tests. As far as I know that was never guaranteed to work and would surprise me if it had been across 1.x and 2.x at the same time. Maybe it very nearly does, and theory aside, a little change might make it still happen to work and that's worth it. It's brittle though and I expect that 1.x support will just go away more readily than 2-3x more artifacts are published for this. No idea about spark-ec2. It's a different rhetorical question, but that's such an old Hadoop version, is it important?

@fnothaft
Member

fnothaft commented Jun 3, 2015

Yeah, so you're making an ad-hoc "deployment" from Maven artifacts for tests. As far as I know that was never guaranteed to work and would surprise me if it had been across 1.x and 2.x at the same time. Maybe it very nearly does, and theory aside, a little change might make it still happen to work and that's worth it.

I don't know if it's ever been "guaranteed" to work, but we've done this split Hadoop 1.0.4/2.2.0/2.3.0 build for CI without issues since pre-Spark 1.0.0 IIRC.

It's brittle though and I expect that 1.x support will just go away more readily than 2-3x more artifacts are published for this. No idea about spark-ec2. It's a different rhetorical question, but that's such an old Hadoop version, is it important?

The reason I mention spark-ec2 is because spark-ec2 is the only reason we've continued building for 1.0.4. Specifically, spark-ec2 uses 1.0.4 and has bad support for the Hadoop 2.x stream. In any case, spark-ec2 is fragile, and we're trying to move away from it.

The reason I suggest publishing more artifacts is that Spark is publishing these artifacts already (to the Spark download page); it just isn't publishing them to Maven. It seems to me like this should be a fairly seamless change to make, but:

  1. My assumptions about the ease of adding this to the Spark release process may be incorrect.
  2. It sounds like Spark is planning to formally drop Hadoop 1.x support (although, has this been communicated anywhere? I'm not surprised that this is the plan, but I hadn't heard about it), which would render the issue moot.

Anyways, Spark isn't going to retroactively push Spark 1.3.1 artifacts to Maven with a classifier for Hadoop 1.x, so this discussion is somewhat hypothetical. On a more concrete level, it sounds like our choices are:

  1. Drop official Hadoop 1.x support in ADAM post 0.17.0 (and, implicitly drop support for running ADAM on a spark-ec2 cluster)
  2. Do a hacky build that would depend on system dependencies for Spark in Maven.

@srowen

srowen commented Jun 3, 2015

I should probably direct comment to apache/spark#6599 since here I'm just speaking for myself, and that's not necessarily representative. I can see the very practical argument that a small fix is worth it to preserve some existing behavior even if not guaranteed; my only open question is whether that's really all this needs, but it may be so.

Yeah I don't know much about spark-ec2. I would assume it can / will work with Hadoop 2 if needed, especially since Hadoop 1 is getting so old.

I think it's only moderately hard to add more Maven artifacts but it sure makes a mess -- Spark would deploy almost 100 artifacts per version. Now you have to change the artifacts you even depend on based on the version of another thing -- that's hard to write in Maven. It also exacerbates the Scala 2.10/2.11 problem for which there's really no other answer. This is why I can appreciate that anything reasonable to avoid this is worthwhile.

I speak for myself when I say I think 1.x support is going away, but I do think it's true in the medium term. It doesn't get much usage; 2.2 is already the default; old YARN support already went away.

@ryan-williams
Member Author

@srowen your point makes sense that we are:

  1. pulling in a Spark that was explicitly built against Hadoop 2,
  2. giving it Hadoop 1 at runtime,
  3. observing that it doesn't work, due to the Hadoop-version mismatch.

I agree that reflection whack-a-mole in Spark to try to make all Hadoop API calls {1,2}-cross-compatible is not a good solution, even if it is just the one call in spark#6599 that doesn't work today.

However, I'm curious about your hesitance to just fix this in Maven:

Now you have to change the artifacts you even depend on based on the version of another thing

This is true, but the "another thing" in question (i.e. Hadoop version) affects whether programs will execute successfully or not; such a thing seems absolutely worth specifying explicitly, to me.

Our (ADAM's) problems here start with 1. above: we have no choice but to link against a Spark built for Hadoop 2 (or use Maven in a brittler/messier way, e.g. manually downloading Spark artifacts as system deps); given the choice, we (ADAM) would happily adjust our POMs and workflows to link against the correct Spark JAR for each Hadoop version we want to support.

that's hard to write in Maven

IANA Maven expert, but won't this just mean we'll have to have profiles for the version of Hadoop we want (hopefully very coarse-grained, e.g. -Phadoop1 and a default Hadoop-2.*)? I thought this was solvable and something Spark already does in cases where it's simply not possible/feasible to release one Spark JAR that will work in different runtimes.

Most importantly, said profiles only need exist in things that depend on Spark, e.g. ADAM; they wouldn't require any changes to Spark other than the publishing of artifacts to Maven that are already built and published to https://spark.apache.org/downloads.html.
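
A coarse-grained profile of that sort might look like the sketch below. It assumes the hypothetical Hadoop-specific Spark classifiers discussed upthread, which Spark does not actually publish; the point is only the shape of the change on the consumer side:

    <!-- sketch only: a -Phadoop1 profile that swaps both the Hadoop version
         and a hypothetical Hadoop-1 Spark classifier. Neither classifier
         exists today; the spark-core dependency would reference
         ${spark.classifier}. -->
    <properties>
      <hadoop.version>2.2.0</hadoop.version>
      <spark.classifier>hadoop2</spark.classifier>
    </properties>

    <profiles>
      <profile>
        <id>hadoop1</id>
        <properties>
          <hadoop.version>1.0.4</hadoop.version>
          <spark.classifier>hadoop1</spark.classifier>
        </properties>
      </profile>
    </profiles>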

@srowen

srowen commented Jun 3, 2015

I think apache/spark#6599 is a much easier way forward, if it really does do the trick, since it only has to last until 1.x goes away.

Yes, specifying dependency versions is vital in general. This makes it significantly harder to set the Hadoop version, since you are not only setting the Hadoop version but must also apply one Maven profile or the other to get the correct Spark version. Defaults won't even work. A hard solution to a problem that must be solved is better than no solution, but the PR above is much more innocuous.

@ryan-williams
Member Author

@srowen fair enough. We are also assuming that spark#6599 fixes the only incompatible Hadoop API call that we are going to run into here; I'll build a Spark 1.3.1 with spark#6599 patched in and run ADAM tests against it now, to check.

@ryan-williams
Member Author

I ran ADAM tests against Spark-1.3.1+#6599 and Hadoop 1.0.4; the crash reported above is fixed, but the same issue crops up elsewhere:

java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.bdgenomics.adam.io.SingleFastqInputFormat.createRecordReader(SingleFastqInputFormat.java:291)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:131)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This time it's ADAM code that is calling a method on a TaskAttemptContext: SingleFastqInputFormat.java:291, so the "null hypothesis" that Spark is fixed by spark#6599 remains unrefuted.

I'll try to work around the issue in that ADAM code path and see where we are then.

@ryan-williams
Member Author

OK @fnothaft, check out the 2nd commit here (853b45c): it creates a wrapper for TaskAttemptContext.setStatus in the style of spark#6599 and uses it in SingleFastqInputFormat.

This gets all the tests passing, locally at least, and when run against a Spark-1.3.1 with spark#6599 patched in.

Totally mysterious to me is why the same call in InterleavedFastqInputFormat is not causing tests to fail; without this hack, I saw SingleFastqInputFormatSuite, AlignmentRecordRDDFunctionsSuite, and ADAMContextSuite failing (full test failure output), but SingleFastqInputFormatSuite and InterleavedFastqInputFormatSuite look basically identical to me.
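
For readers who haven't opened 853b45c: the workaround amounts to something like the sketch below (a paraphrase, not the committed code; the committed helper is apparently named ADAMHadoopUtil, per the comments that follow). Calling setStatus reflectively means the compiled bytecode never commits to TaskAttemptContext being a class (Hadoop 1) or an interface (Hadoop 2):

    // Sketch of the spark#6599-style workaround, not the exact code in
    // 853b45c: resolve setStatus at runtime so no invokeinterface or
    // invokevirtual instruction is baked in against TaskAttemptContext.
    import java.lang.reflect.Method;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public final class TaskAttemptContextShim {
      public static void setStatus(TaskAttemptContext context, String status) {
        try {
          Method m = context.getClass().getMethod("setStatus", String.class);
          m.invoke(context, status);
        } catch (Exception e) {
          throw new RuntimeException("reflective setStatus call failed", e);
        }
      }
    }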

@massie
Member

massie commented Jun 3, 2015

@ryan-williams I think that wrapping the Hadoop Job using ContextUtil.getConfiguration(job) will do the trick here. The Hadoop 1 and 2 API incompatibilities included changing a class to an interface (sigh). The ContextUtil helper will use reflection to work around the mess.
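
For context, the parquet helper being suggested is used roughly like this (a minimal usage sketch; whether it would have covered the setStatus call at issue here is what gets sorted out in the next couple of comments):

    // Minimal usage sketch of the parquet helper massie mentions: it uses
    // reflection internally so callers don't bind at compile time to the
    // Hadoop 1 (class) vs. Hadoop 2 (interface) JobContext/TaskAttemptContext.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import parquet.hadoop.util.ContextUtil;

    public class ContextUtilExample {
      public static Configuration configurationOf(Job job) {
        return ContextUtil.getConfiguration(job);
      }
    }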

@ryan-williams
Member Author

@massie interesting, I was not aware of the ContextUtil you speak of; you're referring to parquet.hadoop.util.ContextUtil, already used elsewhere in ADAM, I'm inferring?

I'm afraid I don't completely follow how you're suggesting I use it here; will it more cleanly work around:

  1. the internal Spark use of TaskAttemptContext that was crashing at SparkHadoopMapRedUtil.scala:95?
  2. ADAM's use of TaskAttemptContext at SingleFastqInputFormat.java:291 (and also possibly InterleavedFastqInputFormat.java:314)?

If you can provide more details, I'm happy to try to use it instead; I believe there is a better way than what I've done here!

@massie
Member

massie commented Jun 3, 2015

I was referring to parquet.hadoop.util.ContextUtil. Your ADAMHadoopUtil looks like a better way to go. Sorry for the GitHub comment noise.

@ryan-williams
Member Author

ah ok, no worries, thanks for pointing that out to me, I am uninitiated to the world of Hadoop 1-vs-2 incompatibilities and workarounds thereof!

@fnothaft
Member

fnothaft commented Jun 3, 2015

@ryan-williams

This time it's ADAM code that is calling a method on a TaskAttemptContext: SingleFastqInputFormat.java:291, so the "null hypothesis" that Spark is fixed by spark#6599 remains unrefuted.

Just to confirm, when you built ADAM, you did provide the -Dhadoop.version=1.0.4 on the Maven command line? If you did not force the Hadoop version in the ADAM build, I would expect this error. If you do force the Hadoop version, then you should get a clean build.

@srowen

Yeah I don't know much about spark-ec2. I would assume it can / will work with Hadoop 2 if needed, especially since Hadoop 1 is getting so old.

Technically speaking, spark-ec2 does "support" Hadoop 2.0.0, but that's the only Hadoop 2 minor version it supports. Additionally, Hadoop MR is broken for Hadoop 2.0.0 in spark-ec2 (paging @tomwhite who can provide more details, if interested), which really severely limits the utility of spark-ec2's Hadoop 2.x support.

I think it's only moderately hard to add more Maven artifacts but it sure makes a mess -- Spark would deploy almost 100 artifacts per version.

Spark provides 7 pre-built versions, times two for the Scala 2.10/2.11 change, so it should be 14, not 100 artifacts, no? Perhaps I'm missing something.

Now you have to change the artifacts you even depend on based on the version of another thing -- that's hard to write in Maven.

Unless I'm misunderstanding what you're proposing, this is trivial to do in Maven via either profiles or classifiers.

Keep in mind that the problem we're discussing here only pops up if the project that is building against Spark also needs to build against Hadoop public interfaces (i.e., they define custom input formats). In most projects, you can use Spark as a provided dependency and just rely on the transitive Hadoop dependency and be fine. Since we depend on Hadoop anyways, we already have to parametrize our build for this. Adding a second level of parametrization (parametrize which version of Spark to pull down based on a desired Hadoop version) would be easy to layer on top.

TL;DR: exactly what @ryan-williams said:

IANA Maven expert, but won't this just mean we'll have to have profiles for the version of Hadoop we want (hopefully very coarse-grained, e.g. -Phadoop1 and a default Hadoop-2.*)? I thought this was solvable and something Spark already does in cases where it's simply not possible/feasible to release one Spark JAR that will work in different runtimes.

I agree pretty much verbatim with @ryan-williams's comment upthread.

It also exacerbates the Scala 2.10/2.11 problem for which there's really no other answer. This is why I can appreciate that anything reasonable to avoid this is worthwhile.

Doing split builds for Scala 2.10/2.11 purely in Maven is difficult because, by convention, Scala projects change the artifact name with a change in major Scala version, which Maven doesn't support. However, it's fairly straightforward to script these split builds up (e.g., see https://github.com/bigdatagenomics/adam/blob/master/scripts/release/release.sh).

@srowen

srowen commented Jun 3, 2015

I'm talking about Maven artifacts, which are quite a separate issue from the pre-built assemblies you are linking to. These are not at all the same thing. Spark has ~22 Maven modules, times 2 Scala versions. Times 2 Hadoop versions is 88!

"Just" having to make 2 profiles and always having to specify at least 1 of them on a command line seems like a lot of pain for a dependency to require. Definitely doable but not fun, even for projects that are willing to do it.

I'd definitely prefer to hear that a few tweaks just make this problem go away rather than bother with Hadoop-specific artifacts, which I doubt will happen. Why is it preferable to have this mess spill into the Maven artifacts? This seems a few steps down the list of backup plans.

@ryan-williams
Member Author

@fnothaft good question, I'll double-check that I was seeing that error on a -Dhadoop.version=1.0.4 ADAM.

@srowen

"Just" having to make 2 profiles and always having to specify at least 1 of them on a command line seems like a lot of pain for a dependency to require.

What would "always" have to be specified on the cmdline? IIUC, ADAM would default to Hadoop-2 artifacts for everything, and have a -Phadoop-1.0.4 profile instead of the -Dhadoop.version=1.0.4 we have today… and ADAM is a relatively complex case.

Most things that depend on Spark are probably OK supporting one Hadoop (major) version at a time; they would just link to those artifacts in their POM and be done with it.

Anyway, I agree that merging in spark#6599 is simpler in the short term, so I think we're just debating whether the Maven route is

  1. better in the long term, and
  2. better enough in the long term to warrant the cost of publishing more Maven artifacts.

Seems like @fnothaft and I definitely believe 1), while 2) is debatable, but that you are skeptical about 1) because of the complexity you think it will add to things that depend on Spark…?

@srowen

srowen commented Jun 3, 2015

The gotcha is that all default profiles get disabled once any profile is enabled. You can probably use a sys property to make this better I suppose.

I suspect that 1.x support will just be dropped before publishing 2x the artifacts happens, as this is much madness to support very few deployments. IMHO.
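
The sys-property variant would look something like this sketch ("hadoop.major" is a made-up property name for illustration); unlike activeByDefault, property-based activation isn't switched off when another profile is enabled on the command line:

    <!-- sketch of the sys-property activation idea; "hadoop.major" is a
         made-up property name. Property-based activation, unlike
         activeByDefault, keeps working when other profiles are enabled. -->
    <profile>
      <id>hadoop1</id>
      <activation>
        <property>
          <name>hadoop.major</name>
          <value>1</value>
        </property>
      </activation>
      <properties>
        <hadoop.version>1.0.4</hadoop.version>
      </properties>
    </profile>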

@tomwhite
Member

tomwhite commented Jun 4, 2015

To add to @fnothaft's comment about the spark-ec2 scripts: they are great for running Spark, but the version of MR that they install is old (1.x or 2.0, not the later 2.x versions) so it's probably best to avoid running MR jobs on a spark-ec2 cluster. ADAM doesn't have any MR jobs (although see #651), so it's not a problem.

Is there any reason not to drop Hadoop 1 support? For the spark-ec2 scripts HDFS installation, the Hadoop 2 version could be used. Projects like Crunch and Kite have dropped Hadoop 1 support, or are in the process of doing so.

@fnothaft
Copy link
Member

fnothaft commented Jun 4, 2015

@tomwhite
You'll need a MR cluster up if you're going to distcp data from S3, no? It just seems to me that if MR is broken on spark-ec2 Hadoop 2.0.0, we can't actually functionally use --hadoop-major-version=2 for spark-ec2.

@srowen

I'm talking about Maven artifacts, which are quite a separate issue from the pre-built assemblies you are linking to. These are not at all the same thing. Spark has ~22 Maven modules, times 2 Scala versions. Times 2 Hadoop versions is 88!

I know that they're separate. I don't know how Spark is doing their Maven releases (e.g., we use the OSS Sonatype Maven release flow, which may be very different from the Apache Maven release flow), but for us, releasing the parent project (adam-parent) is essentially a "one click" operation that releases all 4 of our modules. For us, even releasing Scala 2.10/2.11 artifacts is effectively a "one click" operation...

In any case, it sounds like there isn't interest in Spark distributing Maven artifacts that are separate for Hadoop 1.x and 2.x, so I'm going to drop this point. I suppose that the takeaway then is that we will drop Hadoop 1.x support in the next release of ADAM. We'll then need to shake out things downstream of that.

seems like a smooth transition, but unused AWS dep was pulling in a too-old jackson.core/jackson.annotations.

fixes bigdatagenomics#665
@ryan-williams
Member Author

I'm picking this back up since it sounds like we're cool with dropping Hadoop 1 support from ADAM. cc @heuermh who also expressed interest in this.

@heuermh
Member

heuermh commented Aug 3, 2015

Great!

The jackson dependency issues should have been resolved via #744. I'm not sure what to do about bigdatagenomics/utils#41.

@fnothaft
Member

fnothaft commented Aug 4, 2015

@ryan-williams @laserson @heuermh should we close this in favor of #753?

@laserson
Contributor

laserson commented Aug 4, 2015

sgtm

@ryan-williams ryan-williams deleted the spark-1.3.1 branch August 6, 2015 20:55