upgrade to Spark 1.3.1 #690

Closed

Conversation

ryan-williams
Member

seems smooth, but a couple old jackson deps needed excluding

fixes #665

@ryan-williams
Member Author

[ERROR] Non-resolvable parent POM: Could not find artifact org.bdgenomics.adam:adam-parent:pom:0.16.1-SNAPSHOT and 'parent.relativePath' points at wrong local POM @ line 4, column 13 -> [Help 2]

some kind of POM error? It doesn't seem like it's my fault, from what I can tell…

@fnothaft fnothaft added this to the 0.17.0 milestone Jun 1, 2015
@@ -473,6 +473,16 @@
<artifactId>aws-java-sdk</artifactId>
Member

Are we actually using this dependency right now? I don't think we are.

Member

Might be preferable to just remove the dependency (as opposed to excluding things).

Member Author

side note: GitHub is doing a really unfortunate thing and showing your old comments next to my new diffs

@ryan-williams
Member Author

rebased

@ryan-williams
Member Author

let me try removing that dep

@ryan-williams
Member Author

removed the dep, and excluded it from being brought in by utils-io.

on second thought, that might not be the correct thing to do? does utils-io actually need that dep? maybe I should just more narrowly exclude the jackson.core that comes with utils-io's AWS dep?

@ryan-williams
Member Author

I'm going to do what I said there and switch to just excluding the jackson things, not the aws-java-sdk, since utils-io does seem to use it.
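
For reference, the narrower exclusion being described would look roughly like the sketch below. The exact coordinates (the utils-io artifact, and the Jackson pieces its aws-java-sdk drags in) are assumed from this discussion and the commit message further down, not copied from the merged diff:

    <!-- sketch only: keep utils-io's aws-java-sdk, but exclude the stale
         Jackson it pulls in transitively so Spark 1.3.1's Jackson version
         wins. Coordinates are assumed, not taken from the actual diff. -->
    <dependency>
      <groupId>org.bdgenomics.utils</groupId>
      <artifactId>utils-io_2.10</artifactId>
      <version>${utils.version}</version>
      <exclusions>
        <exclusion>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-annotations</artifactId>
        </exclusion>
      </exclusions>
    </dependency>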

@ryan-williams
Member Author

In other news, I can't get a break with Jenkins today!

The only thing that looks like an error is:

[ERROR] Error formatting /home/jenkins/workspace/ADAM-prb/SCALAVER/2.10/hadoop.version/1.0.4/label/centos/adam-cli/src/main/scala/org/bdgenomics/adam/cli/PrintADAM.scala: java.nio.charset.MalformedInputException: Input length = 1

Link.

I've seen that a bunch of times today on different PRs, no idea what it means.

@ryan-williams
Member Author

OK, I think this is a better version!

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams

I'm going to do what I said there and switch to just excluding the jackson things, not the aws-java-sdk, since utils-io does seem to use it.

That sounds reasonable here. I've opened https://github.com/bigdatagenomics/utils/issues/41 to remove the AWS SDK from utils-io. We do need the dependency right now, but I believe the piece of code that we need it for is dead.

@fnothaft
Member

fnothaft commented Jun 2, 2015

OK, I think this is a better version!

+1, agreed!

@ryan-williams
Member Author

If the code that uses it in utils-io is never exercised (which is what I interpreted "dead code" to mean), then the coarser exclusion might have "worked" here, but yeah, this seems safer/better.

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams yeah, that's what I was meaning. The utils-io code that depends on the AWS SDK was part of the old S3 Parquet loader, which doesn't exist anymore. I agree though that this approach is safer.

@ryan-williams
Member Author

Jenkins!

What does it mean?

@fnothaft
Member

fnothaft commented Jun 2, 2015

Jenkins, retest this please.

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams I don't think the error you're seeing in:

Jenkins!

What does it mean?

is the error. I'm seeing:

2015-06-01 17:59:57 WARN  LoadSnappy:46 - Snappy native library not loaded
2015-06-01 17:59:57 ERROR Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
    at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-06-01 17:59:57 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
    at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-06-01 17:59:57 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
    at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2015-06-01 17:59:57 ERROR TaskSetManager:75 - Task 0 in stage 0.0 failed 1 times; aborting job

This is only for the Hadoop 1.0.4 builds. I'm wondering if this is an issue in Spark?

@fnothaft
Member

fnothaft commented Jun 2, 2015

I'm going to check this out locally and run with 1.0.4 and see if this failure reproduces.

@fnothaft
Member

fnothaft commented Jun 2, 2015

Ahhhh... this fails locally (with the error Jenkins is reporting) for me when running:

mvn package -Dhadoop.version=1.0.4

Surely, they didn't break Hadoop 1.0.4 support in Spark 1.3.1, did they?

@fnothaft
Member

fnothaft commented Jun 2, 2015

@ryan-williams I'm hoping to cut a 0.17.0 release tomorrow. Do you want to check and see if you can resolve the Spark 1.3.1/Hadoop 1.0.4 issue by tomorrow? If you can't, shall we slip this to 0.18.0?

@ryan-williams
Member Author

good call, thanks for catching. I'll see what I can do, though not sure how much time I'll have :-\

@fnothaft
Member

fnothaft commented Jun 2, 2015

No worries. We can take more of a hack at this later.

@ryan-williams
Member Author

So, I am tentatively thinking that depending on Spark 1.3.1 rules out building against Hadoop 1.0.4. This user@spark thread is relevant, and I just responded, cc'ing @fnothaft, explaining the failure we're seeing and linking to this PR.

Spark 1.3.1 is "provided" and we are not bringing in the Hadoop 2.2 dependency that is hard-coded into it, which means that any Spark code paths we exercise run the risk of using bits of Hadoop 2 that might not have existed in a compatible form in Hadoop 1.0.4.

I suspect there's a way to shade Hadoop 2.2 into Spark so that this would work for us, but I assume that would have to be done in Spark's POM; I've not seen Maven functionality for shading an arbitrary version of a dep's dep into said dep, if that makes sense.

Curious to hear others' thoughts! How bad is the prospect of dropping Hadoop 1.0.4 support?

  1. ADAM-1.0 bad?
  2. or just ADAM-0.1[78].0 bad?
  3. or: [fork Spark, shade Hadoop2 into it, and publish and link ADAM against that] bad?
  4. or: [publish two different versions of ADAM, for Spark <1.3 and Spark >=1.3] bad?

@fnothaft
Member

fnothaft commented Jun 2, 2015

So, I am tentatively thinking that depending on Spark 1.3.1 rules out building against Hadoop 1.0.4. This user@spark thread is relevant, and I just responded, cc'ing @fnothaft, explaining the failure we're seeing and linking to this PR.

Yeah... I am moderately frustrated here. Either we're doing something wrong with how we're building against Spark 1.3.1, or their testing infrastructure is really weak. While it would be a lie to say that running on Hadoop 1.0.4 has always been seamless, it's generally worked OK.

What is possible is that we're running into a "build-only" problem. E.g., it is possible that the Maven artifact for Spark org.apache.spark/spark-core:1.3.1 is only compatible with Hadoop 2.x, but the Apache release artifacts are valid (e.g., http://d3kbcqa49mib13.cloudfront.net/spark-1.3.1-bin-hadoop1.tgz will not have errors). If this is the case, perhaps we could resolve it by moving Spark and Hadoop from provided to system dependencies? We'd then need something in our build to check that SPARK_HOME is defined and is valid (or something like that), but that should be straightforward.
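
For concreteness, the "system dependency" idea would look roughly like the sketch below; the assembly jar name and the layout under SPARK_HOME are guesses, since they vary by Spark release and Hadoop build:

    <!-- sketch of the system-scope idea: resolve Spark from a downloaded
         release under SPARK_HOME rather than from Maven Central. The
         systemPath is a guess; the assembly jar name varies by release. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>1.3.1</version>
      <scope>system</scope>
      <systemPath>${env.SPARK_HOME}/lib/spark-assembly-1.3.1-hadoop1.0.4.jar</systemPath>
    </dependency>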

Spark 1.3.1 is "provided" and not bringing in the Hadoop2.2 dependency that it is hard-coded into it, which means that any Spark code-paths we exercise run the risk of using bits of Hadoop2 that might not have existed in a compatible way in Hadoop1.0.4.

We run a matrix build in Jenkins that sweeps the Hadoop version across 1.0.4, 2.2.0, 2.3.0. I believe this is sufficient for testing; it did catch the issue we would've run into here, and has caught similar past issues. Perhaps I'm misunderstanding what you're getting at?

I suspect there's a way to shade Hadoop 2.2 into Spark so that this would work for us, but I assume that would have to be done in Spark's POM; I've not seen Maven functionality for shading an arbitrary version of a dep's dep into said dep, if that makes sense.

I don't think there's a straightforward way to do that. I think the best solution would be a fix to how Spark distributes resources. E.g., Spark provides pre-built artifacts for different Hadoop versions. While they're building these artifacts, they could also deploy those same artifacts to Maven Central, with a classifier that corresponds to the Hadoop version.
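
If Spark did publish per-Hadoop artifacts that way, a downstream pom would select one with a classifier, roughly as below; the hadoop1 classifier here is hypothetical and does not actually exist for spark-core 1.3.1:

    <!-- hypothetical: no "hadoop1" classifier is actually published for
         spark-core 1.3.1; this only shows how a consumer would pick a
         Hadoop-specific artifact if one existed. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
      <classifier>hadoop1</classifier>
      <scope>provided</scope>
    </dependency>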

@fnothaft fnothaft modified the milestones: 0.18.0, 0.17.0 Jun 2, 2015
@ryan-williams
Member Author

I think we have the same understanding though we may have talked past each other a bit.

it is possible that the Maven artifact for Spark org.apache.spark/spark-core:1.3.1 is only compatible with Hadoop 2.x

Exactly. @srowen basically asserts this in his first email on the thread I linked previously (I can't link directly to it on that site; here it is on apache.org):

Yes, the published artifacts can only refer to one version of anything (OK, modulo publishing a large number of variants under classifiers).

So Spark is deliberately only publishing artifacts built for one version of Hadoop at a time, though they test against a matrix similar to ADAM's, I think. OTOH, Spark and ADAM both follow the common convention of actually shipping separate artifacts for Scala 2.10 vs. 2.11.

Finally, to your implication that Spark doesn't run tests for multiple versions of Hadoop: I also didn't think that was the case, but I can't find evidence to the contrary poking around Spark PRs and the Jenkins setup; that could be another reasonable question to ask if we ever get a response on the thread linked above.

we could resolve it by moving Spark and Hadoop from provided to system dependencies

From the little I know of Maven and provided vs. system deps, I think I would prefer that we just built the Spark we want and linked against that.

The best case is that there is an already-published spark-core 1.3.1 JAR that works for Hadoop1 that we can link against, and that we believe that whatever published that JAR will continue to do so for future Spark versions.

On that note, I'm curious about where you found the http://d3kbcqa49mib13.cloudfront.net/spark-1.3.1-bin-hadoop1.tgz you linked to?

@fnothaft
Member

fnothaft commented Jun 3, 2015

@ryan-williams
I do believe (but am not 100% sure) that I've run avocado against Spark 1.3.1/Hadoop 1.0.4 and saved data. I think there are two possibilities:

  1. The Spark 1.3.1/Hadoop 1.0.4 release artifacts are OK, but Spark 1.3.1/Hadoop 2.2.0 Maven artifacts cannot be used with Hadoop 1.0.4 (exclude Hadoop from Spark, provide explicit dependency on Hadoop 1.0.4). Since Spark and Hadoop are a provided dependency, the unit tests fail (since they rely on the artifacts that Maven pulls down) but the end-to-end pipeline is OK (because it depends on the provided Spark 1.3.1/Hadoop 1.0.4 release artifacts).
  2. Spark 1.3.1 cannot support Hadoop 1.0.4 because of the issue fixed by [SPARK-8057][Core]Call TaskAttemptContext.getTaskAttemptID using Reflection apache/spark#6599.

@srowen

In the first instance, you're talking about ADAM itself right? yes in general applications that use Hadoop need a recompile to work on 1.x versus 2.x. Spark also has to be compiled separately to run on Hadoop 1.x vs 2.x. But applications that run on Spark should not since they just interface with Spark, and the bits of the Hadoop API that Spark exposes don't vary from 1.x to 2.x. Of course, if an app separately used Hadoop APIs that did change from 1.x to 2.x they'd face the same problem.

Yes, agreed. I think my point was unclear in my original comment, but what I meant this comment to illustrate is why we needed an explicit dependency on Hadoop in our pom.

The error you show indicates that the deployed Spark is expecting 2.x but is getting 1.x at runtime. What are you deploying? it should be a Spark build for the 1.x you run in this case. The Maven artifacts aren't relevant.

Our unit tests are running using the Maven artifacts for Spark 1.3.1, with Hadoop excluded. We then have an explicit dependency on Hadoop version 1.0.4, so that we can build the correct input/output formats. The Maven artifacts are relevant because this error is coming up during a unit test run. Although the provided designation in Maven means that the final produced artifacts will not contain Hadoop or Spark, the unit tests will rely on the Maven artifacts for Hadoop and Spark.
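
The arrangement being described boils down to something like this simplified sketch (not the actual adam-parent pom; artifact and property names are assumed):

    <!-- simplified sketch of the build arrangement described above, not the
         actual ADAM pom: exclude Hadoop from the Spark Maven artifact, then
         depend on Hadoop explicitly at the version chosen with
         -Dhadoop.version=... -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.1</version>
      <scope>provided</scope>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>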

This is an internal API call in Spark so how can it affect your app? I think the disconnect is either that you are trying to package Spark, or, that ADAM is more than an app (like it has direct Hadoop dependencies) and that is causing similar but separate issues. Yes you'd have to cross-compile too, yes you'd have to match up with the right Spark cross-compilation at runtime.

Historically, excluding Hadoop from Spark and then explicitly setting the Hadoop version was OK and would build and run unit tests fine. I think the problem we're running into here isn't an actual problem when we run ADAM as an app. I think it is solely a build problem that causes our unit tests to fail. I'm guessing that we could fix the issue by moving Spark from a provided to a system dependency, and then requiring the Spark published release to exist on the system. This would make us fully independent of the Spark Maven artifacts, which is kind of ugly.

I don't agree with apache/spark#6599 for the same reasons. There is no such thing as using the Maven artifact on Hadoop 1. Deployable assemblies are what gets run on a cluster. I guess I am not getting the problem yet, at least given the deployment model Spark requires.

I'm in agreement here. I don't think we're seeing a problem that would manifest during deployment (since the dependencies are provided); rather, I think we're seeing a build issue (that causes our unit tests to fail on Hadoop 1.0.4). That being said, I think that the way we'd need to fix it downstream (in ADAM, that is) is really ugly (the above solution with system dependencies), and I think the best way to handle this would be for Spark to publish Maven artifacts (under a classifier) for the spectrum of Hadoop versions.

@srowen

srowen commented Jun 3, 2015

Yeah, so you're making an ad-hoc "deployment" from Maven artifacts for tests. As far as I know that was never guaranteed to work and would surprise me if it had been across 1.x and 2.x at the same time. Maybe it very nearly does, and theory aside, a little change might make it still happen to work and that's worth it. It's brittle though and I expect that 1.x support will just go away more readily than 2-3x more artifacts are published for this. No idea about spark-ec2. It's a different rhetorical question, but that's such an old Hadoop version, is it important?

@fnothaft
Member

fnothaft commented Jun 3, 2015

Yeah, so you're making an ad-hoc "deployment" from Maven artifacts for tests. As far as I know that was never guaranteed to work and would surprise me if it had been across 1.x and 2.x at the same time. Maybe it very nearly does, and theory aside, a little change might make it still happen to work and that's worth it.

I don't know if it's ever been "guaranteed" to work, but we've done this split Hadoop 1.0.4/2.2.0/2.3.0 build for CI without issues since pre-Spark 1.0.0 IIRC.

It's brittle though and I expect that 1.x support will just go away more readily than 2-3x more artifacts are published for this. No idea about spark-ec2. It's a different rhetorical question, but that's such an old Hadoop version, is it important?

The reason I mention spark-ec2 is because spark-ec2 is the only reason we've continued building for 1.0.4. Specifically, spark-ec2 uses 1.0.4 and has bad support for the Hadoop 2.x stream. In any case, spark-ec2 is fragile, and we're trying to move away from it.

The reason I suggest publishing more artifacts is that Spark is publishing these artifacts already (to the Spark download page); it just isn't publishing them to Maven. It seems to me like this should be a fairly seamless change to make, but:

  1. My assumptions about the ease of adding this to the Spark release process may be incorrect.
  2. It sounds like Spark is planning to formally drop Hadoop 1.x support (although, has this been communicated anywhere? I'm not surprised that this is the plan, but I hadn't heard about it), which would render the issue moot.

Anyways, Spark isn't going to retroactively push Spark 1.3.1 artifacts to Maven with a classifier for Hadoop 1.x, so this discussion is somewhat hypothetical. On a more concrete level, it sounds like our choices are:

  1. Drop official Hadoop 1.x support in ADAM post 0.17.0 (and, implicitly drop support for running ADAM on a spark-ec2 cluster)
  2. Do a hacky build that would depend on system dependencies for Spark in Maven.

@srowen

srowen commented Jun 3, 2015

I should probably direct comment to apache/spark#6599 since here I'm just speaking for myself, and that's not necessarily representative. I can see the very practical argument that a small fix is worth it to preserve some existing behavior even if not guaranteed; my only open question is whether that's really all this needs, but it may be so.

Yeah I don't know much about spark-ec2. I would assume it can / will work with Hadoop 2 if needed, especially since Hadoop 1 is getting so old.

I think it's only moderately hard to add more Maven artifacts but it sure makes a mess -- Spark would deploy almost 100 artifacts per version. Now you have to change the artifacts you even depend on based on the version of another thing -- that's hard to write in Maven. It also exacerbates the Scala 2.10/2.11 problem for which there's really no other answer. This is why I can appreciate that anything reasonable to avoid this is worthwhile.

I speak for myself when I say I think 1.x support is going away, but I do think it's true in the medium term. It doesn't get much usage; 2.2 is already the default; old YARN support already went away.

@ryan-williams
Member Author

@srowen your point makes sense that we are:

  1. pulling in a Spark that was explicitly built against Hadoop 2,
  2. giving it Hadoop 1 at runtime,
  3. observing that it doesn't work, due to the Hadoop-version mismatch.

I agree that reflection whack-a-mole in Spark to try to make all Hadoop API calls {1,2}-cross-compatible is not a good solution, even if it is just the one call in spark#6599 that doesn't work today.

However, I'm curious about your hesitance to just fix this in Maven:

Now you have to change the artifacts you even depend on based on the version of another thing

This is true, but the "another thing" in question (i.e. Hadoop version) affects whether programs will execute successfully or not; such a thing seems absolutely worth specifying explicitly, to me.

Our (ADAM's) problems here start with 1. above: we have no choice but to link against a Spark built for Hadoop 2 (or use Maven in a brittler/messier way, e.g. manually downloading Spark artifacts as system deps); given the choice, we (ADAM) would happily adjust our POMs and workflows to link against the correct Spark JAR for each Hadoop version we want to support.

that's hard to write in Maven

IANA Maven expert, but won't this just mean we'll have to have profiles for the version of Hadoop we want (hopefully very coarse-grained, e.g. -Phadoop1 and a default Hadoop-2.*)? I thought this was solvable and something Spark already does in cases where it's simply not possible/feasible to release one Spark JAR that will work in different runtimes.

Most importantly, said profiles only need exist in things that depend on Spark, e.g. ADAM; they wouldn't require any changes to Spark other than the publishing of artifacts to Maven that are already built and published to https://spark.apache.org/downloads.html.
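
A coarse-grained profile of that sort might look like the sketch below. It assumes the hypothetical Hadoop-specific Spark classifiers discussed upthread, which Spark does not actually publish; the point is only the shape of the change on the consumer side:

    <!-- sketch only: a -Phadoop1 profile that swaps both the Hadoop version
         and a hypothetical Hadoop-1 Spark classifier. Neither classifier
         exists today; the spark-core dependency would reference
         ${spark.classifier}. -->
    <properties>
      <hadoop.version>2.2.0</hadoop.version>
      <spark.classifier>hadoop2</spark.classifier>
    </properties>

    <profiles>
      <profile>
        <id>hadoop1</id>
        <properties>
          <hadoop.version>1.0.4</hadoop.version>
          <spark.classifier>hadoop1</spark.classifier>
        </properties>
      </profile>
    </profiles>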

@srowen

srowen commented Jun 3, 2015

I think apache/spark#6599 is a much easier way forward, if it really does do the trick, since it only has to last until 1.x goes away.

Yes, specifying dependency versions is vital in general. This makes it significantly harder to set the Hadoop version, since you are not only setting the Hadoop version but must also apply one Maven profile or the other to get the correct Spark version. Defaults won't even work. A hard solution to a problem that must be solved is better than no solution, but the PR above is much more innocuous.

@ryan-williams
Member Author

@srowen fair enough. We are also assuming that spark#6599 fixes the only incompatible Hadoop API call that we are going to run into here; I'll build a Spark 1.3.1 with spark#6599 patched in and run ADAM tests against it now, to check.

@ryan-williams
Member Author

I ran ADAM tests against Spark-1.3.1+#6599 and Hadoop 1.0.4; the crash reported above is fixed, but the same issue crops up elsewhere:

java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
    at org.bdgenomics.adam.io.SingleFastqInputFormat.createRecordReader(SingleFastqInputFormat.java:291)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:131)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This time it's ADAM code that is calling a method on a TaskAttemptContext: SingleFastqInputFormat.java:291, so the "null hypothesis" that Spark is fixed by spark#6599 remains unrefuted.

I'll try to work around the issue in that ADAM code path and see where we are then.

@ryan-williams
Member Author

OK @fnothaft, check out the 2nd commit here (853b45c): it creates a wrapper for TaskAttemptContext.setStatus in the style of spark#6599 and uses it in SingleFastqInputFormat.

This gets all the tests passing, locally at least, and when run against a Spark-1.3.1 with spark#6599 patched in.

Totally mysterious to me is why the same call in InterleavedFastqInputFormat is not causing tests to fail; without this hack, I saw SingleFastqInputFormatSuite, AlignmentRecordRDDFunctionsSuite, and ADAMContextSuite failing (full test failure output), but SingleFastqInputFormatSuite and InterleavedFastqInputFormatSuite look basically identical to me.
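
For readers who haven't opened 853b45c: the workaround amounts to something like the sketch below (a paraphrase, not the committed code; the committed helper is apparently named ADAMHadoopUtil, per the comments that follow). Calling setStatus reflectively means the compiled bytecode never commits to TaskAttemptContext being a class (Hadoop 1) or an interface (Hadoop 2):

    // Sketch of the spark#6599-style workaround, not the exact code in
    // 853b45c: resolve setStatus at runtime so no invokeinterface or
    // invokevirtual instruction is baked in against TaskAttemptContext.
    import java.lang.reflect.Method;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public final class TaskAttemptContextShim {
      public static void setStatus(TaskAttemptContext context, String status) {
        try {
          Method m = context.getClass().getMethod("setStatus", String.class);
          m.invoke(context, status);
        } catch (Exception e) {
          throw new RuntimeException("reflective setStatus call failed", e);
        }
      }
    }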

@massie
Member

massie commented Jun 3, 2015

@ryan-williams I think that wrapping the Hadoop Job using ContextUtil.getConfiguration(job) will do the trick here. The Hadoop 1 and 2 API incompatibilities included changing a class to an interface (sigh). The ContextUtil helper will use reflection to work around the mess.
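
For context, the parquet helper being suggested is used roughly like this (a minimal usage sketch; whether it would have covered the setStatus call at issue here is what gets sorted out in the next couple of comments):

    // Minimal usage sketch of the parquet helper massie mentions: it uses
    // reflection internally so callers don't bind at compile time to the
    // Hadoop 1 (class) vs. Hadoop 2 (interface) JobContext/TaskAttemptContext.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import parquet.hadoop.util.ContextUtil;

    public class ContextUtilExample {
      public static Configuration configurationOf(Job job) {
        return ContextUtil.getConfiguration(job);
      }
    }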

@ryan-williams
Member Author

@massie interesting, I was not aware of the ContextUtil you speak of; you're referring to parquet.hadoop.util.ContextUtil, already used elsewhere in ADAM, I'm inferring?

I'm afraid I don't completely follow how you're suggesting I use it here; will it more cleanly work around:

  1. the internal Spark use of TaskAttemptContext that was crashing at SparkHadoopMapRedUtil.scala:95?
  2. ADAM's use of TaskAttemptContext at SingleFastqInputFormat.java:291 (and also possibly InterleavedFastqInputFormat.java:314)?

If you can provide more details, I'm happy to try to use it instead; I believe there is a better way than what I've done here!

@massie
Member

massie commented Jun 3, 2015

I was referring to parquet.hadoop.util.ContextUtil. Your ADAMHadoopUtil looks like a better way to go. Sorry for the GitHub comment noise.

@ryan-williams
Member Author

ah ok, no worries, thanks for pointing that out to me, I am uninitiated to the world of Hadoop 1-vs-2 incompatibilities and workarounds thereof!

@fnothaft
Member

fnothaft commented Jun 3, 2015

@ryan-williams

This time it's ADAM code that is calling a method on a TaskAttemptContext: SingleFastqInputFormat.java:291, so the "null hypothesis" that Spark is fixed by spark#6599 remains unrefuted.

Just to confirm, when you built ADAM, you did provide the -Dhadoop.version=1.0.4 on the Maven command line? If you did not force the Hadoop version in the ADAM build, I would expect this error. If you do force the Hadoop version, then you should get a clean build.

@srowen

Yeah I don't know much about spark-ec2. I would assume it can / will work with Hadoop 2 if needed, especially since Hadoop 1 is getting so old.

Technically speaking, spark-ec2 does "support" Hadoop 2.0.0, but that's the only Hadoop 2 minor version it supports. Additionally, Hadoop MR is broken for Hadoop 2.0.0 in spark-ec2 (paging @tomwhite who can provide more details, if interested), which really severely limits the utility of spark-ec2's Hadoop 2.x support.

I think it's only moderately hard to add more Maven artifacts but it sure makes a mess -- Spark would deploy almost 100 artifacts per version.

Spark provides 7 pre-built versions, times two for the Scala 2.10/2.11 change, so it should be 14, not 100 artifacts, no? Perhaps I'm missing something.

Now you have to change the artifacts you even depend on based on the version of another thing -- that's hard to write in Maven.

Unless I'm misunderstanding what you're proposing, this is trivial to do in Maven via either profiles or classifiers.

Keep in mind that the problem we're discussing here only pops up if the project that is building against Spark also needs to build against Hadoop public interfaces (i.e., they define custom input formats). In most projects, you can use Spark as a provided dependency and just rely on the transitive Hadoop dependency and be fine. Since we depend on Hadoop anyways, we already have to parametrize our build for this. Adding a second level of parametrization (parametrize which version of Spark to pull down based on a desired Hadoop version) would be easy to layer on top.

TL;DR: exactly what @ryan-williams said:

IANA Maven expert, but won't this just mean we'll have to have profiles for the version of Hadoop we want (hopefully very coarse-grained, e.g. -Phadoop1 and a default Hadoop-2.*)? I thought this was solvable and something Spark already does in cases where it's simply not possible/feasible to release one Spark JAR that will work in different runtimes.

I agree pretty much verbatim with @ryan-williams's comment upthread.

It also exacerbates the Scala 2.10/2.11 problem for which there's really no other answer. This is why I can appreciate that anything reasonable to avoid this is worthwhile.

Doing split builds for Scala 2.10/2.11 purely in Maven is difficult because, by convention, Scala projects change the artifact name with a change in major Scala version, which Maven doesn't support. However, it's fairly straightforward to script these split builds up (e.g., see https://github.com/bigdatagenomics/adam/blob/master/scripts/release/release.sh).

@srowen

srowen commented Jun 3, 2015

I'm talking about Maven artifacts, which are quite a separate issue from the pre-built assemblies you are linking to. These are not at all the same thing. Spark has ~22 Maven modules, times 2 Scala versions. Times 2 Hadoop versions is 88!

"Just" having to make 2 profiles and always having to specify at least 1 of them on a command line seems like a lot of pain for a dependency to require. Definitely doable but not fun, even for projects that are willing to do it.

I'd definitely prefer to hear that a few tweaks just make this problem go away rather than bother with Hadoop-specific artifacts, which I doubt will happen. Why is it preferable to have this mess spill into the Maven artifacts? This seems a few steps down the list of backup plans.

@ryan-williams
Member Author

@fnothaft good question, I'll double-check that I was seeing that error on a -Dhadoop.version=1.0.4 ADAM.

@srowen

"Just" having to make 2 profiles and always having to specify at least 1 of them on a command line seems like a lot of pain for a dependency to require.

What would "always" have to be specified on the cmdline? IIUC, ADAM would default to Hadoop-2 artifacts for everything, and have a -Phadoop-1.0.4 profile instead of the -Dhadoop.version=1.0.4 we have today… and ADAM is a relatively complex case.

Most things that depend on Spark are probably OK supporting one Hadoop (major) version at a time; they would just link to those artifacts in their POM and be done with it.

Anyway, I agree that merging in spark#6599 is simpler in the short term, so I think we're just debating whether the Maven route is

  1. better in the long term, and
  2. better enough in the long term to warrant the cost of publishing more Maven artifacts.

Seems like @fnothaft and I definitely believe 1), while 2) is debatable, but that you are skeptical about 1) because of the complexity you think it will add to things that depend on Spark…?

@srowen

srowen commented Jun 3, 2015

The gotcha is that all default profiles get disabled once any profile is enabled. You can probably use a sys property to make this better I suppose.

I suspect that 1.x support will just be dropped before publishing 2x the artifacts happens, as this is much madness to support very few deployments. IMHO.
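
The sys-property variant would look something like this sketch ("hadoop.major" is a made-up property name for illustration); unlike activeByDefault, property-based activation isn't switched off when another profile is enabled on the command line:

    <!-- sketch of the sys-property activation idea; "hadoop.major" is a
         made-up property name. Property-based activation, unlike
         activeByDefault, keeps working when other profiles are enabled. -->
    <profile>
      <id>hadoop1</id>
      <activation>
        <property>
          <name>hadoop.major</name>
          <value>1</value>
        </property>
      </activation>
      <properties>
        <hadoop.version>1.0.4</hadoop.version>
      </properties>
    </profile>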

@tomwhite
Member

tomwhite commented Jun 4, 2015

To add to @fnothaft's comment about the spark-ec2 scripts: they are great for running Spark, but the version of MR that they install is old (1.x or 2.0, not the later 2.x versions) so it's probably best to avoid running MR jobs on a spark-ec2 cluster. ADAM doesn't have any MR jobs (although see #651), so it's not a problem.

Is there any reason not to drop Hadoop 1 support? For the spark-ec2 scripts HDFS installation, the Hadoop 2 version could be used. Projects like Crunch and Kite have dropped Hadoop 1 support, or are in the process of doing so.

@fnothaft
Copy link
Member

fnothaft commented Jun 4, 2015

@tomwhite
You'll need a MR cluster up if you're going to distcp data from S3, no? It just seems to me that if MR is broken on spark-ec2 Hadoop 2.0.0, we can't actually functionally use --hadoop-major-version=2 for spark-ec2.

@srowen

I'm talking about Maven artifacts, which are quite a separate issue from the pre-built assemblies you are linking to. These are not at all the same thing. Spark has ~22 Maven modules, times 2 Scala versions. Times 2 Hadoop versions is 88!

I know that they're separate. I don't know how Spark is doing their Maven releases (e.g., we use the OSS Sonatype Maven release flow, which may be very different from the Apache Maven release flow), but for us, releasing the parent project (adam-parent) is essentially a "one click" operation that releases all 4 of our modules. For us, even releasing Scala 2.10/2.11 artifacts is effectively a "one click" operation...

In any case, it sounds like there isn't interest in Spark distributing Maven artifacts that are separate for Hadoop 1.x and 2.x, so I'm going to drop this point. I suppose that the takeaway then is that we will drop Hadoop 1.x support in the next release of ADAM. We'll then need to shake out things downstream of that.

seems like a smooth transition, but unused AWS dep was pulling in a too-old jackson.core/jackson.annotations.

fixes bigdatagenomics#665
@ryan-williams
Member Author

I'm picking this back up since it sounds like we're cool with dropping Hadoop 1 support from ADAM. cc @heuermh who also expressed interest in this.

@heuermh
Member

heuermh commented Aug 3, 2015

Great!

The jackson dependency issues should have been resolved via #744. I'm not sure what to do about bigdatagenomics/utils#41.

@fnothaft
Member

fnothaft commented Aug 4, 2015

@ryan-williams @laserson @heuermh should we close this in favor of #753?

@laserson
Contributor

laserson commented Aug 4, 2015

sgtm

@ryan-williams ryan-williams deleted the spark-1.3.1 branch August 6, 2015 20:55