upgrade to Spark 1.3.1 #690
Conversation
some kind of POM error? Can't tell if it's my fault, doesn't seem like it is from what I can tell…
@@ -473,6 +473,16 @@
<artifactId>aws-java-sdk</artifactId>
Are we actually using this dependency right now? I don't think we are.
Might be preferable to just remove the dependency (as opposed to excluding things).
Force-pushed from 78b4f22 to 7d861cd
rebased
let me try removing that dep
Force-pushed from 7d861cd to debdb5d
removed the dep, and am excluding it from being brought in by utils-io. On second thought, that might not be the correct thing to do? Does utils-io actually use it?
I'm going to do what I said there and switch to just excluding the jackson things, not the aws-java-sdk, since utils-io does seem to use it.
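A rough sketch of that exclusion as it might look in the pom (the utils-io coordinates and the exact Jackson artifacts are assumptions, not copied from the diff):

<dependency>
  <!-- assumed coordinates for the utils-io module that drags in aws-java-sdk -->
  <groupId>org.bdgenomics.utils</groupId>
  <artifactId>utils-io_2.10</artifactId>
  <exclusions>
    <!-- keep aws-java-sdk, but drop the stale Jackson jars it pulls in -->
    <exclusion>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-annotations</artifactId>
    </exclusion>
  </exclusions>
</dependency>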
In other news I can't get a break with jenkins today! The only thing that looks like an error is:
Link. I've seen that a bunch of times today on different PRs, no idea what it means.
Force-pushed from debdb5d to 6525bfa
OK, I think this is a better version!
That sounds reasonable here. I've opened https://github.com/bigdatagenomics/utils/issues/41 to remove the AWS SDK from utils-io.
+1, agreed!
If the code that uses it in utils-io …
@ryan-williams yeah, that's what I was meaning. The …
What does it mean?
Jenkins, retest this please.
@ryan-williams I don't think the error you're seeing in:
Is the error. I'm seeing:
This is only for the Hadoop 1.0.4 builds. I'm wondering if this is an issue in Spark?
I'm going to check this out locally and run with 1.0.4 and see if this failure reproduces.
Ahhhh... this fails locally (with the error Jenkins is reporting) for me when running:
Surely, they didn't break Hadoop 1.0.4 support in Spark 1.3.1, did they?
@ryan-williams I'm hoping to cut a 0.17.0 release tomorrow. Do you want to check and see if you can resolve the Spark 1.3.1/Hadoop 1.0.4 issue by tomorrow? If you can't, shall we slip this to 0.18.0?
good call, thanks for catching. I'll see what I can do, though not sure how much time I'll have :-\
No worries. We can take more of a hack at this later.
So, I am tentatively thinking that depending on Spark 1.3.1 rules out building against Hadoop 1.0.4. This user@spark thread is relevant, and I just responded cc'ing @fnothaft, explaining the failure we're seeing and linking to this PR. Spark 1.3.1 is "provided" and not bringing in the Hadoop 2.2 dependency that is hard-coded into it, which means that any Spark code paths we exercise run the risk of using bits of Hadoop 2 that might not have existed in a compatible way in Hadoop 1.0.4. I suspect there's a way to shade Hadoop 2.2 into Spark so that this would work for us, but I assume that would have to be done in Spark's POM; I've not seen Maven functionality for shading an arbitrary version of a dep's dep into said dep, if that makes sense. Curious to hear others' thoughts! How bad is the prospect of dropping Hadoop 1.0.4 support?
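For reference, a bare-bones sketch of what shading looks like in Maven; the point is that this configuration lives in the POM of the project being shaded (i.e. it would have to go into Spark's build, not a consumer's), and the relocation pattern below is purely illustrative:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- illustrative only: bundle Hadoop classes into the jar under a renamed package -->
          <relocation>
            <pattern>org.apache.hadoop</pattern>
            <shadedPattern>org.spark_project.hadoop</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>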
Yeah... I am moderately frustrated here. Either we're doing something wrong with how we're building against Spark 1.3.1, or their testing infrastructure is really weak. While it would be a lie to say that running on Hadoop 1.0.4 has always been seamless, it's generally worked OK. What is possible is that we're running into a "build-only" problem. E.g., it is possible that the Maven artifact for Spark …
We run a matrix build in Jenkins that sweeps the Hadoop version across 1.0.4, 2.2.0, 2.3.0. I believe this is sufficient for testing; it did catch the issue we would've run into here, and has caught similar past issues. Perhaps I'm misunderstanding what you're getting at?
I don't think there's a straightforward way to do that. I think the best solution would be a fix to how Spark distributes resources. E.g., Spark provides pre-built artifacts for different Hadoop versions. While they're building these artifacts, they could also deploy those same artifacts to Maven Central, with a classifier that corresponds to the Hadoop version.
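If Spark did publish per-Hadoop artifacts to Maven Central, a consumer could then select one with a classifier, roughly like this (hypothetical, since no such classifier exists for Spark today):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.3.1</version>
  <!-- hypothetical classifier; Spark does not currently publish these to Maven -->
  <classifier>hadoop1</classifier>
  <scope>provided</scope>
</dependency>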
I think we have the same understanding though we may have talked past each other a bit.
Exactly. @srowen basically asserts this in his first email on the thread I linked previously (I can't link directly to it on that site; here it is on apache.org):
So Spark is deliberately only publishing artifacts built for one version of Hadoop at a time, though they test against a matrix similar to ADAM's, I think. OTOH, Spark and ADAM both follow the common convention of actually shipping separate artifacts for Scala 2.10 and 2.11. Finally, to your implication that Spark doesn't run tests for multiple versions of Hadoop, I also didn't think that was the case, but I can't find evidence to the contrary poking around Spark PRs and Jenkins setup; could be a reasonable other question to ask if we ever get a response on the aforelinked thread.
From the little I know of Maven and … The best case is that there is an already-published … On that note, I'm curious about where you found the http://d3kbcqa49mib13.cloudfront.net/spark-1.3.1-bin-hadoop1.tgz you linked to?
@ryan-williams
Yes, agreed. I think my point was unclear in my original comment, but what I meant for this comment to illustrate is why we needed an explicit dependency on Hadoop in our pom.
Our unit tests are running using the Maven artifacts for Spark 1.3.1, with Hadoop excluded. We then have an explicit dependency on Hadoop version 1.0.4, so that we can build the correct input/output formats. The Maven artifacts are relevant because this error is coming up during a unit test run. Although the …
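Roughly, that arrangement looks like the following in a pom (the exact hadoop-client exclusion and the hadoop.version property are assumptions about the shape of the build, not copied from it):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.3.1</version>
  <scope>provided</scope>
  <exclusions>
    <!-- keep Spark from pulling in the Hadoop version it was built against -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <!-- explicit Hadoop dependency, swept across 1.0.4 / 2.2.0 / 2.3.0 in CI -->
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>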
Historically, excluding Hadoop from Spark and then explicitly setting the Hadoop version was OK and would build and run unit tests fine. I think the problem we're running into here isn't an actual problem when we run ADAM as an app; I think it is solely a build problem that causes our unit tests to fail. I'm guessing that we could fix the issue by moving Spark from a …
I'm in agreement here. I don't think we're seeing a problem that would manifest during deployment (since the dependencies are provided); rather, I think we're seeing a build issue (that causes our unit tests to fail on Hadoop 1.0.4). That being said, I think that the way we'd need to fix it downstream (in ADAM, that is) is really ugly (the above solution with …
Yeah, so you're making an ad-hoc "deployment" from Maven artifacts for tests. As far as I know that was never guaranteed to work and would surprise me if it had been across 1.x and 2.x at the same time. Maybe it very nearly does, and theory aside, a little change might make it still happen to work and that's worth it. It's brittle though and I expect that 1.x support will just go away more readily than 2-3x more artifacts are published for this. No idea about spark-ec2. It's a different rhetorical question, but that's such an old Hadoop version, is it important?
I don't know if it's ever been "guaranteed" to work, but we've done this split Hadoop 1.0.4/2.2.0/2.3.0 build for CI without issues since pre-Spark 1.0.0 IIRC.
The reason I mention spark-ec2 is … The reason I suggest publishing more artifacts is that Spark is publishing these artifacts already (to the Spark download page); it just isn't publishing them to Maven. It seems to me like this should be a fairly seamless change to make, but:
Anyways, Spark isn't going to retroactively push Spark 1.3.1 artifacts to Maven with a selector for Hadoop 1.x, so this discussion is somewhat hypothetical. On a more concrete level, it sounds like our choices are:
I should probably direct comment to apache/spark#6599 since here I'm just speaking for myself, and that's not necessarily representative. I can see the very practical argument that a small fix is worth it to preserve some existing behavior even if not guaranteed; my only open question is whether that's really all this needs, but it may be so.

Yeah I don't know much about spark-ec2. I would assume it can / will work with Hadoop 2 if needed, especially since Hadoop 1 is getting so old.

I think it's only moderately hard to add more Maven artifacts but it sure makes a mess -- Spark would deploy almost 100 artifacts per version. Now you have to change the artifacts you even depend on based on the version of another thing -- that's hard to write in Maven. It also exacerbates the Scala 2.10/2.11 problem for which there's really no other answer. This is why I can appreciate that anything reasonable to avoid this is worthwhile.

I speak for myself when I say I think 1.x support is going away, but I do think it's true in the medium term. It doesn't get much usage; 2.2 is already the default; old YARN support already went away.
@srowen your point makes sense that we are:
I agree that reflection whack-a-mole in Spark to try to make all Hadoop API calls {1,2}-cross-compatible is not a good solution, even if it is just the one call in spark#6599 that doesn't work today. However, I'm curious about your hesitance to just fix this in Maven:
This is true, but the "another thing" in question (i.e. Hadoop version) affects whether programs will execute successfully or not; such a thing seems absolutely worth specifying explicitly, to me. Our (ADAM's) problems here start with 1. above: we have no choice but to link against a Spark built for Hadoop 2 (or use Maven in a brittler/messier way, e.g. manually downloading Spark artifacts as system deps); given the choice, we (ADAM) would happily adjust our POMs and workflows to link against the correct Spark JAR for each Hadoop version we want to support.
I am not a Maven expert, but won't this just mean we'll have to have profiles for the version of Hadoop we want (hopefully very coarse-grained, e.g. …)? Most importantly, said profiles need only exist in things that depend on Spark, e.g. ADAM; they wouldn't require any changes to Spark other than the publishing of artifacts to Maven that are already built and published to https://spark.apache.org/downloads.html.
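A rough sketch of such a profile setup, assuming Spark published hadoop1/hadoop2 flavors to Maven; the profile names, properties, and classifier are made up for illustration, with the spark-core dependency then referencing ${spark.classifier}:

<profiles>
  <profile>
    <id>hadoop2</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <hadoop.version>2.2.0</hadoop.version>
      <!-- hypothetical classifier, applied to the spark-core dependency -->
      <spark.classifier>hadoop2</spark.classifier>
    </properties>
  </profile>
  <profile>
    <id>hadoop1</id>
    <properties>
      <hadoop.version>1.0.4</hadoop.version>
      <spark.classifier>hadoop1</spark.classifier>
    </properties>
  </profile>
</profiles>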
I think apache/spark#6599 is a much easier way forward, if it really does do the trick, since it only has to last until 1.x goes away. Yes, specifying dependency versions is vital in general. This makes it significantly harder to set the Hadoop version, since you are not only setting the Hadoop version, but must apply one Maven profile or the other to get the correct Spark version. Defaults won't even work. A hard solution to a problem that must be solved is better than no solution, but the PR above is much more innocuous.
@srowen fair enough. We are also assuming that spark#6599 fixes the only incompatible Hadoop API call that we are going to run into here; I'll build a Spark 1.3.1 with spark#6599 patched in and run ADAM tests against it now, to check.
I ran ADAM tests against Spark 1.3.1 + spark#6599 and Hadoop 1.0.4; the crash reported above is fixed, but the same issue crops up elsewhere:
This time it's ADAM code that is calling a method on a … I'll try to work around the issue in that ADAM code path and see where we are then.
6525bfa
to
853b45c
Compare
OK @fnothaft, check out the 2nd commit here (853b45c): it creates a wrapper for … This gets all the tests passing, locally at least, and when run against a Spark-1.3.1 with spark#6599 patched in. Totally mysterious to me is why the same call in …
@ryan-williams I think that wrapping the Hadoop …
@massie interesting, I was not aware of the … I'm afraid I don't completely follow how you're suggesting I use it here; will it more cleanly work around:
If you can provide more details I'm happy to try to use it instead; I believe that there is a better way than what I've done here!
I was referring to …
ah ok, no worries, thanks for pointing that out to me, I am uninitiated to the world of Hadoop 1-vs-2 incompatibilities and workarounds thereof!
Just to confirm, when you built ADAM, you did provide the …
Technically speaking, …
Spark provides 7 pre-built versions, times two for the Scala 2.10/2.11 change, so it should be 14, not 100 artifacts, no? Perhaps I'm missing something.
Unless I'm misunderstanding what you're proposing, this is trivial to do in Maven via either profiles or classifiers. Keep in mind that the problem we're discussing here only pops up if the project that is building against Spark also needs to build against Hadoop public interfaces (i.e., they define custom input formats). In most projects, you can use Spark as a provided dependency and just rely on the transitive Hadoop dependency and be fine. Since we depend on Hadoop anyways, we already have to parametrize our build for this. Adding a second level of parametrization (parametrize which version of Spark to pull down based on a desired Hadoop version) would be easy to layer on top. TL;DR: exactly what @ryan-williams said:
I agree pretty much verbatim with @ryan-williams's comment upthread.
Doing split builds for Scala 2.10/2.11 purely in Maven is difficult because, by convention, Scala projects change the artifact name with a change in major Scala version, which Maven doesn't support. However, it's fairly straightforward to script these split builds up (e.g., see https://github.com/bigdatagenomics/adam/blob/master/scripts/release/release.sh).
I'm talking about Maven artifacts, which are quite a separate issue from the pre-built assemblies you are linking to. These are not at all the same thing. Spark has ~22 Maven modules, times 2 Scala versions. Times 2 Hadoop versions is 88! "Just" having to make 2 profiles and always having to specify at least 1 of them on a command line seems like a lot of pain for a dependency to require. Definitely doable but not fun, even for projects that are willing to do it. I'd definitely prefer to hear that a few tweaks just make this problem go away rather than bother with Hadoop-specific artifacts, which I doubt will happen. Why is it preferable to have this mess spill into the Maven artifacts? This seems a few steps down the list of backup plans.
@fnothaft good question, I'll double-check that I was seeing that error on a …
What would "always" have to be specified on the cmdline? IIUC, ADAM would default to Hadoop-2 artifacts for everything, and have a … Most things that depend on Spark are probably OK supporting one Hadoop (major) version at a time; they would just link to those artifacts in their POM and be done with it. Anyway, I agree that merging in spark#6599 is simpler in the short term, so I think we're just debating whether the Maven route is …
Seems like @fnothaft and I definitely believe 1), while 2) is debatable, but that you are skeptical about 1) because of the complexity you think it will add to things that depend on Spark…?
The gotcha is that all default profiles get disabled once any profile is explicitly activated. I suspect that 1.x support will just be dropped before publishing 2x the artifacts.
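For what it's worth, one way around that gotcha is to activate the non-default profile off a property instead of relying on activeByDefault; a minimal sketch (the hadoop.major property name is assumed):

<profile>
  <id>hadoop1</id>
  <activation>
    <!-- activated by passing -Dhadoop.major=1, rather than by being a default -->
    <property>
      <name>hadoop.major</name>
      <value>1</value>
    </property>
  </activation>
  <properties>
    <hadoop.version>1.0.4</hadoop.version>
  </properties>
</profile>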
To add to @fnothaft's comment about the spark-ec2 scripts: they are great for running Spark, but the version of MR that they install is old (1.x or 2.0, not the later 2.x versions) so it's probably best to avoid running MR jobs on a spark-ec2 cluster. ADAM doesn't have any MR jobs (although see #651), so it's not a problem. Is there any reason not to drop Hadoop 1 support? For the spark-ec2 scripts HDFS installation, the Hadoop 2 version could be used. Projects like Crunch and Kite have dropped Hadoop 1 support, or are in the process of doing so.
@tomwhite
I know that they're separate. I don't know how Spark is doing their Maven releases (e.g., we use the OSS Sonatype Maven release flow, which may be very different from the Apache Maven release flow), but for us, releasing the parent project (…). In any case, it sounds like there isn't interest in Spark distributing Maven artifacts that are separate for Hadoop 1.x and 2.x, so I'm going to drop this point. I suppose that the takeaway then is that we will drop Hadoop 1.x support in the next release of ADAM. We'll then need to shake out things downstream of that.
seems like a smooth transition, but unused AWS dep was pulling in a too-old jackson.core/jackson.annotations. fixes bigdatagenomics#665
Force-pushed from 853b45c to cf85dbd
I'm picking this back up since it sounds like we're cool with dropping Hadoop 1 support from ADAM. cc @heuermh who also expressed interest in this.
Force-pushed from cf85dbd to a2ce3bd
Great! The jackson dependency issues should have been resolved via #744. I'm not sure what to do about bigdatagenomics/utils#41.
@ryan-williams @laserson @heuermh should we close this in favor of #753?
sgtm
seems smooth, but a couple old jackson deps needed excluding
fixes #665