[SPARK-2075][Core] Make the compiler generate same byte code for Hadoop 1.+ and Hadoop 2.+ #3740
Conversation
Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD
Test build #24620 has started for PR 3740 at commit
Test build #24620 has finished for PR 3740 at commit
Test PASSed.
Now, for the other code in Spark core, Scala will generate the same byte code for Hadoop 1.+ and Hadoop 2.+.
Test build #24624 has started for PR 3740 at commit
Test build #24624 has finished for PR 3740 at commit
Test PASSed.
```scala
// Provide an Ordering for NullWritable so that the compiler will generate the same code.
implicit val nullWritableOrdering = new Ordering[NullWritable] {
  override def compare(x: NullWritable, y: NullWritable): Int = 0
}
this.map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```
Is the problem here that when compiling against Hadoop 2, the compiler chooses to specify the Ordering on the implicit rddToPairRDDFunctions, while in Hadoop 1 it instead uses the default argument (null) to invoke the implicit?
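For what it's worth, here is a minimal, hypothetical model of those two resolution paths. The names `PairOps`/`toPairOps` are invented, not Spark's API; only the `implicit ord: Ordering[K] = null` shape mirrors `rddToPairRDDFunctions`:

```scala
import scala.language.implicitConversions

object OrderingResolutionDemo {
  class PairOps[K](val ord: Ordering[K])

  // Like rddToPairRDDFunctions, the implicit Ordering defaults to null, so
  // the compiler can satisfy it two ways: pass an Ordering found in scope,
  // or fall back to the default argument. The two choices emit different
  // bytecode at the call site.
  implicit def toPairOps[K, V](pairs: Seq[(K, V)])(
      implicit ord: Ordering[K] = null): PairOps[K] = new PairOps(ord)

  def main(args: Array[String]): Unit = {
    val resolved: PairOps[Int] = Seq((1, "a"))               // Ordering[Int] is in scope
    val defaulted: PairOps[Object] = Seq((new Object, "a"))  // no Ordering[Object]
    println(resolved.ord != null)   // true: the implicit was passed explicitly
    println(defaulted.ord == null)  // true: the default (null) was used
  }
}
```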
I wonder if a more explicit solution, like introducing a conversion to PairRDDFunctions that takes an Ordering, is warranted for these cases, e.g.:

```scala
this.map(x => (NullWritable.get(), new Text(x.toString)))
  .toPairRDD(nullWritableOrdering)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```

This would be less magical than relying on the definition of an implicit Ordering to change the bytecode.
Right. An explicit solution is better for such a tricky issue.
Test build #24635 has started for PR 3740 at commit
Test build #24635 has finished for PR 3740 at commit
Test PASSed.
@rxin I'm not a core maintainer, and also at this point it's mainly an issue of how to style this so it is clear to future readers, so could you take a look?
```scala
val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
val textClassTag = implicitly[ClassTag[Text]]
val r = this.map(x => (NullWritable.get(), new Text(x.toString)))
```
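For context, a sketch of how the snippet above presumably continues, applying the conversion by hand. This assumes the conversion's shape is `rddToPairRDDFunctions(rdd)(kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K])`; it is a sketch, not the verbatim merged code:

```scala
// Passing the implicits explicitly pins down resolution, so Hadoop 1.x and
// 2.x builds compile this call identically. null matches the Ordering
// parameter's default value.
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
```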
Just noticed we can reuse the Text object here to reduce GC. Anyway, that's not part of this PR - would you be willing to submit a new PR for that?
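For reference, a hedged sketch of what that follow-up might look like (the mapPartitions-based reuse is my assumption, not code from this PR):

```scala
// Sketch: allocate one Text per partition and mutate it with set(), rather
// than allocating a new Text per record. This is safe as long as the record
// writer fully serializes each pair before the next one overwrites the buffer.
val r = this.mapPartitions { iter =>
  val text = new Text()
  iter.map { x =>
    text.set(x.toString)
    (NullWritable.get(), text)
  }
}
```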
OK. I'll send another PR for that after this one is merged.
Test build #24670 has started for PR 3740 at commit
Test build #24670 has finished for PR 3740 at commit
Test PASSed.
Thanks - merging in master & branch-1.2.
[SPARK-2075][Core] Make the compiler generate same byte code for Hadoop 1.+ and Hadoop 2.+

`NullWritable` is a `Comparable` rather than a `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. It will generate different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, here we provide an Ordering for NullWritable so that the compiler will generate the same code.

I used the following commands to confirm the generated byte code is the same:

```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt
diff ~/hadoop1.txt ~/hadoop2.txt
```

However, the compiler will generate different code for the classes which call methods of `JobContext`/`TaskAttemptContext`: `JobContext`/`TaskAttemptContext` is a class in Hadoop 1.+, so calling its methods uses `invokevirtual`, while it's an interface in Hadoop 2.+, where calls use `invokeinterface`. To fix it, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`.

Author: zsxwing <zsxwing@gmail.com>

Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits:

39d9df2 [zsxwing] Fix the code style
e4ad8b5 [zsxwing] Use null for the implicit Ordering
734bac9 [zsxwing] Explicitly set the implicit parameters
ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration
fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD

(cherry picked from commit 6ee6aa7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
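As a sketch of that reflection workaround (the `HadoopCompat` name is invented here; Spark's actual helper may live elsewhere and differ in detail):

```scala
import java.lang.reflect.Method

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobContext

// A direct context.getConfiguration call compiles to invokevirtual on
// Hadoop 1.x (JobContext is a class) but invokeinterface on Hadoop 2.x
// (JobContext is an interface). Going through Method.invoke hides that
// difference behind reflection, so both builds emit the same bytecode.
object HadoopCompat {
  private val getConfigurationMethod: Method =
    classOf[JobContext].getMethod("getConfiguration")

  def getConfiguration(context: JobContext): Configuration =
    getConfigurationMethod.invoke(context).asInstanceOf[Configuration]
}
```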
Need to backport this one to branch-1.2, as the implicit API fixes do not exist in branch-1.2.
See #3758
Thanks - I will merge the new PR once tests pass.
backport #3740 for branch-1.2

Author: zsxwing <zsxwing@gmail.com>

Closes #3758 from zsxwing/SPARK-2075-branch-1.2 and squashes the following commits:

b57d440 [zsxwing] SPARK-2075 backport for branch-1.2
I learned a lot reviewing this PR, thanks!