SPARK-2978. Transformation with MR shuffle semantics #2274
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
@@ -514,6 +514,30 @@ def __add__(self, other):
            raise TypeError
        return self.union(other)

    def repartitionAndSortWithinPartition(self, ascending=True, numPartitions=None,
                                          partitionFunc=portable_hash, keyfunc=lambda x: x):
How about rearranging the parameters to follow the order in the function name? For example:
    repartitionAndSortWithinPartition(self, numPartitions=None, partitionFunc=portable_hash,
                                      ascending=True, keyfunc=lambda x: x)
a1ef807 to 423650a
Updated patch removes the Python version, adds a Java version, and adds some additional docs.
Just a nit: it should probably be called repartitionAndSortWithinPartitions, with an "s" at the end. Also, this name is pretty long. Another one I'd reconsider is

Finally, I think it should be a policy to add all these APIs to Python and implement them there too. Basically there are two options. If you're doing this to support a slightly easier transition from MR jobs but you don't want to do it in Python, you could just have it as a document, an example, or maybe even a third-party package that takes a Hadoop JobConf and runs it on Spark. But if you want it in Spark, we need to put it in each language. The reason is to allow people to easily read code in one supported language and run it in others; it's always disappointing when some operators turn out to be missing in yours.
The reason to add this is that it is a smaller API that we can support (both source and binary compatibility) in the long run before finalizing ShuffledRDD, which has been in flux and has changed in multiple past releases. Perhaps we can mark this new API as DeveloperApi but commit to maintaining it. What do you think? The name is long, but I'm worried that repartitionWithSort implies the data are sorted globally.
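To illustrate the naming concern, here is a minimal plain-Python sketch of the semantics under discussion: records are routed to partitions by key, and each partition is then sorted by key on its own. This is only a model, not Spark's implementation. The function name, the `partition_func` default, and the sample data are assumptions made for illustration; the parameter order follows the suggestion made earlier in this review.

```python
def repartition_and_sort_within_partitions(pairs, num_partitions,
                                           partition_func=hash,
                                           ascending=True,
                                           keyfunc=lambda k: k):
    """Hypothetical model of the semantics; not the code from this patch."""
    # Step 1: route each (key, value) record to a partition based on its key.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    # Step 2: sort each partition independently by key.
    for part in partitions:
        part.sort(key=lambda kv: keyfunc(kv[0]), reverse=not ascending)
    return partitions


# Sample data (made up); an identity partition_func keeps the routing easy to follow.
pairs = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c"), (0, "z")]
parts = repartition_and_sort_within_partitions(pairs, 2, partition_func=lambda k: k)

# Keys in partition order: each partition is sorted, the concatenation is not.
flattened = [k for part in parts for k, _ in part]
```

Each partition comes back ordered by key, but reading the partitions one after another does not yield a globally sorted sequence, which is why a name that spells out "within partitions" is less misleading than one that only mentions sorting.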
   * because it can push the sorting down into the shuffle machinery.
   */
  def repartitionAndSortWithinPartition(partitioner: Partitioner)
      : RDD[(K, V)] = {
You can put this on the previous line.
Ah, I see. Then we can add it, but in that case I'd also add it in Python.
15b2f90 to 1340d75
Updated patch adds Python back in and adds the "s" at the end.
Thanks, Sandy. Can you add a unit test in Java to make sure it is callable from Java?
        self.assertRaises(ValueError, lambda: rdd.countApproxDistinct(0.00000001))
        self.assertRaises(ValueError, lambda: rdd.countApproxDistinct(0.5))
Were these removed by accident during merging?
Yup, my bad
f249f74 to c04b447
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
QA tests have started for PR 2274 at commit
QA tests have finished for PR 2274 at commit
LGTM, thanks.
Thanks Sandy! I've merged this.
I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful.