-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action #18277
Conversation
@@ -751,7 +751,7 @@ def func(iterator): | |||
|
|||
def pipe_objs(out): | |||
for obj in iterator: | |||
s = str(obj).rstrip('\n') + '\n' | |||
s = unicode(obj).rstrip('\n') + '\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a small test to validate this.
When you try to do this on a rdd of array of unicode string. The result of Python2 looks a bit weird.
|
Well, the difference comes from repr()'s divergent default behaviors between Python2 and Python3. And the previous code does no better than the patched one but causing troubles while processing unicode strings. On the other hand, pipe() action involved implicit serialization from any type to bytes by its definition, so IMHO the application itself should take care of consistent serialization/deserialization of data before/after pipe() action, IF it wants to always get the same behavior in different environments. |
how do I apply the patch? |
What do you think @HyukjinKwon ? I think this is probably a reasonable fix, but we might break some peoples code who have been depending on the bug. |
it seems okay without a close look. Let me take the close look if I can take the look first soon. |
ok to test |
Test build #85574 has finished for PR 18277 at commit
|
Jenkins OK to test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tentatively LGTM provided no other objections. I think this is OK but we should create a JIRA for us to document this in the release notes as a possible breaking change (it improves correctness but some folks may have written code that depends on the old behaviour). What do you think @HyukjinKwon & @viirya?
retest this please. |
This change looks reasonable to me for now. But I'm also concerned about the behavior change. A note into release notes should be good or maybe we need a note at migration guide in |
Test build #86375 has finished for PR 18277 at commit
|
Wanted to make a clarification on what we will change here to myself because it's quite confusing to me. In Python 3, it's declared above In Python 2, Before:
When
When
When
After:
When
When
When
(As a note for sure, UTF8 is ascii compatible) |
So, after this change, we will get rid of system default roundtrip in When In case of When So, LGTM but I want a double check from you @holdenk and @viirya if I missed anything. |
cc @ueshin too. I think we were in several PRs related with encoding / decoding stuff. |
@@ -751,7 +751,7 @@ def func(iterator): | |||
|
|||
def pipe_objs(out): | |||
for obj in iterator: | |||
s = str(obj).rstrip('\n') + '\n' | |||
s = unicode(obj).rstrip('\n') + '\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chaoslawful, if you are active, we could change \n
to u\n
to reduce the confusion and don't rely on the implicit conversion between str
and unicode
.
Let me merge this one in few days if there's no more comments. |
Let me merge this one only into master considering the concerns - #18277 (review) and #18277 (comment). Adding a note / backporting to branch-2.3 could be fine. I don't feel strongly about it. |
retest this please |
Test build #86450 has finished for PR 18277 at commit
|
Merged to master. |
What changes were proposed in this pull request?
Pipe action convert objects into strings using a way that was affected by the default encoding setting of Python environment.
This patch fixed the problem. The detailed description is added here:
https://issues.apache.org/jira/browse/SPARK-20947
How was this patch tested?
Run the following statement in pyspark-shell, and it will NOT raise exception if this patch is applied: