[SPARK-1690] Tolerating empty elements when saving Python RDD to text files #644
Conversation
Can one of the admins verify this patch?
Manually verified the patch on a file with empty lines at the beginning, middle, or end. Also tested an empty file and a file containing only empty lines.
Can you add a test case for this? What file was it breaking on?
Any text file with empty lines in it will break, like Glenn reported in the JIRA - a file consists of
@mateiz just realized I could test it from the Python side. Added a doctest. This makes the Python API behave identically to the Scala API.
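The doctest's intent can be sketched without pyspark: when an RDD is saved as text, each element becomes one newline-terminated line, and loading must restore every element, empty strings included. The helper names below are illustrative stand-ins, not Spark APIs (the real test goes through saveAsTextFile and textFile).

```python
# Pyspark-free sketch of the round trip the added doctest checks.
# save_as_text / load_text are hypothetical stand-ins for
# saveAsTextFile / textFile.

def save_as_text(elements):
    # each element becomes one newline-terminated line
    return "".join(e + "\n" for e in elements)

def load_text(text):
    # splitting on newlines restores every element, empty ones included
    return text.split("\n")[:-1]

data = ["foo", "", "bar", ""]
assert load_text(save_as_text(data)) == data
```

The point of the fix is that the empty strings in `data` survive the round trip instead of being dropped.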
Jenkins, test this please. |
Merged build triggered. |
Merged build started. |
@@ -94,6 +94,7 @@ private[spark] class PythonRDD[T: ClassTag](
        val obj = new Array[Byte](length)
        stream.readFully(obj)
        obj
+     case 0 => Array.empty[Byte]
Looks good, though you could probably just change the if length > 0 above to if length >= 0.
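The reviewer's suggestion can be illustrated with a small, pyspark-free sketch of a length-prefixed framing like the one PythonRDD reads (the function and buffer layout here are simplified assumptions, not the actual Spark code): accepting length >= 0 means a zero-length header yields an empty element instead of being dropped.

```python
import io
import struct

def read_elements(stream):
    """Read length-prefixed byte elements from a stream.

    A length of 0 yields an empty element instead of being skipped,
    which is the behavior this PR fixes. (Illustrative sketch only;
    the real protocol also uses negative lengths as control codes.)
    """
    out = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # end of stream
        (length,) = struct.unpack(">i", header)
        if length >= 0:  # >= 0, per the review comment, covers empty elements
            out.append(stream.read(length))
    return out

buf = io.BytesIO(
    struct.pack(">i", 3) + b"foo"
    + struct.pack(">i", 0)          # an empty element
    + struct.pack(">i", 3) + b"bar"
)
assert read_elements(buf) == [b"foo", b"", b"bar"]
```

With the stricter length > 0 check, the middle element would vanish and only [b"foo", b"bar"] would come back.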
Merged build finished. All automated tests passed.
Okay I'll pull this in. Thanks! |
[SPARK-1690] Tolerating empty elements when saving Python RDD to text files

Tolerate empty strings in PythonRDD

Author: Kan Zhang <kzhang@apache.org>

Closes #644 from kanzhang/SPARK-1690 and squashes the following commits:

c62ad33 [Kan Zhang] Adding Python doctest
473ec4b [Kan Zhang] [SPARK-1690] Tolerating empty elements when saving Python RDD to text files

(cherry picked from commit 6c2691d)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>