[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark #19339
Conversation
@HyukjinKwon @viirya Could you review this PR? Thanks! :)
```scala
   * @since 2.2.0
   */
  @deprecated("Use csv(Dataset[String]) instead.", "2.2.0")
  def csv(csvRDD: RDD[String]): DataFrame = {
```
Wait ... I think we shouldn't introduce an RDD API on the Scala side. I was thinking of doing this within the Python side, or maybe adding a private wrapper on the Scala side if required .. Will take a closer look tomorrow (KST).
Thanks for your review :)
Umm.. I followed `spark.read.json`'s way of adding them. Although `json(jsonRDD: RDD[String])` has been deprecated, PySpark still uses it to create a `DataFrame`. I think adding a private wrapper in Scala may be better, because not only PySpark but also SparkR may need it.
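For context, a rough sketch of the `json` code path being referenced, as it looked in the 2.2-era `readwriter.py` (quoted from memory, so details may differ):

```python
# Inside DataFrameReader.json, when `path` is an RDD of JSON strings:
keyed = path.mapPartitions(func)   # func encodes each record to UTF-8 bytes
keyed._bypass_serializer = True    # hand raw bytes to the JVM, skipping pickling
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())  # decode bytes to Java Strings
return self._df(self._jreader.json(jrdd))  # calls the deprecated json(RDD[String])
```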
Yep. +1 for @HyukjinKwon's advice. We cannot add a deprecated method which doesn't exist in 2.2.0 at all.
Yeah... It's weird to add a deprecated method. :) We should either add a special wrapper for this purpose or do this on the Python side, if possible and not complicated.
ok to test
Test build #82148 has finished for PR 19339 at commit.
```python
        yield x
keyed = path.mapPartitions(func)
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
```
I tried a way within Python and this seems to work:

```diff
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 1ed452d895b..0f54065b3ee 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -438,7 +438,10 @@ class DataFrameReader(OptionUtils):
             keyed = path.mapPartitions(func)
             keyed._bypass_serializer = True
             jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
-            return self._df(self._jreader.csv(jrdd))
+            jdataset = self._spark._jsqlContext.createDataset(
+                jrdd.rdd(),
+                self._spark._sc._jvm.Encoders.STRING())
+            return self._df(self._jreader.csv(jdataset))
         else:
             raise TypeError("path can be only string, list or RDD")
```
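For context, the user-visible behavior this change targets — a minimal sketch assuming a standard session (`csv_rdd` is an illustrative name, not from the PR):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each element of the RDD is a single CSV record.
csv_rdd = spark.sparkContext.parallelize(["Tom,20", "Alice,25"])

df = spark.read.csv(csv_rdd)  # after the patch, RDD[str] input is accepted
df.show()
# Without a schema or inferSchema, columns default to string-typed _c0, _c1.
```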
@goldmedal, it'd be great if you could double check whether this really works and whether it can be made shorter or cleaner. This was just my rough try only to reach the goal, so I am not sure if it is the best way.
OK, this way looks good. I'll try it. Thanks for your suggestion.
@HyukjinKwon I think your way works fine after fixing a variable name bug.
As it relies on a deprecated API, I think it would also be good to replace PySpark's json to use Dataset. But I think that is better done in another PR.
Yea, let's do it separately.
OK, so maybe I should create another JIRA for this issue?
Test build #82195 has finished for PR 19339 at commit.
Hm, to me .. I'd actually leave it for now. I am less sure we should fix it now, as we could sweep it out later when we remove the deprecated ones together, and, for the current status, it does not actually cause any problem for now, e.g., a build warning, if I understood correctly. I won't stand against it, but I don't think I support it. Let's go ahead with this one first.
This is so weird. I ran it fine using Python 3.5.2 but it seems to have some problem with Python 3.4. Let me try Python 3.4 locally.
LGTM
python/pyspark/sql/readwriter.py (outdated)
```python
keyed = path.mapPartitions(func)
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
jdataset = self._spark._ssql_ctx.createDataset(
```
Let's add a small comment here to explain why we should create the dataset (which could look a bit weird in PySpark I believe).
python/pyspark/sql/readwriter.py (outdated)
```python
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
jdataset = self._spark._ssql_ctx.createDataset(
    jrdd.rdd(),
    self._spark._sc._jvm.Encoders.STRING())
```
Could we replace `_spark._sc._jvm` with `_spark._jvm`?
Yes, it works. I'll modify it.
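(For context, and from memory of PySpark internals rather than this thread, so treat it as an assumption: `SparkSession.__init__` does roughly `self._jvm = self._sc._jvm`, so both spellings reach the same Py4J JVM view.)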
retest this please
Test build #82198 has finished for PR 19339 at commit.
python/pyspark/sql/readwriter.py (outdated)
```python
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
# [SPARK-22112]
# There aren't any jvm api for creating a dataframe from rdd storing csv.
```
Just a personal preference: `SPARK-22112: ...` or `see SPARK-22112`, if you wouldn't mind ..
ok let me fix it. thanks :)
Yeah, the usual style.
python/pyspark/sql/readwriter.py (outdated)
```diff
@@ -336,6 +336,7 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         ``inferSchema`` option or specify the schema explicitly using ``schema``.
 
         :param path: string, or list of strings, for input path(s).
```
nit: `.` -> `,`
ok thanks :)
LGTM
Test build #82201 has finished for PR 19339 at commit.
Test build #82202 has finished for PR 19339 at commit.
Test build #82203 has finished for PR 19339 at commit.
Umm.. it always tests fine using Python 3.4 locally. I'm not sure why the test sometimes fails on Jenkins... :(
At a quick look, both test failures sound unrelated to the current PR (my rough assumption is that it's, IMHO, instability in Py4J).
@goldmedal, are you online now? How about fixing the PR title to say something like "Supports RDD of strings as input in spark.read.csv in PySpark"?
@HyukjinKwon I have updated the title. Thanks!
Thanks @goldmedal.
Merged to master.
I've tested a few times locally. I can't reproduce the same failure.
```python
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
# see SPARK-22112
# There aren't any jvm api for creating a dataframe from rdd storing csv.
```
Let's fix these comments to read like

```
SPARK-22112: There aren't any jvm api for creating a dataframe from rdd storing csv.
...
```

or

```
There aren't any jvm api ...
...
for creating a dataframe from dataset storing csv. See SPARK-22112.
```

when we happen to fix some code around here, or review other PRs fixing code around here, in the future.
Ok thanks
@HyukjinKwon @viirya Thanks for your review.
What changes were proposed in this pull request?
We added a method to the Scala API for creating a `DataFrame` from a `Dataset[String]` storing CSV in SPARK-15463, but PySpark doesn't have `Dataset` to support this feature. Therefore, I add an API to create a `DataFrame` from an `RDD[String]` storing CSV; it's also consistent with PySpark's `spark.read.json`. For example, as below.
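(The original example was not preserved in this capture; the following is a minimal reconstruction of the intended usage, with illustrative data and an explicit schema. The schema names are assumptions, not from the PR.)

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# An RDD[str] of CSV records (values are illustrative).
rdd = spark.sparkContext.parallelize(["Tom,20", "Alice,25"])

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv(rdd, schema=schema)
df.show()
```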
How was this patch tested?
Added unit test cases.
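A sketch of the kind of test this refers to (hypothetical test name; assumes a test base class exposing `self.spark` and `self.sc`, as in Spark's `python/pyspark/sql/tests.py`):

```python
def test_csv_from_rdd(self):
    # Reading the same records as an RDD[str] of CSV lines should match
    # reading them from a file path.
    path = 'python/test_support/sql/ages.csv'
    df_from_path = self.spark.read.csv(path)
    df_from_rdd = self.spark.read.csv(self.sc.textFile(path))
    self.assertEqual(df_from_path.collect(), df_from_rdd.collect())
```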