
[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark #19339

Closed · wants to merge 10 commits

Conversation

goldmedal
Contributor

What changes were proposed in this pull request?

We added a method to the Scala API for creating a DataFrame from Dataset[String] storing CSV in SPARK-15463, but PySpark doesn't have Dataset, so it can't use that API directly. Therefore, I add an API to create a DataFrame from RDD[String] storing CSV; this is also consistent with PySpark's spark.read.json.

For example:

>>> rdd = sc.textFile('python/test_support/sql/ages.csv')
>>> df2 = spark.read.csv(rdd)
>>> df2.dtypes
[('_c0', 'string'), ('_c1', 'string')]
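
An explicit schema can also be passed together with the RDD, the same as when reading from file paths (the column names below are illustrative and assume the two CSV columns hold a name and an age):

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> spark.read.csv(rdd, schema=schema).dtypes
[('name', 'string'), ('age', 'int')]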

How was this patch tested?

Added unit test cases.
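
A rough sketch of the kind of test case added (the method name and data are illustrative, not the exact test in the patch):

    def test_csv_from_rdd_of_strings(self):
        # With no schema given, default column names _c0, _c1, ... are assigned.
        rdd = self.sc.parallelize(["Joe,20", "Tom,30"])
        df = self.spark.read.csv(rdd)
        self.assertEqual(df.columns, ['_c0', '_c1'])
        self.assertEqual(df.count(), 2)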

@goldmedal
Contributor Author

@HyukjinKwon @viirya Could you review this PR? Thanks! :)

* @since 2.2.0
*/
@deprecated("Use csv(Dataset[String]) instead.", "2.2.0")
def csv(csvRDD: RDD[String]): DataFrame = {
Member

@HyukjinKwon HyukjinKwon Sep 25, 2017

Wait ... I think we shouldn't introduce an RDD API on the Scala side. I was thinking of doing this within the Python side, or maybe adding a private wrapper on the Scala side if required. Will take a closer look tomorrow (KST).

Contributor Author

Thanks for your review :)
Umm.. I followed spark.read.json's way to add them. Although json(jsonRDD: RDD[String]) has been deprecated, PySpark still uses it to create a DataFrame. I think adding a private wrapper in Scala may be better, because not only PySpark but also SparkR may need it.

Member

Yep. +1 for @HyukjinKwon's advice. We cannot add a deprecated method which doesn't exist in 2.2.0 at all.

Member

Yeah, it's weird to add a deprecated method. :) We should either add a special wrapper for this purpose or do this on the Python side, if possible and not complicated.

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Sep 25, 2017

Test build #82148 has finished for PR 19339 at commit 7525b48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

yield x
keyed = path.mapPartitions(func)
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
Member

I tried a way within Python and it seems to work:

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 1ed452d895b..0f54065b3ee 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -438,7 +438,10 @@ class DataFrameReader(OptionUtils):
             keyed = path.mapPartitions(func)
             keyed._bypass_serializer = True
             jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
-            return self._df(self._jreader.csv(jrdd))
+            jdataset = self._spark._jsqlContext.createDataset(
+                jrdd.rdd(),
+                self._spark._sc._jvm.Encoders.STRING())
+            return self._df(self._jreader.csv(jdataset))
         else:
             raise TypeError("path can be only string, list or RDD")

Member

@goldmedal, it'd be great if you could double-check whether this really works and whether it can be shortened or made cleaner. This was just a rough attempt to reach the goal, so I am not sure it is the best way.

Contributor Author

@goldmedal goldmedal Sep 26, 2017

OK, this way looks good. I'll try it. Thanks for your suggestion.

@goldmedal
Contributor Author

@HyukjinKwon I think your way works fine after fixing a variable name bug (_jsqlContext >> _ssql_ctx). Should we also modify the json part to be consistent with the csv part?

@viirya
Member

viirya commented Sep 26, 2017

As it relies on a deprecated API, I think it is also worth changing PySpark's json to use Dataset. But that is better done in another PR.

@HyukjinKwon
Member

Yea, let's do it separately.

@goldmedal
Contributor Author

OK, so should I create another JIRA for this?

@SparkQA

SparkQA commented Sep 26, 2017

Test build #82195 has finished for PR 19339 at commit 4040103.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Hm, I'd actually leave it for now. I am not sure we should fix it here, since we could sweep it out later when we remove the deprecated ones together, and in its current state it doesn't cause any problems (e.g., build warnings), if I understood correctly. I won't stand against it, but I don't support it either. Let's go ahead with this one first.

@goldmedal
Contributor Author

This is so weird. It runs fine for me with Python 3.5.2, but it seems to have some problem with Python 3.4. Let me try Python 3.4 locally.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM

keyed = path.mapPartitions(func)
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
jdataset = self._spark._ssql_ctx.createDataset(
Member

Let's add a small comment here to explain why we need to create the dataset (which could look a bit weird in PySpark, I believe).

jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
jdataset = self._spark._ssql_ctx.createDataset(
jrdd.rdd(),
self._spark._sc._jvm.Encoders.STRING())
Member

Could we replace _spark._sc._jvm with _spark._jvm?

Contributor Author

Yes, it works. I'll modify it.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Sep 26, 2017

Test build #82198 has finished for PR 19339 at commit 4040103.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
# [SPARK-22112]
# There aren't any jvm api for creating a dataframe from rdd storing csv.
Member

Just a personal preference: "SPARK-22112: ..." or "see SPARK-22112", if you wouldn't mind.

Contributor Author

OK, let me fix it. Thanks :)

Member

Yeah, the usual style.

@@ -336,6 +336,7 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
``inferSchema`` option or specify the schema explicitly using ``schema``.

:param path: string, or list of strings, for input path(s).
Member

nit: . -> ,

Contributor Author

ok thanks :)

@viirya
Member

viirya commented Sep 26, 2017

LGTM

@SparkQA

SparkQA commented Sep 26, 2017

Test build #82201 has finished for PR 19339 at commit f542967.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 26, 2017

Test build #82202 has finished for PR 19339 at commit 5988336.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 26, 2017

Test build #82203 has finished for PR 19339 at commit 032b0c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@goldmedal
Contributor Author

goldmedal commented Sep 26, 2017

Umm.. it always runs fine with Python 3.4 on my local machine. I'm not sure why it sometimes fails on Jenkins... :(

@HyukjinKwon
Member

At a quick look, both test failures are:

  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/readwriter.py", line 303, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))

They don't look related to the current PR (my rough guess is that it's Py4J instability).

@HyukjinKwon
Member

@goldmedal, are you online now? How about changing the PR title to something like "Supports RDD of strings as input in spark.read.csv in PySpark"?

@goldmedal changed the title from "[SPARK-22112][PYSPARK] Add an API to create a DataFrame from RDD[String] storing CSV" to "[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark" on Sep 27, 2017
@goldmedal
Contributor Author

@HyukjinKwon I have updated the title. Thanks!

@HyukjinKwon
Member

Thanks @goldmedal.

@HyukjinKwon
Member

Merged to master.

@viirya
Member

viirya commented Sep 27, 2017

I've tested a few times locally and can't reproduce the same failure.

@asfgit asfgit closed this in 1fdfe69 Sep 27, 2017
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
# see SPARK-22112
# There aren't any jvm api for creating a dataframe from rdd storing csv.
Member

Let's fix these comments like,

SPARK-22112: There aren't any jvm api for creating a dataframe from rdd storing csv.
...

or

There aren't any jvm api ...
...
for creating a dataframe from dataset storing csv. See SPARK-22112.

when we happen to fix some code around here, or review other PRs touching this area, in the future.

Contributor Author

OK, thanks.
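
For reference, a rough sketch of what the RDD branch of DataFrameReader.csv converges to once the feedback above is applied (reconstructed from the snippets quoted in this thread; the exact names and comment wording may differ slightly from the merged commit):

    elif isinstance(path, RDD):
        def func(iterator):
            for x in iterator:
                # ship each line to the JVM as UTF-8 bytes; BytesToString
                # decodes them back into java.lang.Strings
                # (readwriter.py aliases unicode to str on Python 3)
                if isinstance(x, unicode):
                    x = x.encode("utf-8")
                yield x
        keyed = path.mapPartitions(func)
        keyed._bypass_serializer = True
        jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
        # see SPARK-22112: there is no JVM API for creating a DataFrame from
        # an RDD of CSV strings, so wrap the RDD in a JVM Dataset[String]
        # first and use the JVM API that reads CSV from a Dataset.
        jdataset = self._spark._ssql_ctx.createDataset(
            jrdd.rdd(),
            self._spark._jvm.Encoders.STRING())
        return self._df(self._jreader.csv(jdataset))
    else:
        raise TypeError("path can be only string, list or RDD")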

@goldmedal
Contributor Author

@HyukjinKwon @viirya Thanks for your reviews.
