
[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. #20507

Closed · wants to merge 3 commits

Conversation

ueshin (Member) commented Feb 5, 2018

What changes were proposed in this pull request?

In Python 2, when a pandas_udf tries to return string-type values created inside the UDF as plain str literals ("..", not u".."), the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)

...
```

It seems that pyarrow ignores the type parameter of pa.Array.from_pandas() and treats the values as binary type when the requested type is string type but the values are str instead of unicode in Python 2.
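
A minimal sketch of the behavior described above (Python 2 semantics assumed; under Python 3 every str is already unicode, so the problem does not arise):

```python
import pandas as pd
import pyarrow as pa

s = pd.Series(["a", "b"])  # plain str (byte string) values in Python 2

# The requested type is string, but with Python 2 str values pyarrow
# falls back to binary, which later fails Spark's schema assertion.
arr = pa.Array.from_pandas(s, type=pa.string())
print(arr.type)  # expected: string; observed on Python 2: binary
```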

This PR adds a workaround for this case.

How was this patch tested?

Added a test, and ran the existing tests.

ueshin (Member, Author) commented Feb 5, 2018

cc @BryanCutler @icexelloss @HyukjinKwon
Could you help me double-check this?
Since this seems to happen only in a Python 2 environment, Jenkins will skip the tests.
And let me know if you know a better workaround.

SparkQA commented Feb 5, 2018

Test build #87063 has finished for PR 20507 at commit 47b8873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) left a comment

LGTM. I don't have a better idea. Just two nits I found while double-checking.

```python
import pandas as pd
df = self.spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
res = df.select(str_f(col('id')))
```
HyukjinKwon (Member) commented:
How about variable names 'expected' and 'actual'?

ueshin (Member, Author) replied:
Sure, I'll update it.

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd
df = self.spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
```
HyukjinKwon (Member) commented:
Not a big deal. How about pd.Series(map(str, x))?

ueshin (Member, Author) replied:
Sounds good. I'll take it.

SparkQA commented Feb 5, 2018

Test build #87069 has finished for PR 20507 at commit 06ae568.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -230,6 +230,9 @@ def create_array(s, t):
             s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
             # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
             return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+        elif t is not None and pa.types.is_string(t) and sys.version < '3':
+            # TODO: need decode before converting to Arrow in Python 2
+            return pa.Array.from_pandas(s.str.decode('utf-8'), mask=mask, type=t)
```
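
For illustration, a minimal standalone sketch of what the added branch achieves (Python 2 assumed; the Series here is hypothetical, not the variable from the diff):

```python
import pandas as pd
import pyarrow as pa

s = pd.Series(["a", "b"])  # plain str values in Python 2

# Decoding to unicode before conversion makes pyarrow honor the
# requested string type instead of falling back to binary.
arr = pa.Array.from_pandas(s.str.decode('utf-8'), type=pa.string())
print(arr.type)  # string
```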
HyukjinKwon (Member) commented Feb 5, 2018

@ueshin, actually, how about s.apply(lambda v: v.decode("utf-8") if isinstance(v, str) else v) to also allow unicode values that aren't ASCII-encodable, like u"아"? I was worried about performance, but I ran a simple perf test against s.str.decode('utf-8') to be sure. It seems actually fine.
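
A minimal sketch of the difference this suggestion makes (Python 2 semantics assumed; the mixed-content Series is illustrative):

```python
import pandas as pd

# In Python 2, a Series can mix str (bytes) and unicode values.
s = pd.Series(["abc", u"아"])

# s.str.decode('utf-8') tries to decode the unicode value as well; Python 2
# implicitly encodes it as ASCII first, raising UnicodeEncodeError for
# non-ASCII characters like u"아".
# The apply-based version decodes only actual str values:
decoded = s.apply(lambda v: v.decode("utf-8") if isinstance(v, str) else v)
```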

ueshin (Member, Author) replied:
Good catch! I'll take it. Thanks!

ueshin (Member, Author) commented Feb 6, 2018

It seems that pyarrow ignores the type parameter of pa.Array.from_pandas() and treats the values as binary type when the requested type is string type but the values are str instead of unicode in Python 2.

@BryanCutler Btw, do you think this is a bug in pyarrow in Python 2?

SparkQA commented Feb 6, 2018

Test build #87083 has finished for PR 20507 at commit b3d5209.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ueshin (Member, Author) commented Feb 6, 2018

also cc @cloud-fan @gatorsmile @sameeragarwal

BryanCutler (Member) commented:

Sorry I've been travelling, but I'll try to look into this soon on the Arrow side to see if it is a bug in pyarrow. The workaround here seems fine to me.

HyukjinKwon (Member) commented:

Merged to master and branch-2.3.

asfgit pushed a commit that referenced this pull request Feb 6, 2018
[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20507 from ueshin/issues/SPARK-23334.

(cherry picked from commit 63c5bf1)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
asfgit closed this in 63c5bf1 on Feb 6, 2018
ueshin (Member, Author) commented Feb 6, 2018

Thanks! @HyukjinKwon @BryanCutler

BryanCutler (Member) commented:
I made https://issues.apache.org/jira/browse/ARROW-2101 to track the issue in Arrow.
