[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. #20507
Conversation
cc @BryanCutler @icexelloss @HyukjinKwon
Test build #87063 has finished for PR 20507 at commit
LGTM. I don't have a better idea. Just two nits I found while double checking.
python/pyspark/sql/tests.py
Outdated
import pandas as pd
df = self.spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
res = df.select(str_f(col('id')))
How about variable names 'expected' and 'actual'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll update it.
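For illustration only, here is one hedged guess at what the renamed comparison might look like; the column name, the expected side, and the use of a plain cast are assumptions, not necessarily the test that was merged:

```python
# Hypothetical sketch of the renamed test body; assumes an active
# SparkSession bound to `spark`. Names here are illustrative only.
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")

expected = df.select(col('id').cast('string').alias('str_id')).collect()
actual = df.select(str_f(col('id')).alias('str_id')).collect()
assert expected == actual  # ids round-trip as their string representations
```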
python/pyspark/sql/tests.py
Outdated
from pyspark.sql.functions import pandas_udf, col
import pandas as pd
df = self.spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
Not a big deal. How about `pd.Series(map(str, x))`?
Sounds good. I'll take it.
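As a side note (not from the PR itself), the two constructions produce the same Series; a standalone comparison:

```python
import pandas as pd

x = pd.Series(range(3))
a = pd.Series(["%s" % i for i in x])  # original list comprehension
b = pd.Series(map(str, x))            # suggested form, same output
assert a.equals(b)                    # both yield ['0', '1', '2']
```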
Test build #87069 has finished for PR 20507 at commit
python/pyspark/serializers.py
Outdated
@@ -230,6 +230,9 @@ def create_array(s, t):
         s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
         # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
         return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+    elif t is not None and pa.types.is_string(t) and sys.version < '3':
+        # TODO: need decode before converting to Arrow in Python 2
+        return pa.Array.from_pandas(s.str.decode('utf-8'), mask=mask, type=t)
@ueshin, actually, how about `s.apply(lambda v: v.decode("utf-8") if isinstance(v, str) else v)`, to allow non-ascii-encodable unicodes too, like u"아"? I was worried about performance, but I ran a simple perf test vs `s.str.decode('utf-8')` to be sure. Seems actually fine.
Good catch! I'll take it. Thanks!
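Presumably the merged code then swapped `s.str.decode('utf-8')` for the `apply`-based decode; a self-contained Python 2 sketch of that step (the helper name is hypothetical, not Spark's API):

```python
import sys

def _decode_str_values(s):
    # Hypothetical helper illustrating the suggestion above (Python 2).
    # Decode only bytes ('str') values and leave unicode values untouched:
    # s.str.decode('utf-8') would fail on non-ascii unicode such as u"아",
    # because Python 2 implicitly encodes with ascii before decoding.
    if sys.version < '3':
        return s.apply(lambda v: v.decode("utf-8") if isinstance(v, str) else v)
    return s
```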
@BryanCutler Btw, do you think this is a bug in pyarrow in Python 2?
Test build #87083 has finished for PR 20507 at commit
also cc @cloud-fan @gatorsmile @sameeragarwal
Sorry I've been travelling, but I'll try to look into this soon on the Arrow side to see if it is a bug in pyarrow. The workaround here seems fine to me.
Merged to master and branch-2.3.
…() to handle str type properly in Python 2.

## What changes were proposed in this pull request?

In Python 2, when `pandas_udf` tries to return a string type value created in the udf with `".."`, the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...
java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
...
```

Seems like pyarrow ignores the `type` parameter for `pa.Array.from_pandas()` and considers it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2. This PR adds a workaround for the case.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20507 from ueshin/issues/SPARK-23334.

(cherry picked from commit 63c5bf1)

Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Thanks! @HyukjinKwon @BryanCutler
I made https://issues.apache.org/jira/browse/ARROW-2101 to track the issue in Arrow |
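For reference, the pyarrow-level behavior the PR works around can be reproduced directly; a minimal sketch, assuming Python 2 and a pyarrow version from that era (around 0.8):

```python
# Python 2 repro sketch (tracked upstream as ARROW-2101): 'str' values are
# bytes, so pyarrow infers binary even when type=pa.string() is requested.
import pandas as pd
import pyarrow as pa

s = pd.Series(["0", "1", "2"])  # Python 2 'str' values, i.e. bytes
print(pa.Array.from_pandas(s, type=pa.string()).type)  # binary, not string

decoded = s.apply(lambda v: v.decode("utf-8") if isinstance(v, str) else v)
print(pa.Array.from_pandas(decoded, type=pa.string()).type)  # string
```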