[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF #20237
Conversation
Hey @gatorsmile, @ueshin, @BryanCutler and @icexelloss. Let's fix this by clarifying it to avoid potential confusion for now and clear up SPARK-22216's subtasks.
@HyukjinKwon Thanks! I think this is good.
Test build #85974 has finished for PR 20237 at commit
Thanks @HyukjinKwon , just a small suggestion but feel free to use what you currently have too, it sounds fine also.
python/pyspark/sql/functions.py
Outdated
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 8| JOHN DOE| 22|
+----------+--------------+------------+

.. note:: The length of `pandas.Series` within a scalar UDF is not of the whole input column
    but of the batch internally used, and it is called for each batch. Therefore,
Does this sound a little better? "..scalar UDF is not that of the whole input column, but is the length of an internal batch used for each call to the function."
Yup, English isn't really my area :(. Will try to incorporate your suggestion.
python/pyspark/sql/functions.py
Outdated
.. note:: The length of `pandas.Series` within a scalar UDF is not of the whole input column
    but of the batch internally used, and it is called for each batch. Therefore,
    this can be used, for example, to ensure the length of each returned `pandas.Series`
    but should not be used as the length of the whole input.
How does this sound? "..`pandas.Series`, and can not be used as the column length"
Test build #86025 has finished for PR 20237 at commit
…ch batch within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is not that of the whole input column but of the batch. We are fine for a group map UDF because the usage is different from our typical UDF, but scalar UDFs might cause confusion with the normal UDF.

For example, please consider this example:

```python
from pyspark.sql.functions import pandas_udf, col, lit

df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 1|
+------------------+
```

```python
from pyspark.sql.functions import udf, col, lit

df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 4|
+------------------+
```

## How was this patch tested?

Manually built the doc and checked the output.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20237 from HyukjinKwon/SPARK-22980.

(cherry picked from commit cd9f49a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
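The difference above comes from the batching semantics: the engine splits a column into batches and calls a scalar Pandas UDF once per batch, so `len(x)` inside the UDF is a batch length, not the column length. The following is a minimal pure-Python sketch of that behavior (not Spark's actual implementation; `apply_scalar_udf` and `batch_size` are illustrative names):

```python
# Sketch: a scalar Pandas UDF is invoked once per internal batch, so
# len(batch) seen inside the UDF is the batch length, not the column length.
def apply_scalar_udf(udf, column, batch_size):
    out = []
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]  # the UDF only ever sees this slice
        out.extend(udf(batch))
    return out

column = list(range(10))
# A "UDF" that reports the length it observed, once per input element.
result = apply_scalar_udf(lambda b: [len(b)] * len(b), column, batch_size=4)
print(result)  # [4, 4, 4, 4, 4, 4, 4, 4, 2, 2] -- never 10
```

With a column of 10 elements split into batches of 4, the UDF observes lengths 4, 4, and 2, which is why `len(x)` cannot be used as the length of the whole input.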
Merged to master and branch-2.3.
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 8| JOHN DOE| 22|
+----------+--------------+------------+

.. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
    column, but is the length of an internal batch used for each call to the function.
Nit: `but is` -> `but`
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 8| JOHN DOE| 22|
+----------+--------------+------------+

.. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
    column, but is the length of an internal batch used for each call to the function.
    Therefore, this can be used, for example, to ensure the length of each returned
`ensure`? What does this mean? How about `measure`?
I meant to ensure the length of the batch, because we declare "The length of the returned `pandas.Series` must be the same as that of the input `pandas.Series`."
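To illustrate the "ensure the length" point above (a minimal pure-Python sketch; `constant_column_udf` is an illustrative name, not PySpark API): inside a scalar Pandas UDF, `len(batch)` tells you how many output values you must produce, because the returned series has to match the input batch's length.

```python
# Sketch: a scalar Pandas UDF returning a constant value must still
# produce one output per input row in the batch it was handed.
def constant_column_udf(batch):
    # Use the observed batch length to build an output of matching size.
    return ["text"] * len(batch)

batch = [1, 2, 3]
out = constant_column_udf(batch)
assert len(out) == len(batch)  # lengths match, as the contract requires
```

This is the legitimate use of `len` inside a scalar UDF: sizing the returned series per batch, as opposed to treating it as the length of the whole column.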
No, there are many other functions, but this specific case could bring confusion, as the length is neither the length of the value nor the length of the whole input column. In other cases, calling other functions on the pandas Series usually produces the expected results.
The newly added description is not clear to most Spark users. I think the descriptions added by this PR do not explain the common error cases pointed out in the JIRA.
Mind if I ask what you expect to fix, @gatorsmile? It's clear and explains the results.