[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF #20237
Conversation
Hey @gatorsmile, @ueshin, @BryanCutler and @icexelloss. Let's fix this by clarifying it to avoid potential confusion for now and clear up SPARK-22216's subtasks.
@HyukjinKwon Thanks! I think this is good.
Test build #85974 has finished for PR 20237 at commit
Thanks @HyukjinKwon , just a small suggestion but feel free to use what you currently have too, it sounds fine also.
python/pyspark/sql/functions.py
Outdated
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 8| JOHN DOE| 22|
+----------+--------------+------------+

.. note:: The length of `pandas.Series` within a scalar UDF is not of the whole input column
    but of the batch internally used, and it is called for each batch. Therefore,
Does this sound a little better? "..scalar UDF is not that of the whole input column, but is the length of an internal batch used for each call to the function."
Yup, English isn't really my area :(. Will try to incorporate your suggestion.
python/pyspark/sql/functions.py
Outdated
.. note:: The length of `pandas.Series` within a scalar UDF is not of the whole input column
    but of the batch internally used, and it is called for each batch. Therefore,
    this can be used, for example, to ensure the length of each returned `pandas.Series`
    but should not be used as the length of the whole input.
How does this sound? "..`pandas.Series`, and can not be used as the column length"
Test build #86025 has finished for PR 20237 at commit
…ch batch within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is not that of the whole input column but of the batch. We are fine for a group map UDF because the usage is different from our typical UDF, but scalar UDFs might cause confusion with the normal UDF.

For example, please consider this example:

```python
from pyspark.sql.functions import pandas_udf, col, lit

df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 1|
+------------------+
```

```python
from pyspark.sql.functions import udf, col, lit

df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 4|
+------------------+
```

## How was this patch tested?

Manually built the doc and checked the output.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20237 from HyukjinKwon/SPARK-22980.

(cherry picked from commit cd9f49a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
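The difference above comes from the batching semantics: the engine splits a column into batches and calls a scalar Pandas UDF once per batch, so `len(x)` inside the UDF is a batch length, not the column length. The following is a minimal pure-Python sketch of that behavior (not Spark's actual implementation; `apply_scalar_udf` and `batch_size` are illustrative names):

```python
# Sketch: a scalar Pandas UDF is invoked once per internal batch, so
# len(batch) seen inside the UDF is the batch length, not the column length.
def apply_scalar_udf(udf, column, batch_size):
    out = []
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]  # the UDF only ever sees this slice
        out.extend(udf(batch))
    return out

column = list(range(10))
# A "UDF" that reports the length it observed, once per input element.
result = apply_scalar_udf(lambda b: [len(b)] * len(b), column, batch_size=4)
print(result)  # [4, 4, 4, 4, 4, 4, 4, 4, 2, 2] -- never 10
```

With a column of 10 elements split into batches of 4, the UDF observes lengths 4, 4, and 2, which is why `len(x)` cannot be used as the length of the whole input.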
Merged to master and branch-2.3.
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 8| JOHN DOE| 22|
+----------+--------------+------------+

.. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
    column, but is the length of an internal batch used for each call to the function.
Nit: `but is` -> `but`
@@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 8| JOHN DOE| 22|
+----------+--------------+------------+

.. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
    column, but is the length of an internal batch used for each call to the function.
    Therefore, this can be used, for example, to ensure the length of each returned
`ensure`? What does this mean? How about `measure`?
I meant to ensure the length of the batch, because we declare "The length of the returned `pandas.Series` must be the same as that of the input `pandas.Series`."
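To illustrate the "ensure the length" point above (a minimal pure-Python sketch; `constant_column_udf` is an illustrative name, not PySpark API): inside a scalar Pandas UDF, `len(batch)` tells you how many output values you must produce, because the returned series has to match the input batch's length.

```python
# Sketch: a scalar Pandas UDF returning a constant value must still
# produce one output per input row in the batch it was handed.
def constant_column_udf(batch):
    # Use the observed batch length to build an output of matching size.
    return ["text"] * len(batch)

batch = [1, 2, 3]
out = constant_column_udf(batch)
assert len(out) == len(batch)  # lengths match, as the contract requires
```

This is the legitimate use of `len` inside a scalar UDF: sizing the returned series per batch, as opposed to treating it as the length of the whole column.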
No, there are many other functions, but this specific case could bring confusion, as the length is neither the length of the value nor the length of the whole input column. In other cases, calling other functions on the pandas Series usually produces the expected results.
The newly added description is not clear to most Spark users. I think the descriptions added by this PR do not explain the common error cases pointed out in the JIRA.
Mind if I ask what you expect to fix, @gatorsmile? It's clear and explains the results.