Skip to content

Commit

Permalink
[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of ea…
Browse files Browse the repository at this point in the history
…ch batch within scalar Pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add a note that saying the length of a scalar Pandas UDF's `Series` is not of the whole input column but of the batch.

We are fine for a group map UDF because the usage is different from our typical UDF but scalar UDFs might cause confusion with the normal UDF.

For example, please consider this example:

```python
from pyspark.sql.functions import pandas_udf, col, lit

df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 1|
+------------------+
```

```python
from pyspark.sql.functions import udf, col, lit

df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```

```
+------------------+
|<lambda>(text, id)|
+------------------+
|                 4|
+------------------+
```

## How was this patch tested?

Manually built the doc and checked the output.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20237 from HyukjinKwon/SPARK-22980.
  • Loading branch information
HyukjinKwon committed Jan 13, 2018
1 parent 55dbfbc commit cd9f49a
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions python/pyspark/sql/functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 8| JOHN DOE| 22|
+----------+--------------+------------+
.. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
column, but is the length of an internal batch used for each call to the function.
Therefore, this can be used, for example, to ensure the length of each returned
`pandas.Series`, and can not be used as the column length.
2. GROUP_MAP
A group map UDF defines transformation: A `pandas.DataFrame` -> A `pandas.DataFrame`
Expand Down

0 comments on commit cd9f49a

Please sign in to comment.