[SPARK-40559][PYTHON][DOCS][FOLLOW-UP] Fix the docstring and document both applyInArrows #44139

2 changes: 2 additions & 0 deletions python/docs/source/reference/pyspark.sql/grouping.rst
@@ -26,6 +26,7 @@ Grouping

GroupedData.agg
GroupedData.apply
+GroupedData.applyInArrow
GroupedData.applyInPandas
GroupedData.applyInPandasWithState
GroupedData.avg
@@ -36,4 +37,5 @@ Grouping
GroupedData.min
GroupedData.pivot
GroupedData.sum
+PandasCogroupedOps.applyInArrow
PandasCogroupedOps.applyInPandas
15 changes: 7 additions & 8 deletions python/pyspark/sql/pandas/group_ops.py
@@ -407,7 +407,6 @@ def applyInArrow(
>>> df.groupby("id").applyInArrow(
... normalize, schema="id long, v double").show() # doctest: +SKIP
-+---+-------------------+
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
@@ -467,7 +466,7 @@ def applyInArrow(
into memory, so the user should be aware of the potential OOM risk if data is skewed
and certain groups are too large to fit in memory.

-This API is experimental.
+This API is unstable, and for developers.

See Also
--------
@@ -634,9 +633,9 @@ def applyInArrow(
Applies a function to each cogroup using Arrow and returns the result
as a `DataFrame`.

-The function should take two `pyarrow.Table`s and return another
+The function should take two `pyarrow.Table`\\s and return another
`pyarrow.Table`. Alternatively, the user can pass a function that takes
-a tuple of `pyarrow.Scalar` grouping key(s) and the two `pyarrow.Table`s.
+a tuple of `pyarrow.Scalar` grouping key(s) and the two `pyarrow.Table`\\s.
For each side of the cogroup, all columns are passed together as a
`pyarrow.Table` to the user-function and the returned `pyarrow.Table` are combined as
a :class:`DataFrame`.
@@ -652,9 +651,9 @@ def applyInArrow(
Parameters
----------
func : function
-a Python native function that takes two `pyarrow.Table`s, and
+a Python native function that takes two `pyarrow.Table`\\s, and
[Review comment, Member Author]

Otherwise, it fails as below:

    /__w/spark/spark/python/pyspark/sql/pandas/group_ops.py:docstring of pyspark.sql.pandas.group_ops.PandasCogroupedOps.applyInArrow:73: Inline interpreted text or phrase reference start-string without end-string.

This is consistent with applyInPandas.

outputs a `pyarrow.Table`, or that takes one tuple (grouping keys) and two
-``pyarrow.Table``s, and outputs a ``pyarrow.Table``.
+``pyarrow.Table``\\s, and outputs a ``pyarrow.Table``.
schema : :class:`pyspark.sql.types.DataType` or str
the return type of the `func` in PySpark. The value can be either a
:class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
@@ -683,7 +682,7 @@ def applyInArrow(
the grouping key(s) will be passed as the first argument and the data will be passed as the
second and third arguments. The grouping key(s) will be passed as a tuple of Arrow scalars
types, e.g., `pyarrow.Int32Scalar` and `pyarrow.FloatScalar`. The data will still be passed
-in as two `pyarrow.Table`s containing all columns from the original Spark DataFrames.
+in as two `pyarrow.Table`\\s containing all columns from the original Spark DataFrames.

>>> def summarize(key, l, r):
... return pyarrow.Table.from_pydict({
@@ -707,7 +706,7 @@ def applyInArrow(
into memory, so the user should be aware of the potential OOM risk if data is skewed
and certain groups are too large to fit in memory.

-This API is experimental.
+This API is unstable, and for developers.

See Also
--------