
[SPARK-40559][PYTHON] Add applyInArrow to groupBy and cogroup #38624

Closed

Conversation

EnricoMi
Contributor

@EnricoMi EnricoMi commented Nov 11, 2022

What changes were proposed in this pull request?

Add an `applyInArrow` method to PySpark `groupBy` and `groupBy.cogroup` to allow for user functions that work on Arrow, similar to the existing `mapInArrow`.

Why are the changes needed?

PySpark allows transforming a `DataFrame` via the Pandas and Arrow APIs:

```
df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")
```

For `df.groupBy(...)` and `df.groupBy(...).cogroup(...)`, there is only a Pandas interface, no Arrow interface:

df.groupBy("id").applyInPandas(apply_pandas, schema="...")

Providing a pure Arrow interface allows user code to use **any** Arrow-based data framework, not only Pandas, e.g. Polars:

```
import polars
import pyarrow

def apply_polars(df: polars.DataFrame) -> polars.DataFrame:
    return df

def apply_arrow(table: pyarrow.Table) -> pyarrow.Table:
    df = polars.from_arrow(table)
    return apply_polars(df).to_arrow()

df.groupBy("id").applyInArrow(apply_arrow, schema="...")
```

Does this PR introduce any user-facing change?

This adds the method `applyInArrow` to PySpark `groupBy` and `groupBy.cogroup`.

How was this patch tested?

Tested with unit tests.

@AmplabJenkins

Can one of the admins verify this patch?

@EnricoMi EnricoMi force-pushed the branch-pyspark-grouped-apply-in-arrow branch from 8a4fdcd to 208ee90 on November 25, 2022
@github-actions github-actions bot added the BUILD label Nov 25, 2022
@EnricoMi
Contributor Author

EnricoMi commented Dec 2, 2022

@HyukjinKwon what do you think?

@goodwanghan

@EnricoMi @HyukjinKwon I think this is a critical feature that is missing from current PySpark. Can we consider merging this change?

@EnricoMi EnricoMi force-pushed the branch-pyspark-grouped-apply-in-arrow branch 2 times, most recently from c3e5647 to 89d4acc on March 6, 2023
@EnricoMi EnricoMi force-pushed the branch-pyspark-grouped-apply-in-arrow branch from 89d4acc to ac936c1 on March 13, 2023
@github-actions github-actions bot removed the BUILD label Mar 13, 2023
@EnricoMi
Contributor Author

CC @xinrong-meng

Contributor

@Kimahriman Kimahriman left a comment


This would definitely be useful to have! Left one doc typo comment and one question about iteration.

Comment on lines 172 to 188
```
batch_iter = [
    (batch, arrow_type)
    for batches, arrow_type in iterator  # tuple constructed in wrap_grouped_map_arrow_udf
    for batch in batches
]

if self._assign_cols_by_name:
    batch_iter = [
        (
            pa.RecordBatch.from_arrays(
                [batch.column(field.name) for field in arrow_type],
                names=[field.name for field in arrow_type],
            ),
            arrow_type,
        )
        for batch, arrow_type in batch_iter
    ]
```
Contributor

Are these list comprehensions going to materialize the entire result set before actually sending anything back to the JVM?

Contributor Author

Thanks for highlighting this, I have changed this to a generator.
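For illustration, the two list comprehensions above rewritten as generator expressions, so pairs are streamed instead of materialized (a sketch; the merged code may differ):

```
# Sketch of the generator-based fix: same pairs as before, but produced
# lazily so the whole result set is never held in memory at once.
batch_iter = (
    (batch, arrow_type)
    for batches, arrow_type in iterator  # tuple constructed in wrap_grouped_map_arrow_udf
    for batch in batches
)

if self._assign_cols_by_name:
    batch_iter = (
        (
            pa.RecordBatch.from_arrays(
                [batch.column(field.name) for field in arrow_type],
                names=[field.name for field in arrow_type],
            ),
            arrow_type,
        )
        for batch, arrow_type in batch_iter
    )
```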

@EnricoMi EnricoMi force-pushed the branch-pyspark-grouped-apply-in-arrow branch from 843b13a to cf86682 on June 30, 2023
@EnricoMi EnricoMi force-pushed the branch-pyspark-grouped-apply-in-arrow branch 4 times, most recently from 9705a70 to 839c50a on July 18, 2023
@EnricoMi
Contributor Author

@xinrong-meng @HyukjinKwon rebased with master and conflicts resolved

@EnricoMi
Contributor Author

@Kimahriman you mentioned this would fix a 2GB memory limit?

@Kimahriman
Contributor

Kimahriman commented Jul 18, 2023

> @Kimahriman you mentioned this would fix a 2GB memory limit?

Yeah, combined with the new setting `spark.sql.execution.arrow.useLargeVarTypes`, it should allow getting around the 2 GiB limit on a single string/binary column being returned from an `applyInPandas` function (by using `applyInArrow` instead).
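For reference, a hedged sketch of that combination (`spark.sql.execution.arrow.useLargeVarTypes` is the conf named above; the pass-through function is illustrative):

```
import pyarrow as pa

# Illustrative only: with large var types enabled, Arrow uses
# large_string / large_binary (64-bit offsets) on the Python side,
# so string/binary data is no longer capped by 32-bit offsets.
spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", "true")

def passthrough(table: pa.Table) -> pa.Table:
    return table

df.groupBy("id").applyInArrow(passthrough, schema=df.schema)
```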

@ion-elgreco

Looking forward to seeing this PR get merged :)

@igorghi

igorghi commented Aug 18, 2023

Any updates on getting this merged?

@ion-elgreco

@dongjoon-hyun @zhengruifeng @allisonwang-db @xinrong-meng @HyukjinKwon
Are there any updates on this PR? This would be a very useful feature for scaling other DataFrame libraries that use Arrow with Spark.

@HyukjinKwon
Member

qq, can't we work around this with `df.repartitionByExpression().mapInArrow()` for the groupBy case?
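(For context, such a workaround might look like the sketch below; in PySpark the call is `df.repartition(...)`, and `apply_arrow` is as defined in the PR description. Unlike `applyInArrow`, `mapInArrow` hands the function per-partition batch iterators, so it has to regroup by key itself:)

```
import pyarrow as pa
import pyarrow.compute as pc

def per_partition(batches):
    # One partition can hold several groups, and one group can span
    # several batches, so materialize the partition and split manually.
    table = pa.Table.from_batches(list(batches))
    for key in pc.unique(table.column("id")).to_pylist():
        group = table.filter(pc.equal(table.column("id"), pa.scalar(key)))
        yield from apply_arrow(group).to_batches()

df.repartition("id").mapInArrow(per_partition, schema="...")
```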

@HyukjinKwon
Member

I get that cogroup might not be possible that way, though. But we can just convert Pandas back to Arrow batches easily. Is this really required for some scenario? IIRC this is only useful for addressing nested types.
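(A sketch of what "convert Pandas back to Arrow" means in practice, reusing `apply_arrow` from the PR description; the round-trip is easy, though it can be lossy or slow for nested types, as noted:)

```
import pandas as pd
import pyarrow as pa

def apply_pandas(pdf: pd.DataFrame) -> pd.DataFrame:
    # Round-trip through Arrow behind the existing Pandas API.
    table = pa.Table.from_pandas(pdf)
    return apply_arrow(table).to_pandas()

df.groupBy("id").applyInPandas(apply_pandas, schema="...")
```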

dongjoon-hyun pushed a commit that referenced this pull request Dec 4, 2023
… both applyInArrows

### What changes were proposed in this pull request?

This PR is a followup of #38624 that documents both applyInArrows with a docstring fix.

### Why are the changes needed?

For end users to refer to the API documentation.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Existing CI, and documentation build.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44139 from HyukjinKwon/SPARK-40559-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
HyukjinKwon added a commit that referenced this pull request Dec 4, 2023
…p in Spark Connect

### What changes were proposed in this pull request?

This PR implements the Spark Connect version of #38624.

### Why are the changes needed?

For feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new API for Python Spark Connect client.

### How was this patch tested?

Reused unittest and doctests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44146 from HyukjinKwon/connect-arrow-api.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
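(Usage-wise, nothing changes for the client: with Spark Connect the same call runs against a remote session. A sketch, assuming a Connect server at the given address and `apply_arrow` as defined in the PR description:)

```
from pyspark.sql import SparkSession

# Assumption: a Spark Connect server is listening at this URL.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v"))
df.groupBy("id").applyInArrow(apply_arrow, schema="...")
```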
asl3 pushed a commit to asl3/spark that referenced this pull request Dec 5, 2023
asl3 pushed a commit to asl3/spark that referenced this pull request Dec 5, 2023
### What changes were proposed in this pull request?

This PR proposes to use `inspect.getfullargspec` instead of unimported `getfullargspec`. This PR is a followup of apache#38624.
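(Illustratively, the fix amounts to qualifying the call, e.g.:)

```
import inspect

argspec = inspect.getfullargspec(chained_func)  # signature was lost when wrapping it
```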

### Why are the changes needed?

To recover the CI.

It fails as below:

```
./python/pyspark/worker.py:749:19: F821 undefined name 'getfullargspec'
        argspec = getfullargspec(chained_func)  # signature was lost when wrapping it
                  ^
./python/pyspark/worker.py:757:19: F821 undefined name 'getfullargspec'
        argspec = getfullargspec(chained_func)  # signature was lost when wrapping it
```

https://github.com/apache/spark/actions/runs/7080907452/job/19269484124

It was caused by the logical conflict w/ apache@f5e4e84

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested via `linter-python`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#44141 from HyukjinKwon/SPARK-40559-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
asl3 pushed a commit to asl3/spark that referenced this pull request Dec 5, 2023
… both applyInArrows
asl3 pushed a commit to asl3/spark that referenced this pull request Dec 5, 2023
…p in Spark Connect
@ion-elgreco

@HyukjinKwon which Spark release will we see this feature in? :D

@EnricoMi
Contributor Author

EnricoMi commented Dec 5, 2023

That will be 4.0.0.

@EnricoMi EnricoMi deleted the branch-pyspark-grouped-apply-in-arrow branch December 5, 2023 19:58
@EnricoMi
Contributor Author

EnricoMi commented Dec 5, 2023

@HyukjinKwon thanks for merging!

@ion-elgreco

@EnricoMi do you know by any chance when that is targeted for?

@HyukjinKwon
Member

Around next June

dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
… both applyInArrows
dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
…p in Spark Connect
HyukjinKwon added a commit that referenced this pull request Dec 16, 2023
### What changes were proposed in this pull request?

This PR is a sort of followup of #38624 that proposes to rename the plan nodes for Python as below:

From:

```
package org.apache.spark.sql.execution.python

MapInBatchExec
├── MapInPandasExec
└── *PythonMapInArrowExec* (and *PythonMapInArrow*)

*FlatMapCoGroupsInPythonExec*
├── FlatMapCoGroupsInArrowExec
└── FlatMapCoGroupsInPandasExec

*FlatMapGroupsInPythonExec*
├── FlatMapGroupsInArrowExec
└── FlatMapGroupsInPandasExec
```

To:

```
package org.apache.spark.sql.execution.python

MapInBatchExec
├── MapInPandasExec
└── *MapInArrowExec* (and *MapInArrow*)

*FlatMapCoGroupsInBatchExec*
├── FlatMapCoGroupsInArrowExec
└── FlatMapCoGroupsInPandasExec

*FlatMapGroupsInBatchExec*
├── FlatMapGroupsInArrowExec
└── FlatMapGroupsInPandasExec
```

### Why are the changes needed?

To have consistent names for Python-related execution nodes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44373 from HyukjinKwon/minor-arrow-rename.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Apr 18, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Apr 18, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Jul 18, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Jul 30, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Aug 12, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Sep 3, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Sep 16, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Dec 9, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Dec 19, 2024
Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Feb 27, 2025