[SPARK-39979][SQL][FOLLOW-UP] Support large variable types in pandas UDF, createDataFrame and toPandas with Arrow #41569
Conversation
Yeah, I was trying to understand all the possible flows and knew it wasn't covering the Python-originated ones. It looked like everything originating from the JVM detected the schema from the Arrow data coming back. But you're right about the pandas UDF one; it's the special case that does the pandas-to-Arrow conversion based on the Python `to_arrow_type`. Thanks for the follow-up here, it's definitely useful to have it for all those cases as well. Sort of unrelated, but curious whether you think this should ever be enabled by default, since most of the rest of Spark tries to avoid 2GiB limits everywhere else?
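(For reference, the conversion mentioned above lives in `pyspark.sql.pandas.types.to_arrow_type`. Here's a minimal sketch of the idea only; the helper name and the `use_large_var_types` flag are illustrative, not the actual signature.)

```python
import pyarrow as pa
from pyspark.sql.types import BinaryType, DataType, StringType

def to_arrow_type_sketch(dt: DataType, use_large_var_types: bool = False) -> pa.DataType:
    """Illustrative mapping from Spark SQL types to Arrow types."""
    if isinstance(dt, StringType):
        # Large variable types use 64-bit offsets, lifting the 2GiB-per-array limit.
        return pa.large_string() if use_large_var_types else pa.string()
    if isinstance(dt, BinaryType):
        return pa.large_binary() if use_large_var_types else pa.binary()
    raise TypeError(f"type not covered by this sketch: {dt}")
```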
I'd actually personally like to enable this by default ... but one concern is people who rely on the regular string types instead of the large variable types ... so I'm not sure ...
I wonder what downstream effects there could be? I would think it would be mostly transparent; the main question, I guess, is whether downstream consumers of Arrow (things like Polars?) support the large variable-width types.
Yeah, that's the same thought from me. It'd only matter if you use the Arrow batches directly; otherwise, I don't think it matters much.
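(As an illustration of "using Arrow batches directly": with the config on, a `mapInArrow` consumer sees `large_string` columns. A sketch assuming a running SparkSession `spark`.)

```python
import pyarrow as pa

spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", "true")
df = spark.createDataFrame([("a",), ("b",)], "s string")

def inspect(batches):
    for batch in batches:
        # With useLargeVarTypes enabled, string columns arrive as
        # large_string (64-bit offsets) rather than string (32-bit offsets).
        assert batch.schema.field("s").type == pa.large_string()
        yield batch

df.mapInArrow(inspect, df.schema).collect()
```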
Okay, I think this is blocked by apache/arrow#35289. NumPy <> Arrow large variable types are not implemented.
Let me drop this PR, and make `spark.sql.execution.arrow.useLargeVarTypes` an internal configuration.
Any reason not to keep going with this PR and just leave it disabled by default? Also, what exactly are the cases where that NumPy limitation could cause a problem? Does createDataFrame with Arrow accept NumPy arrays that get converted for you?
…eVarTypes` as an internal configuration

### What changes were proposed in this pull request?

This PR is a followup of #39572 that hides the `spark.sql.execution.arrow.useLargeVarTypes` configuration as an internal configuration.

### Why are the changes needed?

As described in #41569, this feature only works for `mapInArrow`, and the other cases cannot be completely supported because of an Arrow-side limitation, see apache/arrow#35289. Therefore, this PR hides this configuration as an internal one for now.

### Does this PR introduce _any_ user-facing change?

No, this configuration has not been released yet.

### How was this patch tested?

Ran the Scala linter.

Closes #41584 from HyukjinKwon/SPARK-39979-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
There are test failures in https://github.com/HyukjinKwon/spark/runs/14214694569. Basically, there's no implementation of large var types in NumPy, so you can't create a pandas DataFrame from PyArrow with large var types.
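(A hedged repro of that limitation, per the behavior described above for the Arrow versions in question:)

```python
import pyarrow as pa

tbl = pa.table({"s": pa.array(["a", "b"], type=pa.large_string())})

# In the Arrow versions discussed here (see apache/arrow#35289), this
# conversion failed: NumPy, and hence pandas, had no implementation
# for the large variable-width types.
pdf = tbl.to_pandas()
```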
Ah right, I kinda forgot pandas Series are essentially NumPy arrays.
Would be great to get some looks at #38624. Apply only has a pandas version right now, so there's no way around the 2GiB limitation AFAIK.
Attempted a PR for the Arrow issue: apache/arrow#36701. Though after doing some digging, I think that was only causing one test to fail, a weird case of trying to convert a double to a string as part of the Arrow conversion. Arrow already supports converting pandas Series of strings to the large_string type (when the NumPy dtype is object), but not a NumPy string array (when the NumPy dtype is unicode). The former goes through https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/numpy_to_arrow.cc#L324C9-L324C26 instead of the other path. The other test failures were just due to Arrow not having large-type support when looking up the NumPy type for an Arrow type (also added that to the above PR). That can be fixed on the Spark side by just using np.object explicitly for string and binary types, but I'm hitting a weird new test issue I'm trying to figure out.
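(A short illustration of the two NumPy input paths described above; exact behavior depends on the Arrow version:)

```python
import numpy as np
import pyarrow as pa

# Path 1: object-dtype strings -- Arrow already supported converting
# these to large_string.
pa.array(np.array(["a", "bb"], dtype=object), type=pa.large_string())

# Path 2: fixed-width NumPy unicode (dtype '<U2') -- the path that
# lacked large-type support at the time of this discussion.
pa.array(np.array(["a", "bb"]), type=pa.large_string())
```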
Ah, it's because …
…ateDataFrame and toPandas with Arrow

### What changes were proposed in this pull request?

This PR is a retry of #41569 that implements the use of large variable types everywhere within PySpark. #39572 implemented the core logic, but it only supports large variable types in the bold cases below:

- `mapInArrow`: **JVM -> Python -> JVM**
- Pandas UDF/Function API: **JVM -> Python** -> JVM
- createDataFrame with Arrow: Python -> JVM
- toPandas with Arrow: JVM -> Python

This PR completes them all.

### Why are the changes needed?

To consistently support the large variable types.

### Does this PR introduce _any_ user-facing change?

`spark.sql.execution.arrow.useLargeVarTypes` has not been released yet, so it doesn't affect any end users.

### How was this patch tested?

Existing tests with `spark.sql.execution.arrow.useLargeVarTypes` enabled.

Closes #49790 from HyukjinKwon/SPARK-39979-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR is a followup of #39572 that implements the use of large variable types everywhere within PySpark. #39572 implemented the core logic, but it only supports large variable types in the bold cases below:

- `mapInArrow`: **JVM -> Python -> JVM**
- Pandas UDF/Function API: **JVM -> Python** -> JVM
- createDataFrame with Arrow: Python -> JVM
- toPandas with Arrow: JVM -> Python

This PR completes them all.

### Why are the changes needed?

To consistently support the large variable types.

### Does this PR introduce _any_ user-facing change?

`spark.sql.execution.arrow.useLargeVarTypes` has not been released yet, so it doesn't affect any end users.

### How was this patch tested?

Existing tests with `spark.sql.execution.arrow.useLargeVarTypes` enabled.
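A usage sketch covering the newly completed flows (illustrative only; assumes a running SparkSession `spark`):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", "true")

# createDataFrame with Arrow: Python -> JVM
df = spark.createDataFrame(pd.DataFrame({"s": ["a", "b"]}))

# Pandas UDF: JVM -> Python -> JVM
@pandas_udf("string")
def upper(s: pd.Series) -> pd.Series:
    return s.str.upper()

# toPandas with Arrow: JVM -> Python
pdf = df.select(upper("s").alias("u")).toPandas()
```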