
[SPARK-39979][SQL][FOLLOW-UP] Support large variable types in pandas UDF, createDataFrame and toPandas with Arrow #41569

Closed
wants to merge 2 commits

Conversation

@HyukjinKwon (Member) commented Jun 13, 2023

What changes were proposed in this pull request?

This PR is a follow-up of #39572 that implements the use of large variable types everywhere within PySpark.

#39572 implemented the core logic, but it only supports large variable types in the bold cases below:

- `mapInArrow`: **JVM -> Python -> JVM**
- Pandas UDF/Function API: **JVM -> Python** -> JVM
- createDataFrame with Arrow: Python -> JVM
- toPandas with Arrow: JVM -> Python

This PR completes them all.

Why are the changes needed?

To consistently support the large variable types.

Does this PR introduce any user-facing change?

`spark.sql.execution.arrow.useLargeVarTypes` has not been released yet, so it does not affect any end users.

How was this patch tested?

Existing tests with `spark.sql.execution.arrow.useLargeVarTypes` enabled.

@Kimahriman (Contributor)

Yeah, I was trying to understand all the possible flows and knew it definitely wasn't covering the Python-originated ones. It looked like everything originating from the JVM detected the schema from the Arrow data coming back. But you're right about the Pandas UDF one: it's the special case that does the pandas-to-Arrow conversion based on the Python `to_arrow_type`. Thanks for the follow-up here; it's definitely useful to have this for all those cases as well.

Sorta unrelated, but I'm curious whether you think this should ever be enabled by default, since most of the rest of Spark tries to avoid 2 GiB limits wherever possible?

@HyukjinKwon (Member Author)

I actually would personally like to enable this by default, but one concern is people who rely on regular strings instead of large variable types... so I'm not sure.

@Kimahriman (Contributor)

I wonder what downstream effects there could be. I would think it would be mostly transparent; the main question, I guess, is whether downstream consumers of Arrow (things like Polars?) support the large variable-width types.

@HyukjinKwon (Member Author)

Yeah, that's my thought as well. It would only matter if you use Arrow batches directly; otherwise, I don't think it matters much.

@HyukjinKwon (Member Author)

Okay, I think this is blocked by apache/arrow#35289: NumPy <> Arrow conversion for large variable types is not implemented.

@HyukjinKwon (Member Author)

Let me drop this PR and make the `spark.sql.execution.arrow.useLargeVarTypes` configuration an internal one for now, since this feature is not complete.

@Kimahriman (Contributor)

Any reason not to keep going with this PR and just leave it disabled by default?

Also, what exactly are the cases where that NumPy limitation could cause a problem? Does createDataFrame with Arrow accept NumPy arrays that get converted for you?

dongjoon-hyun pushed a commit that referenced this pull request Jun 14, 2023
…eVarTypes` as an internal configuration

### What changes were proposed in this pull request?

This PR is a follow-up of #39572 that marks the `spark.sql.execution.arrow.useLargeVarTypes` configuration as internal.

### Why are the changes needed?

As described in #41569, this feature only works for `mapInArrow`; other cases cannot be completely supported because of an Arrow-side limitation (see apache/arrow#35289). Therefore, this PR marks this configuration as internal for now.

### Does this PR introduce _any_ user-facing change?

No, this configuration has not been released yet.

### How was this patch tested?

Ran the Scala linter.

Closes #41584 from HyukjinKwon/SPARK-39979-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@HyukjinKwon (Member Author)

There are test failures in https://github.com/HyukjinKwon/spark/runs/14214694569. Basically, there is no implementation of large variable types in NumPy, so you can't create a pandas DataFrame from PyArrow with large variable types.

@Kimahriman (Contributor)

Ah right, I kind of forgot pandas series are essentially NumPy arrays.

@Kimahriman (Contributor)

Would be great to get some looks at #38624. Apply only has a pandas version right now, so there's no way around the 2 GiB limitation AFAIK.

czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
@Kimahriman (Contributor)

Attempted a PR for the Arrow issue: apache/arrow#36701. Though after doing some digging, I think that was only causing one test to fail, a weird case of trying to convert a double to a string as part of the Arrow conversion. Arrow already supports converting pandas series of strings to the large_string type (when the NumPy type is object), but not a NumPy string list (when the NumPy type is utf8). The former goes through https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/numpy_to_arrow.cc#L324C9-L324C26 instead of the other Visit paths.

The other test failures were just due to Arrow not having large-type support when looking up the NumPy type for an Arrow type (also added that to the above PR). That can be fixed on the Spark side by just using np.object explicitly for string and binary types, but I'm hitting a weird new test issue I'm trying to figure out.

@Kimahriman (Contributor)

Ah, it's because ArrowConverters doesn't check the config for large types in the WithoutSchema paths, so you get a mismatch in the Java vector allocated for the record batch.

@HyukjinKwon deleted the SPARK-39979-followup branch January 15, 2024 00:52
HyukjinKwon added a commit that referenced this pull request Feb 5, 2025
…ateDataFrame and toPandas with Arrow

### What changes were proposed in this pull request?

This PR is a retry of #41569 that implements the use of large variable types everywhere within PySpark.

#39572 implemented the core logic, but it only supports large variable types in the bold cases below:

- `mapInArrow`: **JVM -> Python -> JVM**
- Pandas UDF/Function API: **JVM -> Python** -> JVM
- createDataFrame with Arrow: Python -> JVM
- toPandas with Arrow: JVM -> Python

This PR completes them all.

### Why are the changes needed?

To consistently support the large variable types.

### Does this PR introduce _any_ user-facing change?

`spark.sql.execution.arrow.useLargeVarTypes` has not been released yet, so it does not affect any end users.

### How was this patch tested?

Existing tests with `spark.sql.execution.arrow.useLargeVarTypes` enabled.

Closes #49790 from HyukjinKwon/SPARK-39979-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Feb 5, 2025
(cherry picked from commit e2ef5a4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
zecookiez pushed a commit to zecookiez/spark that referenced this pull request Feb 6, 2025