[SPARK-39979][SQL][FOLLOW-UP] Support large variable types in pandas UDF, createDataFrame and toPandas with Arrow #41569
Conversation
Yeah, I was trying to understand all the possible flows and knew it wasn't covering the Python-originated ones. It looked like everything originating from the JVM detected the schema from the Arrow data coming back. But you're right about the pandas UDF one; it's the special case that does the pandas-to-Arrow conversion based on the Python `to_arrow_type`. Thanks for the follow-up here, it's definitely useful to have it for all those cases as well. Sort of unrelated, but curious whether you think this should ever be enabled by default, since most of the rest of Spark tries to avoid 2GiB limits everywhere else?
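(For reference, the conversion mentioned above lives in `pyspark.sql.pandas.types.to_arrow_type`. Here's a minimal sketch of the idea only; the helper name and the `use_large_var_types` flag are illustrative, not the actual signature.)

```python
import pyarrow as pa
from pyspark.sql.types import BinaryType, DataType, StringType

def to_arrow_type_sketch(dt: DataType, use_large_var_types: bool = False) -> pa.DataType:
    """Illustrative mapping from Spark SQL types to Arrow types."""
    if isinstance(dt, StringType):
        # Large variable types use 64-bit offsets, lifting the 2GiB-per-array limit.
        return pa.large_string() if use_large_var_types else pa.string()
    if isinstance(dt, BinaryType):
        return pa.large_binary() if use_large_var_types else pa.binary()
    raise TypeError(f"type not covered by this sketch: {dt}")
```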
I'd actually personally like to enable this by default ... but one concern is people who rely on the regular string types instead of the large variable types ... so I'm not sure ...
I wonder what downstream effects there could be? I would think it would be mostly transparent; the main question, I guess, is whether downstream consumers of Arrow (things like Polars?) support the large variable-width types.
Yeah, that's the same thought from me. It'd only matter if you use the Arrow batches directly; otherwise, I don't think it matters much.
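(As an illustration of "using Arrow batches directly": with the config on, a `mapInArrow` consumer sees `large_string` columns. A sketch assuming a running SparkSession `spark`.)

```python
import pyarrow as pa

spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", "true")
df = spark.createDataFrame([("a",), ("b",)], "s string")

def inspect(batches):
    for batch in batches:
        # With useLargeVarTypes enabled, string columns arrive as
        # large_string (64-bit offsets) rather than string (32-bit offsets).
        assert batch.schema.field("s").type == pa.large_string()
        yield batch

df.mapInArrow(inspect, df.schema).collect()
```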
Okay, I think this is blocked by apache/arrow#35289. NumPy <> Arrow large variable types are not implemented.
Let me drop this PR, and make `spark.sql.execution.arrow.useLargeVarTypes` an internal configuration.
Any reason not to keep going with this PR and just leave it disabled by default? Also, what exactly are the cases where that NumPy limitation could cause a problem? Does createDataFrame with Arrow accept NumPy arrays that get converted for you?
…eVarTypes` as an internal configuration

### What changes were proposed in this pull request?

This PR is a followup of #39572 that hides the `spark.sql.execution.arrow.useLargeVarTypes` configuration as an internal configuration.

### Why are the changes needed?

As described in #41569, this feature only works for `mapInArrow`, and the other cases cannot be completely supported because of an Arrow-side limitation, see apache/arrow#35289. Therefore, this PR hides this configuration as an internal one for now.

### Does this PR introduce _any_ user-facing change?

No, this configuration has not been released yet.

### How was this patch tested?

Ran the Scala linter.

Closes #41584 from HyukjinKwon/SPARK-39979-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
There are test failures in https://github.com/HyukjinKwon/spark/runs/14214694569. Basically, there's no implementation of large var types in NumPy, so you can't create a pandas DataFrame from PyArrow with large var types.
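(A hedged repro of that limitation, per the behavior described above for the Arrow versions in question:)

```python
import pyarrow as pa

tbl = pa.table({"s": pa.array(["a", "b"], type=pa.large_string())})

# In the Arrow versions discussed here (see apache/arrow#35289), this
# conversion failed: NumPy, and hence pandas, had no implementation
# for the large variable-width types.
pdf = tbl.to_pandas()
```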
Ah right, I kinda forgot pandas Series are essentially NumPy arrays.
Would be great to get some looks at #38624. Apply only has a pandas version right now, so there's no way around the 2GiB limitation AFAIK.
Attempted a PR for the Arrow issue: apache/arrow#36701. Though after doing some digging, I think that was only causing one test to fail, a weird case of trying to convert a double to a string as part of the Arrow conversion. Arrow already supports converting pandas Series of strings to the large_string type (when the NumPy dtype is object), but not a NumPy string array (when the NumPy dtype is unicode). The former goes through https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/numpy_to_arrow.cc#L324C9-L324C26 instead of the other path. The other test failures were just due to Arrow not having large-type support when looking up the NumPy type for an Arrow type (also added that to the above PR). That can be fixed on the Spark side by just using np.object explicitly for string and binary types, but I'm hitting a weird new test issue I'm trying to figure out.
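(A short illustration of the two NumPy input paths described above; exact behavior depends on the Arrow version:)

```python
import numpy as np
import pyarrow as pa

# Path 1: object-dtype strings -- Arrow already supported converting
# these to large_string.
pa.array(np.array(["a", "bb"], dtype=object), type=pa.large_string())

# Path 2: fixed-width NumPy unicode (dtype '<U2') -- the path that
# lacked large-type support at the time of this discussion.
pa.array(np.array(["a", "bb"]), type=pa.large_string())
```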
Ah, it's because …
…ateDataFrame and toPandas with Arrow

### What changes were proposed in this pull request?

This PR is a retry of #41569 that implements the use of large variable types everywhere within PySpark. #39572 implemented the core logic, but it only supports large variable types in the bold cases below:

- `mapInArrow`: **JVM -> Python -> JVM**
- Pandas UDF/Function API: **JVM -> Python** -> JVM
- createDataFrame with Arrow: Python -> JVM
- toPandas with Arrow: JVM -> Python

This PR completes them all.

### Why are the changes needed?

To consistently support the large variable types.

### Does this PR introduce _any_ user-facing change?

`spark.sql.execution.arrow.useLargeVarTypes` has not been released yet, so it doesn't affect any end users.

### How was this patch tested?

Existing tests with `spark.sql.execution.arrow.useLargeVarTypes` enabled.

Closes #49790 from HyukjinKwon/SPARK-39979-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR is a followup of #39572 that implements the use of large variable types everywhere within PySpark. #39572 implemented the core logic, but it only supports large variable types in the bold cases below:

- `mapInArrow`: **JVM -> Python -> JVM**
- Pandas UDF/Function API: **JVM -> Python** -> JVM
- createDataFrame with Arrow: Python -> JVM
- toPandas with Arrow: JVM -> Python

This PR completes them all.

### Why are the changes needed?

To consistently support the large variable types.

### Does this PR introduce _any_ user-facing change?

`spark.sql.execution.arrow.useLargeVarTypes` has not been released yet, so it doesn't affect any end users.

### How was this patch tested?

Existing tests with `spark.sql.execution.arrow.useLargeVarTypes` enabled.
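A usage sketch covering the newly completed flows (illustrative only; assumes a running SparkSession `spark`):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", "true")

# createDataFrame with Arrow: Python -> JVM
df = spark.createDataFrame(pd.DataFrame({"s": ["a", "b"]}))

# Pandas UDF: JVM -> Python -> JVM
@pandas_udf("string")
def upper(s: pd.Series) -> pd.Series:
    return s.str.upper()

# toPandas with Arrow: JVM -> Python
pdf = df.select(upper("s").alias("u")).toPandas()
```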