-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
udf_apply_feature_dataframe UDF in executor? #458
Comments
as comparison, here is UDF usage with openeo s2_cube = connection.load_collection(
"TERRASCOPE_S2_TOC_V2",
spatial_extent={"west": 4.00, "south": 51.00, "east": 4.01, "north": 51.01},
temporal_extent=["2022-03-01", "2022-03-31"],
bands=["B02"]
)
udf = openeo.UDF("""
import pyspark
from openeo.udf import XarrayDataCube
def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
# Executor detection based on pyspark.SparkContext._assert_on_driver
in_executor = (pyspark.TaskContext.get() is not None)
raise ValueError(f"{in_executor=}")
""")
rescaled = s2_cube.apply(process=udf)
rescaled.download("udf-in-executor-apply_datacube-tmp.nc") which fails with |
Note that spark.pandas performs schema inference, by running the apply on the first rows. |
I'm switching over the implementation to use plain RDD's. These do not require to deal with output schemas. Conversion to pandas structures is now done inside the callback. |
(I stumbled on this issue while working on #437 / Open-EO/openeo-python-driver#197)
#251 / #262 added parallelized UDF execution on vector cubes (
udf_apply_feature_dataframe
andudf_apply_udf_data
entrypoints), as documented at https://github.com/Open-EO/openeo-geopyspark-driver/blob/1f0ad56cc749d9f3ade315a85f39f1200f74168c/docs/vectorcube-run_udf.md . The idea was to get parallelization and executor isolation automatically by using the pyspark.pandas withapply
However, it seems that a pyspark.pandas
apply
callback does not run in the executors, but just in the driver.example snippet to illustrate:
This fails with:
Internal: Server error: ValueError('in_executor=False')
indicating the callback did not run in executorThe text was updated successfully, but these errors were encountered: