
[SPARK-33277][PYSPARK][SQL][3.0] Use ContextAwareIterator to stop consuming after the task ends #30217

Conversation

ueshin (Member) commented Nov 1, 2020

What changes were proposed in this pull request?

This is a backport of #30177.

Because the Python evaluation consumes the parent iterator in a separate thread, it can keep consuming data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
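For reference, here is a minimal sketch of the idea, written against the `class ContextAwareIterator[IN](iter: Iterator[IN], context: TaskContext)` signature reported in the test results below; the guard conditions are an illustration of the approach, not necessarily the exact code in the patch:

```scala
import org.apache.spark.TaskContext

// A wrapper that stops yielding elements once the owning task has
// completed or been interrupted, so that a consumer running on another
// thread no longer touches the (possibly already closed) parent iterator.
class ContextAwareIterator[IN](iter: Iterator[IN], context: TaskContext)
  extends Iterator[IN] {

  // Report exhaustion as soon as the task is done, even if the
  // underlying iterator still has elements.
  override def hasNext: Boolean =
    !context.isCompleted() && !context.isInterrupted() && iter.hasNext

  override def next(): IN = iter.next()
}
```

Wrapped this way, the consuming thread sees the iterator as exhausted as soon as the task finishes, instead of reading from a parent whose resources may already have been freed.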

Why are the changes needed?

A Python/Pandas UDF placed right after an off-heap vectorized reader could crash the executor.

E.g.:

```py
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Write a single-partition dataset (`path` is any writable output location),
# then enable the off-heap vectorized reader.
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

# Reading back through the off-heap reader and applying the UDF can
# crash the executor with a segmentation fault.
spark.read.parquet(path).select(fUdf('id')).head()
```

This is because the Python evaluation consumes the parent iterator in a separate thread, so it can consume more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, this can cause a segmentation fault that crashes the executor.
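To see the effect in isolation, here is a small self-contained Scala analogy (no Spark required): an `AtomicBoolean` stands in for `TaskContext.isCompleted()`, and a consumer thread stands in for the Python evaluation thread. All names here are illustrative only:

```scala
import java.util.concurrent.atomic.AtomicBoolean

object ContextAwareIteratorDemo extends App {
  // Stand-in for TaskContext.isCompleted(): flips when the "task" ends.
  val taskCompleted = new AtomicBoolean(false)

  // Wrap a parent iterator so it reports exhaustion once the task ends.
  def guard[T](parent: Iterator[T]): Iterator[T] = new Iterator[T] {
    override def hasNext: Boolean = !taskCompleted.get() && parent.hasNext
    override def next(): T = parent.next()
  }

  val parent = Iterator.range(0, 1000000)

  // The consumer thread plays the role of the Python evaluation thread.
  val consumer = new Thread(() => guard(parent).foreach(_ => Thread.sleep(1)))
  consumer.start()

  Thread.sleep(50)
  taskCompleted.set(true) // the task ends; parent resources may now be freed
  consumer.join()         // the consumer exits instead of reading further
  println("consumer stopped after task completion")
}
```

Without the guard, the consumer thread would keep calling `parent.next()` after the flag flips, which in the real executor means touching freed off-heap memory.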

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests, and tested manually.

SparkQA commented Nov 1, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35104/

SparkQA commented Nov 1, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35104/

HyukjinKwon (Member) commented

Merged to branch-3.0

HyukjinKwon pushed a commit that referenced this pull request Nov 2, 2020
[SPARK-33277][PYSPARK][SQL][3.0] Use ContextAwareIterator to stop consuming after the task ends


Closes #30217 from ueshin/issues/SPARK-33277/3.0/python_pandas_udf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon closed this Nov 2, 2020
SparkQA commented Nov 2, 2020

Test build #130500 has finished for PR 30217 at commit a2680b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • `class ContextAwareIterator[IN](iter: Iterator[IN], context: TaskContext) extends Iterator[IN]`
