[SPARK-33277][PYSPARK][SQL][2.4] Use ContextAwareIterator to stop consuming after the task ends. #30913

ueshin · 2020-12-23T23:02:30Z

What changes were proposed in this pull request?

This is a backport of #30899.

This is not a complete fix; a complete fix would take a long time (#30242).
As discussed offline, using ContextAwareIterator should at least help in many cases.

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use ContextAwareIterator to stop consuming after the task ends.
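
The idea can be illustrated with a small Python analogue. This is only a sketch: the real ContextAwareIterator is a Scala class inside Spark, and the names `FakeTaskContext` and `ContextAwareIter` below are hypothetical stand-ins, not Spark APIs. The wrapper checks the task state before every pull from the delegate, so it never touches the parent once the task has ended.

```python
from typing import Iterator, TypeVar

T = TypeVar("T")


class FakeTaskContext:
    """Hypothetical stand-in for a task context; not a Spark API."""

    def __init__(self) -> None:
        self.completed = False
        self.interrupted = False


class ContextAwareIter(Iterator[T]):
    """Stops yielding as soon as the task is completed or interrupted."""

    def __init__(self, context: FakeTaskContext, delegate: Iterator[T]) -> None:
        self.context = context
        self.delegate = delegate

    def __iter__(self) -> "ContextAwareIter[T]":
        return self

    def __next__(self) -> T:
        # Check the task state first so the delegate is not touched
        # after the task has ended.
        if self.context.completed or self.context.interrupted:
            raise StopIteration
        return next(self.delegate)


ctx = FakeTaskContext()
it = ContextAwareIter(ctx, iter(range(5)))
first_two = [next(it), next(it)]  # consumed while the task is running
ctx.completed = True              # the task ends
rest = list(it)                   # nothing more is pulled from the delegate
```

In this sketch `first_two` holds the two elements read while the task was running and `rest` is empty, even though the delegate still has elements left.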

Why are the changes needed?

A Python/Pandas UDF placed right after the off-heap vectorized reader can crash the executor.

For example:

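from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# `path` is assumed to be a writable location chosen by the user.
spark.range(0, 100000, 1, 1).write.parquet(path)

# Enable off-heap column vectors for the vectorized reader.
spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

# head() needs only the first row, so the task can end before the input is fully read.
spark.read.parquet(path).select(fUdf('id')).head()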

This happens because the Python evaluation consumes the parent iterator in a separate thread, and that thread can keep consuming data from the parent even after the task ends and the parent is closed. If the parent iterator holds an off-heap column vector, reading from it after it is closed can cause a segmentation fault, which crashes the executor.
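
The race can be simulated in plain Python. This is a hedged illustration, not Spark code: `ClosableSource` and `simulate` are hypothetical names, and a read after `close()` is merely counted here, whereas in the JVM with off-heap memory it may touch freed memory. Events are used to force the ordering "consumer reads once, task ends, consumer resumes" so the outcome is deterministic.

```python
import threading


class ClosableSource:
    """Hypothetical parent iterator that records reads made after close()."""

    def __init__(self, n):
        self._values = iter(range(n))
        self.closed = False
        self.reads_after_close = 0

    def close(self):
        self.closed = True

    def __iter__(self):
        return self

    def __next__(self):
        if self.closed:
            # Stand-in for the unsafe access described above.
            self.reads_after_close += 1
            raise StopIteration
        return next(self._values)


def simulate(guarded):
    """Return how many times the consumer thread read a closed source."""
    source = ClosableSource(100)
    task_done = threading.Event()   # plays the role of "the task has ended"
    first_read = threading.Event()  # consumer has started reading
    resume = threading.Event()      # lets the consumer continue after close()

    def consume():
        next(source)
        first_read.set()
        resume.wait()
        while True:
            # The guard mirrors the check a context-aware wrapper would do.
            if guarded and task_done.is_set():
                break
            try:
                next(source)
            except StopIteration:
                break

    worker = threading.Thread(target=consume)
    worker.start()
    first_read.wait()
    task_done.set()   # the task ends ...
    source.close()    # ... and the parent is closed
    resume.set()
    worker.join()
    return source.reads_after_close


unguarded_reads = simulate(guarded=False)
guarded_reads = simulate(guarded=True)
```

Without the guard the consumer thread reads the closed source once before stopping; with the guard it checks the task state first and never touches it.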

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests, and also tested manually.

…g after the task ends

This is a retry of apache#30177.

dongjoon-hyun

+1, LGTM (Pending CIs).
Thank you, @ueshin and @viirya.

HyukjinKwon · 2020-12-24T01:40:17Z

Let me merge this into branch-2.4. The SparkR tests are unlikely to be related to this change, and all relevant tests passed.

HyukjinKwon · 2020-12-24T01:40:34Z

Merged to branch-2.4.

…suming after the task ends

Closes #30913 from ueshin/issues/SPARK-33277/2.4/context_aware_iterator.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

SparkQA · 2020-12-24T01:56:17Z

Test build #133326 has finished for PR 30913 at commit 3c47d96.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ContextAwareIterator[+T](val context: TaskContext, val delegate: Iterator[T])

ueshin requested review from viirya, HyukjinKwon and dongjoon-hyun December 23, 2020 23:02

viirya approved these changes Dec 23, 2020

dongjoon-hyun approved these changes Dec 23, 2020

HyukjinKwon approved these changes Dec 24, 2020

HyukjinKwon closed this Dec 24, 2020
