[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends. #30899
Conversation
LGTM @cloud-fan fyi
* A TaskContext aware iterator.
*
* As the Python evaluation consumes the parent iterator in a separate thread,
* it could consume more data from the parent even after the task ends and the parent is closed.
Maybe add "If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor." here too? The current phrasing doesn't make clear why it is bad to read from a closed parent.
I added the sentence. Thanks!
One minor comment.
Is this targeting all release branches, `master` to `branch-2.4`?
Yes, @dongjoon-hyun. This is a partial fix, but it still bandaids the problem.
* it could consume more data from the parent even after the task ends and the parent is closed.
* Thus, we should use ContextAwareIterator to stop consuming after the task ends.
*/
class ContextAwareIterator[IN](iter: Iterator[IN], context: TaskContext) extends Iterator[IN] {
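The excerpt above cuts off at the class signature. A minimal sketch of the body it implies, assuming the guard simply consults the task state via `TaskContext.isCompleted()` and `isInterrupted()` before delegating to the parent iterator:

```scala
import org.apache.spark.TaskContext

// Sketch: wraps the parent iterator and reports exhaustion as soon as the
// owning task has completed or been interrupted, so a consumer running in a
// separate thread stops before touching freed (e.g. off-heap) buffers.
class ContextAwareIterator[IN](iter: Iterator[IN], context: TaskContext)
    extends Iterator[IN] {

  // Stop yielding elements once the task is done, even if the parent
  // iterator still has data buffered.
  override def hasNext: Boolean =
    !context.isCompleted() && !context.isInterrupted() && iter.hasNext

  override def next(): IN = iter.next()
}
```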
This looks like a general class. Can we put this into a more general package, as a separate file, instead of the `org.apache.spark.sql.execution.python` package?
I moved the class to the `org.apache.spark.util` package. cc @gatorsmile
I commented a suggestion for the package location of the new `ContextAwareIterator` class. It would be great if we could reuse this for other languages or parts in the future. cc @gatorsmile for the above package location discussion.
Test build #133261 has finished for PR 30899 at commit
* limitations under the License.
*/

package org.apache.spark.util
Shall we put this into `org.apache.spark` because we have `InterruptibleIterator` there? Also, maybe we had better add the `@DeveloperApi` annotation like `InterruptibleIterator`.
Sure, updated. Thanks!
@@ -28,10 +29,12 @@ import org.apache.spark.TaskContext
 * which crashes the executor.
 * Thus, we should use [[ContextAwareIterator]] to stop consuming after the task ends.
 */
class ContextAwareIterator[IN](iter: Iterator[IN], context: TaskContext) extends Iterator[IN] {
@DeveloperApi
class ContextAwareIterator[+T](val context: TaskContext, val delegate: Iterator[T])
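As a hypothetical sketch of how a caller might use the revised signature — the `guardedForPython` helper and its call site are illustrative, not from the PR, and assume the class still extends `Iterator[T]`:

```scala
import org.apache.spark.{ContextAwareIterator, TaskContext}

// Sketch: guard the parent iterator before handing it to the thread that
// feeds rows to the Python worker, so reads stop once the task finishes.
def guardedForPython[T](parent: Iterator[T]): Iterator[T] = {
  val context = TaskContext.get()
  new ContextAwareIterator(context, parent)
}
```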
Nice! Thank you for revising this consistently with `InterruptibleIterator`. Now it looks much better.
+1, LGTM (pending CIs). Thanks, @ueshin, @HyukjinKwon, @viirya!
Test build #133320 has finished for PR 30899 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
…g after the task ends

### What changes were proposed in this pull request?

This is a retry of #30177. This is not a complete fix; the complete fix would take a long time (#30242). As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?

A Python/Pandas UDF right after an off-heap vectorized reader could crash the executor. E.g.:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because the Python evaluation consumes the parent iterator in a separate thread, and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause a segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and tested manually.

Closes #30899 from ueshin/issues/SPARK-33277/context_aware_iterator.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 5c9b421)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Merged to master/3.1/3.0. Could you make a backporting PR for branch-2.4, @ueshin?
Thank you, @ueshin, @HyukjinKwon, @viirya.
Test build #133321 has finished for PR 30899 at commit
…suming after the task ends

### What changes were proposed in this pull request?

This is a backport of #30899. This is not a complete fix; the complete fix would take a long time (#30242). As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?

A Python/Pandas UDF right after an off-heap vectorized reader could crash the executor. E.g.:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because the Python evaluation consumes the parent iterator in a separate thread, and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause a segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and tested manually.

Closes #30913 from ueshin/issues/SPARK-33277/2.4/context_aware_iterator.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This is a retry of #30177. This is not a complete fix; the complete fix would take a long time (#30242). As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?
A Python/Pandas UDF right after an off-heap vectorized reader could crash the executor. E.g.:
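```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```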
This is because the Python evaluation consumes the parent iterator in a separate thread, and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause a segmentation fault which crashes the executor.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests, and tested manually.