[SPARK-49808][SQL] Fix a deadlock in subquery execution due to lazy vals #48279

zhengruifeng · 2024-09-27T08:22:09Z

What changes were proposed in this pull request?

Fix a deadlock in subquery execution due to lazy vals

Why are the changes needed?

we observed a deadlock between QueryPlan.canonicalized and QueryPlan.references:

"ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec@87abc7f owned by "subquery-5" Id=112
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684)
	-  blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec@87abc7f
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716)
	...


"subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec@132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101)
	-  blocked on org.apache.spark.sql.execution.WholeStageCodegenExec@132a3243
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157)
	at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53)

The main thread TakeOrderedAndProject.doExecute is trying to compute outputOrdering, it top-down traverse the tree, and requires the lock of QueryPlan.canonicalized in the path.
In this deadlock, it successfully obtained the lock of WholeStageCodegenExec and requires the lock of HashAggregateExec;

Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via def consume, which checks WholeStageCodegenExec.usedInputs and refererences a lazy val QueryPlan.references. It requires the lock of QueryPlan.references in the path.
In this deadlock, it successfully obtained the lock of HashAggregateExec and requires the lock of WholeStageCodegenExec;

This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks.

Does this PR introduce any user-facing change?

no

How was this patch tested?

manually checked with SubquerySuite

we encountered this issue multiple times before this fix in SubquerySuite, and after this fix we didn't hit this issue in multiple runs.

Was this patch authored or co-authored using generative AI tooling?

no

cloud-fan · 2024-09-27T09:32:32Z

We should say more about how the dead lock happens, e.g. the order how these two threads obtain locks.

zhengruifeng · 2024-09-27T10:08:56Z

We should say more about how the dead lock happens, e.g. the order how these two threads obtain locks.

done, updated in the pr desc

cloud-fan · 2024-09-27T13:52:39Z

thanks, merging to master!

### What changes were proposed in this pull request? Fix a deadlock in subquery execution due to lazy vals ### Why are the changes needed? we observed a deadlock between `QueryPlan.canonicalized` and `QueryPlan.references`: ``` 24/09/04 04:46:54 ERROR DeadlockDetector: Found 2 new deadlock thread(s): "ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f owned by "subquery-5" Id=112 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684) - blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716) at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716) ... "subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101) - blocked on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187) at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157) at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53) ``` The main thread `TakeOrderedAndProject.doExecute` is trying to compute `outputOrdering`, it top-down traverse the tree, and requires the lock of `QueryPlan.canonicalized` in the path. In this deadlock, it successfully obtained the lock of `WholeStageCodegenExec` and requires the lock of `HashAggregateExec`; Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via `def consume`, which checks `WholeStageCodegenExec.usedInputs` and refererences a lazy val `QueryPlan.references`. It requires the lock of `QueryPlan.references` in the path. In this deadlock, it successfully obtained the lock of `HashAggregateExec` and requires the lock of `WholeStageCodegenExec`; This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manually checked with `com.databricks.spark.sql.SubquerySuite` we encountered this issue multiple times before this fix in `SubquerySuite`, and after this fix we didn't hit this issue in multiple runs. ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48279 from zhengruifeng/fix_deadlock. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

zhengruifeng · 2024-10-07T02:45:53Z

It needs more discussion on how to resolve this issue, let me revert it for now

zhengruifeng · 2024-10-07T02:46:51Z

will be reverted in #48362

### What changes were proposed in this pull request? Fix a deadlock in subquery execution due to lazy vals ### Why are the changes needed? we observed a deadlock between `QueryPlan.canonicalized` and `QueryPlan.references`: ``` 24/09/04 04:46:54 ERROR DeadlockDetector: Found 2 new deadlock thread(s): "ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f owned by "subquery-5" Id=112 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684) - blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716) at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716) ... "subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101) - blocked on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187) at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157) at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53) ``` The main thread `TakeOrderedAndProject.doExecute` is trying to compute `outputOrdering`, it top-down traverse the tree, and requires the lock of `QueryPlan.canonicalized` in the path. In this deadlock, it successfully obtained the lock of `WholeStageCodegenExec` and requires the lock of `HashAggregateExec`; Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via `def consume`, which checks `WholeStageCodegenExec.usedInputs` and refererences a lazy val `QueryPlan.references`. It requires the lock of `QueryPlan.references` in the path. In this deadlock, it successfully obtained the lock of `HashAggregateExec` and requires the lock of `WholeStageCodegenExec`; This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manually checked with `com.databricks.spark.sql.SubquerySuite` we encountered this issue multiple times before this fix in `SubquerySuite`, and after this fix we didn't hit this issue in multiple runs. ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48279 from zhengruifeng/fix_deadlock. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

fix

7aaa6b3

github-actions bot added the SQL label Sep 27, 2024

zhengruifeng requested review from JoshRosen and cloud-fan September 27, 2024 08:22

cloud-fan closed this in d7abddc Sep 27, 2024

zhengruifeng deleted the fix_deadlock branch September 27, 2024 13:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-49808][SQL] Fix a deadlock in subquery execution due to lazy vals #48279

[SPARK-49808][SQL] Fix a deadlock in subquery execution due to lazy vals #48279

zhengruifeng commented Sep 27, 2024 •

edited by cloud-fan

Loading

cloud-fan commented Sep 27, 2024

zhengruifeng commented Sep 27, 2024

cloud-fan commented Sep 27, 2024

zhengruifeng commented Oct 7, 2024

zhengruifeng commented Oct 7, 2024

[SPARK-49808][SQL] Fix a deadlock in subquery execution due to lazy vals #48279

[SPARK-49808][SQL] Fix a deadlock in subquery execution due to lazy vals #48279

Conversation

zhengruifeng commented Sep 27, 2024 • edited by cloud-fan Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

cloud-fan commented Sep 27, 2024

zhengruifeng commented Sep 27, 2024

cloud-fan commented Sep 27, 2024

zhengruifeng commented Oct 7, 2024

zhengruifeng commented Oct 7, 2024

zhengruifeng commented Sep 27, 2024 •

edited by cloud-fan

Loading