Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49808][SQL] Fix a deadlock in subquery execution due to lazy vals #48279

Closed
wants to merge 1 commit into from

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Sep 27, 2024

What changes were proposed in this pull request?

Fix a deadlock in subquery execution due to lazy vals

Why are the changes needed?

we observed a deadlock between QueryPlan.canonicalized and QueryPlan.references:

"ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec@87abc7f owned by "subquery-5" Id=112
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684)
	-  blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec@87abc7f
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716)
	...


"subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec@132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101)
	-  blocked on org.apache.spark.sql.execution.WholeStageCodegenExec@132a3243
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157)
	at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53)

The main thread TakeOrderedAndProject.doExecute is trying to compute outputOrdering, it top-down traverse the tree, and requires the lock of QueryPlan.canonicalized in the path.
In this deadlock, it successfully obtained the lock of WholeStageCodegenExec and requires the lock of HashAggregateExec;

Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via def consume, which checks WholeStageCodegenExec.usedInputs and refererences a lazy val QueryPlan.references. It requires the lock of QueryPlan.references in the path.
In this deadlock, it successfully obtained the lock of HashAggregateExec and requires the lock of WholeStageCodegenExec;

This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks.

Does this PR introduce any user-facing change?

no

How was this patch tested?

manually checked with SubquerySuite

we encountered this issue multiple times before this fix in SubquerySuite, and after this fix we didn't hit this issue in multiple runs.

Was this patch authored or co-authored using generative AI tooling?

no

@cloud-fan
Copy link
Contributor

We should say more about how the dead lock happens, e.g. the order how these two threads obtain locks.

@zhengruifeng
Copy link
Contributor Author

We should say more about how the dead lock happens, e.g. the order how these two threads obtain locks.

done, updated in the pr desc

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d7abddc Sep 27, 2024
@zhengruifeng zhengruifeng deleted the fix_deadlock branch September 27, 2024 13:57
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?
Fix a deadlock in subquery execution due to lazy vals

### Why are the changes needed?
we observed a deadlock between `QueryPlan.canonicalized` and `QueryPlan.references`:
```
24/09/04 04:46:54 ERROR DeadlockDetector: Found 2 new deadlock thread(s):
"ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f owned by "subquery-5" Id=112
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684)
	-  blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716)
	...

"subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101)
	-  blocked on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157)
	at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53)
```

The main thread `TakeOrderedAndProject.doExecute` is trying to compute `outputOrdering`, it top-down traverse the tree, and requires the lock of `QueryPlan.canonicalized` in the path.
In this deadlock, it successfully obtained the lock of `WholeStageCodegenExec` and requires the lock of `HashAggregateExec`;

Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via `def consume`, which checks `WholeStageCodegenExec.usedInputs` and refererences a lazy val `QueryPlan.references`. It requires the lock of `QueryPlan.references` in the path.
In this deadlock, it successfully obtained the lock of `HashAggregateExec` and requires the lock of `WholeStageCodegenExec`;

This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
manually checked with `com.databricks.spark.sql.SubquerySuite`

we encountered this issue multiple times before this fix in `SubquerySuite`, and after this fix we didn't hit this issue in multiple runs.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48279 from zhengruifeng/fix_deadlock.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@zhengruifeng
Copy link
Contributor Author

It needs more discussion on how to resolve this issue, let me revert it for now

@zhengruifeng
Copy link
Contributor Author

will be reverted in #48362

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
### What changes were proposed in this pull request?
Fix a deadlock in subquery execution due to lazy vals

### Why are the changes needed?
we observed a deadlock between `QueryPlan.canonicalized` and `QueryPlan.references`:
```
24/09/04 04:46:54 ERROR DeadlockDetector: Found 2 new deadlock thread(s):
"ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f owned by "subquery-5" Id=112
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684)
	-  blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314)
	at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716)
	...

"subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101)
	-  blocked on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243
	at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325)
	at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187)
	at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157)
	at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53)
```

The main thread `TakeOrderedAndProject.doExecute` is trying to compute `outputOrdering`, it top-down traverse the tree, and requires the lock of `QueryPlan.canonicalized` in the path.
In this deadlock, it successfully obtained the lock of `WholeStageCodegenExec` and requires the lock of `HashAggregateExec`;

Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via `def consume`, which checks `WholeStageCodegenExec.usedInputs` and refererences a lazy val `QueryPlan.references`. It requires the lock of `QueryPlan.references` in the path.
In this deadlock, it successfully obtained the lock of `HashAggregateExec` and requires the lock of `WholeStageCodegenExec`;

This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
manually checked with `com.databricks.spark.sql.SubquerySuite`

we encountered this issue multiple times before this fix in `SubquerySuite`, and after this fix we didn't hit this issue in multiple runs.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48279 from zhengruifeng/fix_deadlock.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants