-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-49808][SQL] Fix a deadlock in subquery execution due to lazy vals #48279
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We should say more about how the dead lock happens, e.g. the order how these two threads obtain locks. |
done, updated in the pr desc |
thanks, merging to master! |
attilapiros
pushed a commit
to attilapiros/spark
that referenced
this pull request
Oct 4, 2024
### What changes were proposed in this pull request? Fix a deadlock in subquery execution due to lazy vals ### Why are the changes needed? we observed a deadlock between `QueryPlan.canonicalized` and `QueryPlan.references`: ``` 24/09/04 04:46:54 ERROR DeadlockDetector: Found 2 new deadlock thread(s): "ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f owned by "subquery-5" Id=112 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684) - blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716) at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716) ... "subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101) - blocked on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187) at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157) at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53) ``` The main thread `TakeOrderedAndProject.doExecute` is trying to compute `outputOrdering`, it top-down traverse the tree, and requires the lock of `QueryPlan.canonicalized` in the path. In this deadlock, it successfully obtained the lock of `WholeStageCodegenExec` and requires the lock of `HashAggregateExec`; Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via `def consume`, which checks `WholeStageCodegenExec.usedInputs` and refererences a lazy val `QueryPlan.references`. It requires the lock of `QueryPlan.references` in the path. In this deadlock, it successfully obtained the lock of `HashAggregateExec` and requires the lock of `WholeStageCodegenExec`; This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manually checked with `com.databricks.spark.sql.SubquerySuite` we encountered this issue multiple times before this fix in `SubquerySuite`, and after this fix we didn't hit this issue in multiple runs. ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48279 from zhengruifeng/fix_deadlock. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
It needs more discussion on how to resolve this issue, let me revert it for now |
will be reverted in #48362 |
himadripal
pushed a commit
to himadripal/spark
that referenced
this pull request
Oct 19, 2024
### What changes were proposed in this pull request? Fix a deadlock in subquery execution due to lazy vals ### Why are the changes needed? we observed a deadlock between `QueryPlan.canonicalized` and `QueryPlan.references`: ``` 24/09/04 04:46:54 ERROR DeadlockDetector: Found 2 new deadlock thread(s): "ScalaTest-run-running-SubquerySuite" prio=5 Id=1 BLOCKED on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f owned by "subquery-5" Id=112 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:684) - blocked on org.apache.spark.sql.execution.aggregate.HashAggregateExec87abc7f at app//org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:684) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$doCanonicalize$2(QueryPlan.scala:716) at app//org.apache.spark.sql.catalyst.plans.QueryPlan$$Lambda$4058/0x00007f740f3d0cb0.apply(Unknown Source) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314) at app//org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1313) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.mapChildren(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:716) ... "subquery-5" daemon prio=5 Id=112 BLOCKED on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 owned by "ScalaTest-run-running-SubquerySuite" Id=1 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references$lzycompute(QueryPlan.scala:101) - blocked on org.apache.spark.sql.execution.WholeStageCodegenExec132a3243 at app//org.apache.spark.sql.catalyst.plans.QueryPlan.references(QueryPlan.scala:101) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.CodegenSupport.usedInputs$(WholeStageCodegenExec.scala:325) at app//org.apache.spark.sql.execution.WholeStageCodegenExec.usedInputs(WholeStageCodegenExec.scala:639) at app//org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:187) at app//org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:157) at app//org.apache.spark.sql.execution.aggregate.HashAggregateExec.consume(HashAggregateExec.scala:53) ``` The main thread `TakeOrderedAndProject.doExecute` is trying to compute `outputOrdering`, it top-down traverse the tree, and requires the lock of `QueryPlan.canonicalized` in the path. In this deadlock, it successfully obtained the lock of `WholeStageCodegenExec` and requires the lock of `HashAggregateExec`; Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via `def consume`, which checks `WholeStageCodegenExec.usedInputs` and refererences a lazy val `QueryPlan.references`. It requires the lock of `QueryPlan.references` in the path. In this deadlock, it successfully obtained the lock of `HashAggregateExec` and requires the lock of `WholeStageCodegenExec`; This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manually checked with `com.databricks.spark.sql.SubquerySuite` we encountered this issue multiple times before this fix in `SubquerySuite`, and after this fix we didn't hit this issue in multiple runs. ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#48279 from zhengruifeng/fix_deadlock. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Fix a deadlock in subquery execution due to lazy vals
Why are the changes needed?
we observed a deadlock between
QueryPlan.canonicalized
andQueryPlan.references
:The main thread
TakeOrderedAndProject.doExecute
is trying to computeoutputOrdering
, it top-down traverse the tree, and requires the lock ofQueryPlan.canonicalized
in the path.In this deadlock, it successfully obtained the lock of
WholeStageCodegenExec
and requires the lock ofHashAggregateExec
;Concurrently, a subquery execution thread is performing code generation and bottom-up traverses the tree via
def consume
, which checksWholeStageCodegenExec.usedInputs
and refererences a lazy valQueryPlan.references
. It requires the lock ofQueryPlan.references
in the path.In this deadlock, it successfully obtained the lock of
HashAggregateExec
and requires the lock ofWholeStageCodegenExec
;This is due to Scala's lazy val internally calls this.synchronized on the instance that contains the val. This creates a potential for deadlocks.
Does this PR introduce any user-facing change?
no
How was this patch tested?
manually checked with
SubquerySuite
we encountered this issue multiple times before this fix in
SubquerySuite
, and after this fix we didn't hit this issue in multiple runs.Was this patch authored or co-authored using generative AI tooling?
no