
[SPARK-18406][CORE] Race between end-of-task and completion iterator read lock release #18076

Closed
wants to merge 3 commits

Conversation

jiangxb1987
Contributor

What changes were proposed in this pull request?

When a TaskContext is not propagated properly to all child threads of a task, as in the cases reported in this issue, we fail to get the TID from TaskContext, which makes it impossible to release the read lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.

How was this patch tested?

Added a new regression test case in `RDDSuite` that fails without this patch.
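The shape of the bug and the fix can be sketched in a few lines. This is a simplified model, not Spark's actual `BlockManager`/`BlockInfoManager` API — all names below are hypothetical. The point is that `unlock` used to read the task attempt ID from a thread-local, which is unset in child threads, so the fixed signature accepts an explicit ID:

```scala
object LockSketch {
  // stand-in for TaskContext.get(): a thread-local task attempt id
  private val currentTaskId = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }
  // which task holds the read lock on each block
  private val readLocks = scala.collection.mutable.Map.empty[String, Long]

  def setTaskId(tid: Long): Unit = currentTaskId.set(Some(tid))

  def lockForReading(blockId: String, tid: Long): Unit =
    readLocks.synchronized { readLocks(blockId) = tid }

  // Before the fix: unlock relied solely on the thread-local, which fails in
  // child threads. After the fix: the caller may pass the id explicitly.
  def unlock(blockId: String, taskAttemptId: Option[Long] = None): Unit = {
    val tid = taskAttemptId.orElse(currentTaskId.get())
      .getOrElse(throw new IllegalStateException("no task id available"))
    readLocks.synchronized {
      assert(readLocks.get(blockId).contains(tid), s"lock on $blockId not held by task $tid")
      readLocks.remove(blockId)
    }
  }
}

// a child thread without the thread-local set can still release the lock
LockSketch.lockForReading("rdd_0_0", 42L)
val child = new Thread(() => LockSketch.unlock("rdd_0_0", taskAttemptId = Some(42L)))
child.start(); child.join()
```

Without the explicit argument, the same child-thread unlock would throw, which is the assertion failure described above.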

@jiangxb1987
Contributor Author

cc @JoshRosen @cloud-fan

@@ -501,14 +501,18 @@ private[spark] class BlockManager(
case Some(info) =>
val level = info.level
logDebug(s"Level for block $blockId is $level")
val taskAttemptId = Option(TaskContext.get()).map(_.taskAttemptId())
.getOrElse(BlockInfo.NON_TASK_WRITER)
Contributor

I think we can leave out the .getOrElse here and just pass in the Option itself into releaseLock. This helps to avoid exposure of BlockInfo.NON_TASK_WRITER here. Not a huge deal but just a minor nit.
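The suggestion amounts to moving the sentinel fallback inside `releaseLock` so call sites never see it. A minimal sketch with made-up names (`-1024L` is only a stand-in for whatever `BlockInfo.NON_TASK_WRITER` actually is):

```scala
// stand-in sentinel for "no task" writers; hidden inside releaseLock
val NonTaskWriter = -1024L
val releasedBy = scala.collection.mutable.ListBuffer.empty[Long]

def releaseLock(blockId: String, taskAttemptId: Option[Long]): Unit = {
  // the .getOrElse fallback lives here, not at every call site
  val tid = taskAttemptId.getOrElse(NonTaskWriter)
  releasedBy += tid
}

// call sites just forward whatever the (possibly absent) TaskContext yields
releaseLock("rdd_0_0", Some(7L))
releaseLock("rdd_0_1", None) // falls back internally
```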

if (level.useMemory && memoryStore.contains(blockId)) {
val iter: Iterator[Any] = if (level.deserialized) {
memoryStore.getValues(blockId).get
} else {
serializerManager.dataDeserializeStream(
blockId, memoryStore.getBytes(blockId).get.toInputStream())(info.classTag)
}
val ci = CompletionIterator[Any, Iterator[Any]](iter, releaseLock(blockId))
Contributor

I'd add a one-line comment before this line which references SPARK-18406, something like

"We need to capture the current taskId in case the iterator completion is triggered from a different thread which does not have TaskContext set; see SPARK-18406 for discussion"

or similar.
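The capture pattern this comment asks for can be sketched as follows (simplified stand-ins, not Spark's real `CompletionIterator` or `TaskContext`): read the thread-local eagerly on the task thread and close over the value, so the completion callback still knows the task ID even when it fires on a thread where the thread-local is unset.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// thread-local standing in for TaskContext.get()
val taskId = new ThreadLocal[Option[Long]] {
  override def initialValue(): Option[Long] = None
}

// tiny completion iterator: runs onComplete once the wrapped iterator is drained
def completionIterator[A](sub: Iterator[A])(onComplete: () => Unit): Iterator[A] =
  new Iterator[A] {
    private var completed = false
    def hasNext: Boolean = {
      val h = sub.hasNext
      if (!h && !completed) { completed = true; onComplete() }
      h
    }
    def next(): A = sub.next()
  }

taskId.set(Some(42L))
val capturedTid = taskId.get() // captured NOW, on the "task" thread
val sawCapturedId = new AtomicBoolean(false)
val iter = completionIterator(Iterator(1, 2, 3)) { () =>
  // may run on a thread where the thread-local is unset; use the captured value
  sawCapturedId.set(capturedTid.contains(42L))
}

// drain the iterator on a different thread, as in the reported race
val drainer = new Thread(() => iter.foreach(_ => ()))
drainer.start(); drainer.join()
```

Had the callback read `taskId.get()` instead of `capturedTid`, it would see `None` on the drainer thread — exactly the failure mode of SPARK-18406.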

@@ -713,8 +718,15 @@ private[spark] class BlockManager(
/**
* Release a lock on the given block.
*/
-  def releaseLock(blockId: BlockId): Unit = {
-    blockInfoManager.unlock(blockId)
-  }
+  def releaseLock(blockId: BlockId): Unit = releaseLock(blockId, taskAttemptId = None)
Contributor

Why do we need to overload here? Why not just have a single releaseLock method with a default argument?

Contributor Author

In fact `BlockManager` extends `BlockDataManager`, so it has to override the `releaseLock(blockId: BlockId)` method; thus we keep that signature and add a new method that accepts the extra `taskAttemptId` argument.

Contributor

I think there's only one implementation of BlockDataManager these days, though? Since that's an internal interface maybe we could change it there, too?
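The constraint being discussed can be illustrated with a toy trait (names are simplified stand-ins, not the real Spark interfaces): because the interface pins the one-argument signature, adding a default parameter would change the method's signature rather than override it, so an overload is the drop-in option.

```scala
// toy stand-in for the BlockDataManager interface
trait BlockDataManagerLike {
  def releaseLock(blockId: String): Unit
}

class BlockManagerLike extends BlockDataManagerLike {
  var released: List[(String, Long)] = Nil

  // must keep exactly this signature to satisfy the trait
  override def releaseLock(blockId: String): Unit =
    releaseLock(blockId, taskAttemptId = None)

  // new overload carrying the explicit task attempt id (-1L as a toy sentinel)
  def releaseLock(blockId: String, taskAttemptId: Option[Long]): Unit =
    released = (blockId, taskAttemptId.getOrElse(-1L)) :: released
}
```

As the reviewer notes, the alternative is to change the signature on the trait itself, which is viable precisely because the interface is internal and has a single implementation.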

rdd.cache()

rdd.mapPartitions { iter =>
ThreadUtils.runInNewThread("TestThread") {
Contributor

Nice use of this helper method. I wasn't aware of this, but it's pretty nice. I'll use it in my own tests going forward.
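For readers unfamiliar with the helper, here is a rough sketch of what a `ThreadUtils.runInNewThread`-style utility provides (simplified; Spark's real helper does more, e.g. exception stack-trace handling, which this sketch omits): run a body on a freshly named thread, wait for it, and rethrow any failure on the caller.

```scala
def runInNewThread[T](threadName: String)(body: => T): T = {
  var result: Option[T] = None
  var error: Option[Throwable] = None
  val t = new Thread(threadName) {
    override def run(): Unit =
      try result = Some(body)
      catch { case e: Throwable => error = Some(e) }
  }
  t.start()
  t.join() // join() establishes happens-before, so the writes above are visible
  error.foreach(e => throw e)
  result.get
}

// the body runs on (and can observe) the new thread
val name = runInNewThread("TestThread")(Thread.currentThread().getName)
```

This is what makes it easy to simulate "iterator consumed on a thread without a TaskContext" inside a test.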

@JoshRosen
Contributor

This is a good, clean fix. I left a couple of review comments but they're only minor stylistic comments, not correctness issues.

I checked and it looks like this fixes both occurrences of releasing locks in completion iterators. The only other case I can think of testing is making sure that BlockManagerManagedBuffer doesn't have a similar problem. It might be worth tackling that separately as a follow-up after investigation, though: we currently don't know of problems there, and a deep dive would be complicated enough that we shouldn't block this PR on it.

@SparkQA

SparkQA commented May 24, 2017

Test build #77266 has finished for PR 18076 at commit 740dc19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented May 24, 2017

Test build #77274 has finished for PR 18076 at commit 72cee6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

LGTM as well.

@SparkQA

SparkQA commented May 24, 2017

Test build #77278 has finished for PR 18076 at commit bc66ec5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 24, 2017
…read lock release

## What changes were proposed in this pull request?

When a TaskContext is not propagated properly to all child threads of a task, as in the cases reported in this issue, we fail to get the TID from TaskContext, which makes it impossible to release the read lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.

## How was this patch tested?

Added a new regression test case in `RDDSuite` that fails without this patch.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18076 from jiangxb1987/completion-iterator.

(cherry picked from commit d76633e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in d76633e May 24, 2017
@cloud-fan
Contributor

thanks, merging to master/2.2! @jiangxb1987 can you send a new PR to backport this to branch 2.1 and 2.0? thanks!

jiangxb1987 added a commit to jiangxb1987/spark that referenced this pull request May 24, 2017
…read lock release

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes apache#18076 from jiangxb1987/completion-iterator.
asfgit pushed a commit that referenced this pull request May 24, 2017
…tion iterator read lock release

This is a backport PR of  #18076 to 2.0 and 2.1.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18096 from jiangxb1987/completion-iterator-2.0.
jiangxb1987 added a commit to jiangxb1987/spark that referenced this pull request May 24, 2017
…read lock release

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes apache#18076 from jiangxb1987/completion-iterator.
asfgit pushed a commit that referenced this pull request May 25, 2017
…tion iterator read lock release

This is a backport PR of  #18076 to 2.1.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18099 from jiangxb1987/completion-iterator-2.1.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
…tion iterator read lock release

This is a backport PR of  apache#18076 to 2.1.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes apache#18099 from jiangxb1987/completion-iterator-2.1.