Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4714][CORE]: Check block have removed or not after got info lock in dropFromMemory #3574

Closed
wants to merge 3 commits into from

Conversation

suyanNone
Copy link
Contributor

[SPARK-4714][CORE]: Check block have removed or not after got info lock in dropFromMemory

After got info lock in one of them: removeBlock/dropOldBlocks/dropFromMemory method in BlockManager, the block that info represented may have already removed.

The Three method have the same logic to got info lock:

   info = blockInfo.get(id)
   if (info != null) {
     info.synchronized {
       do sth
     } 
   }

So it is chanced that when one thread got info.synchronized but it already removed from other threads who got info.synchronized before him.

In removeBlock and dropOldBlocks is idempotent if they not have blockinfo.get(blockId) != null
But in dropFromMemory it may be problematic for it may drop block data(which already removed) into diskstore, this method calls data store operations that should not handle missing blocks.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

if (!removedFromMemory && !removedFromDisk && !removedFromTachyon) {
logWarning(s"Block $blockId could not be removed as it was not found in either " +
"the disk, memory, or tachyon store")
if (blockInfo.get(blockId).isEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, this seems like the opposite of the condition we want here: if blockInfo doesn't contain an entry for blockId, then why would we want to remove that entry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. ThreadA into RemoveBlock(), and got info for blockId1
  2. ThreadB into DropFromMemory(), and got info for blockId1

now Thread A, B all want got info.sychronized

B got, and drop block(blockId) from memory, and this block is use memory only, so it remove from blockinfo. and then release the lock.

then A got, but info is already not in blockinfo.

Did I have miss sth or misunderstand sth? may dropForMemory and removeBlock can't happen at the same time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this extra check is necessary; check out my comment on the main pull request and see if you agree.

Even if we did want to add a check here, I think we want to check for if(blockInfo.get(blockId).nonEmpty), since this branch handles the case where blocks have not been removed already.

@JoshRosen
Copy link
Contributor

Can you elaborate on the problem that this PR addresses? Have you observed an actual error that this fixes?

@suyanNone
Copy link
Contributor Author

@JoshRosen
ThreadA into RemoveBlock(), and got info for blockId1
ThreadB into DropFromMemory(), and got info for blockId1
now Thread A, B all want got info.sychronized

B got, and drop block(blockId) from memory, and this block is use memory only, so it remove from blockinfo. and then release the lock.

then A got, but info is already not in blockinfo.

Did I have miss sth or misunderstand sth? may dropForMemory and removeBlock can't happen at the same time?

@JoshRosen
Copy link
Contributor

Ah, I think I see your concern: let's say that we a block and there are two threads that are racing to perform operations on it: to use your example, thread A wants to call removeBlock() and thread B wants to call dropFromMemory(). For this code to work correctly, we want it to operate correctly for all possible interleavings of those threads

If we consider the case where all of thread A's steps execute before any of thread B's, then things work okay: thread A will have removed the entry from blockInfo before thread B runs, so B will see that info == null and log a warning that the block has already been removed. The same is true for B before A.

In another execution, though, both threads could find the BlockInfo instance in the blockInfo map but race on acquiring its lock (info.synchronized), so info != null will be true for both threads. I agree that this could be a problem, but it might not be if the operations performed in those threads are idempotent. Let's take a look and see if that's the case:

  • removeBlock: all of the operations here are idempotent, so at worst we get a warning if we run this on a block that's removed by another racing thread.
  • dropOldBlocks: similarly, this just consists of calls to *Store.remove(), which are idempotent.
  • dropFromMemory: this case might actually be problematic, since I think that this method calls data store operations that don't handle missing blocks. I'm going to look at this case in a little more detail, but I think that your fix for this might be a good idea.

if (level.useOffHeap) { tachyonStore.remove(id) }
iterator.remove()
logInfo(s"Dropped block $id")
if (blockInfo.get(id).isEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I don't think that we strictly need a check here since the remove(id) operations are idempotent. It might be nice to log a warning if the block has already been removed, but that might not be necessary since this is a background cleanup call.

@suyanNone suyanNone force-pushed the refine-block-concurrency branch from 0c1a249 to 55fa4ba Compare December 8, 2014 05:10
@suyanNone
Copy link
Contributor Author

@JoshRosen
yes, it will not cause any error if there not have a check in removeBlock and dropOldBlocks
I pull this request, because I think it will be more reasonable in semantic.
and I already remove the check in removeBlock and dropOldBlocks

if (blockInfo.get(blockId).isEmpty) {
logWarning(s"Block $blockId was already dropped.")
return None
} else if(!info.waitForReady()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style nit: this needs a space after the if and before the open paren: if (!info....

@JoshRosen
Copy link
Contributor

Jenkins, this is ok to test.

@SparkQA
Copy link

SparkQA commented Dec 8, 2014

Test build #24225 has started for PR 3574 at commit 55fa4ba.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 8, 2014

Test build #24225 has finished for PR 3574 at commit 55fa4ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24225/
Test PASSed.

@@ -1010,7 +1010,10 @@ private[spark] class BlockManager(
info.synchronized {
// required ? As of now, this will be invoked only for blocks which are ready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment actually refers to the !info.waitForReady() case, so I'd like to either move the comment or swap the order of these checks so that we check for blockInfo.get(blockId).isEmpty in the else if clause instead.

@JoshRosen
Copy link
Contributor

Left one minor code organization comment; aside from that, this looks good to me and should be ready to merge after you fix that up (I can do it if you don't have time, though; just let me know).

There are a couple of edits that I'd like to make to the commit title / description before merging this, but I can do it myself on merge.

Thanks for the careful analysis and for catching this issue!

@suyanNone suyanNone changed the title [SPARK-4714][CORE]: Add checking info is null or not while having gotten info.syn [SPARK-4714][CORE]: Check block have removed or not after got info lock in dropFromMemory Dec 9, 2014
@suyanNone
Copy link
Contributor Author

@JoshRosen I refine the code according your comments.
If still have problem, it OK for you to fix up including the title and comments, and thanks for you to check my code and give suggestion.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24234 has started for PR 3574 at commit edb989d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24234 has finished for PR 3574 at commit edb989d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24234/
Test PASSed.

@JoshRosen
Copy link
Contributor

LGTM. I like how this ended up evolving into a very small, focused fix.

I'm going to fix up the title / commit message and merge this into master. Even though I don't think we've observed failures from this in the wild, I'm in favor of backporting this into the maintenance branches, too, since it probably won't cause any harm to add more defensive checks here. I'll merge this into branch-1.0 and branch-1.1 and hold off on branch-1.2 for now (I'll add a backport-needed JIRA label and pull this in after 1.2.0 is released).

Thanks!

asfgit pushed a commit that referenced this pull request Dec 9, 2014
… has been removed after synchronizing on BlockInfo instance.

After synchronizing on the `info` lock in the `removeBlock`/`dropOldBlocks`/`dropFromMemory` methods in BlockManager, the block that `info` represented may have already removed.

The three methods have the same logic to get the `info` lock:
```
   info = blockInfo.get(id)
   if (info != null) {
     info.synchronized {
       // do something
     }
   }
```

So, there is chance that when a thread enters the `info.synchronized` block, `info` has already been removed from the `blockInfo` map by some other thread who entered `info.synchronized` first.

The `removeBlock` and `dropOldBlocks` methods are idempotent, so it's safe for them to run on blocks that have already been removed.
But in `dropFromMemory` it may be problematic since it may drop block data which already removed into the diskstore, and this calls data store operations that are not designed to handle missing blocks.

This patch fixes this issue by adding a check to `dropFromMemory` to test whether blocks have been removed by a racing thread.

Author: hushan[胡珊] <hushan@xiaomi.com>

Closes #3574 from suyanNone/refine-block-concurrency and squashes the following commits:

edb989d [hushan[胡珊]] Refine code style and comments position
55fa4ba [hushan[胡珊]] refine code
e57e270 [hushan[胡珊]] add check info is already remove or not while having gotten info.syn

(cherry picked from commit 30dca92)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
asfgit pushed a commit that referenced this pull request Dec 9, 2014
… has been removed after synchronizing on BlockInfo instance.

After synchronizing on the `info` lock in the `removeBlock`/`dropOldBlocks`/`dropFromMemory` methods in BlockManager, the block that `info` represented may have already removed.

The three methods have the same logic to get the `info` lock:
```
   info = blockInfo.get(id)
   if (info != null) {
     info.synchronized {
       // do something
     }
   }
```

So, there is chance that when a thread enters the `info.synchronized` block, `info` has already been removed from the `blockInfo` map by some other thread who entered `info.synchronized` first.

The `removeBlock` and `dropOldBlocks` methods are idempotent, so it's safe for them to run on blocks that have already been removed.
But in `dropFromMemory` it may be problematic since it may drop block data which already removed into the diskstore, and this calls data store operations that are not designed to handle missing blocks.

This patch fixes this issue by adding a check to `dropFromMemory` to test whether blocks have been removed by a racing thread.

Author: hushan[胡珊] <hushan@xiaomi.com>

Closes #3574 from suyanNone/refine-block-concurrency and squashes the following commits:

edb989d [hushan[胡珊]] Refine code style and comments position
55fa4ba [hushan[胡珊]] refine code
e57e270 [hushan[胡珊]] add check info is already remove or not while having gotten info.syn

(cherry picked from commit 30dca92)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in 30dca92 Dec 9, 2014
@JoshRosen
Copy link
Contributor

I've merged this into branch-1.2.

asfgit pushed a commit that referenced this pull request Dec 17, 2014
… has been removed after synchronizing on BlockInfo instance.

After synchronizing on the `info` lock in the `removeBlock`/`dropOldBlocks`/`dropFromMemory` methods in BlockManager, the block that `info` represented may have already removed.

The three methods have the same logic to get the `info` lock:
```
   info = blockInfo.get(id)
   if (info != null) {
     info.synchronized {
       // do something
     }
   }
```

So, there is chance that when a thread enters the `info.synchronized` block, `info` has already been removed from the `blockInfo` map by some other thread who entered `info.synchronized` first.

The `removeBlock` and `dropOldBlocks` methods are idempotent, so it's safe for them to run on blocks that have already been removed.
But in `dropFromMemory` it may be problematic since it may drop block data which already removed into the diskstore, and this calls data store operations that are not designed to handle missing blocks.

This patch fixes this issue by adding a check to `dropFromMemory` to test whether blocks have been removed by a racing thread.

Author: hushan[胡珊] <hushan@xiaomi.com>

Closes #3574 from suyanNone/refine-block-concurrency and squashes the following commits:

edb989d [hushan[胡珊]] Refine code style and comments position
55fa4ba [hushan[胡珊]] refine code
e57e270 [hushan[胡珊]] add check info is already remove or not while having gotten info.syn

(cherry picked from commit 30dca92)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants