[SPARK-8029] Robust shuffle writer #9610
Conversation
Test build #2031 has finished for PR 9610 at commit
@@ -93,6 +95,10 @@ private[spark] class IndexShuffleBlockResolver(conf: SparkConf) extends ShuffleB
    } {
      out.close()
    }
    indexFile.deleteOnExit()
    if (!tmp.renameTo(indexFile)) {
      throw new IOException(s"fail to rename index file $tmp to $indexFile")
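The hunk above stages the index to a temp file and then renames it into place. A minimal standalone sketch of that pattern, assuming the index is a sequence of longs (`writeIndexFile` is a hypothetical name, not a Spark API):

```scala
import java.io.{DataOutputStream, File, FileOutputStream, IOException}

// Minimal sketch of the staged write in the hunk above. The temp file is
// created in the same directory as indexFile so the rename stays on one
// filesystem and can be atomic. writeIndexFile is a hypothetical helper.
def writeIndexFile(indexFile: File, offsets: Array[Long]): Unit = {
  val tmp = File.createTempFile(indexFile.getName, ".tmp", indexFile.getParentFile)
  val out = new DataOutputStream(new FileOutputStream(tmp))
  try {
    offsets.foreach(out.writeLong)
  } finally {
    out.close()
  }
  indexFile.delete() // drop any file left by a previous attempt
  if (!tmp.renameTo(indexFile)) {
    throw new IOException(s"fail to rename index file $tmp to $indexFile")
  }
}
```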
This will just kill the task, right? Both tasks are actually fine, and the overall job should continue if either one succeeds. Instead, this will lead to the task getting retried, potentially failing up to 4 times, even though it actually finished successfully from another task set. You could handle this in the scheduler, but that would add some complexity.
There is very little chance that two concurrent tasks will call renameTo at the same time. Even if they do, one of them will succeed, the scheduler will mark the partition as successful, and the failure will be ignored (not retried).
Can you test for this? I think the worry was about different TaskSets attempting the same map stage. Imagine that attempt 1 of the stage successfully completes a task, and sends back a map output status, but that status gets ignored because that stage attempt got cancelled. Attempt 2 might then fail to send a new status for it.
There seem to be two ways to fix it if this problem can actually occur -- either add MapOutputStatuses even from failed task sets or mark this new task as successful if a file exists.
On Thu, Nov 12, 2015 at 8:30 AM, Matei Zaharia notifications@github.com wrote:

> In core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala, #9610 (comment): Can you test for this? I think the worry was about different TaskSets attempting the same map stage. Imagine that attempt 1 of the stage successfully completes a task, and sends back a map output status, but that status gets ignored because that stage attempt got cancelled. Attempt 2 might then fail to send a new status for it. There seem to be two ways to fix it if this problem can actually occur -- either add MapOutputStatuses even from failed task sets or mark this new task as successful if a file exists.

After this PR, the second attempt of the same task will return SUCCESS with a new MapOutputStatus, which could be different from the previous attempt (having different partition sizes). Since we do not use the exact sizes (they may be lossily compressed), I think it's fine.

Reply to this email directly or view it on GitHub: https://github.com/apache/spark/pull/9610/files#r44678796

- Davies
Test build #2032 has finished for PR 9610 at commit
Test build #45600 has finished for PR 9610 at commit
Test build #45618 has finished for PR 9610 at commit
Test build #2040 has finished for PR 9610 at commit
Test build #2043 has finished for PR 9610 at commit
    dataFile.delete()
    }
    if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
      throw new IOException("fail to rename data file " + dataTmp + " to " + dataFile)
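The conditional rename above can be read as a commit step: clear out any previous attempt's data file, then move this attempt's temp output (if any) into place. A standalone sketch (`commitDataFile` is a hypothetical name, not a Spark API):

```scala
import java.io.{File, IOException}

// Standalone sketch of the commit step in the hunk above: delete any data
// file left by a previous attempt, then rename this attempt's temp output
// into place. dataTmp may be null when the map output is empty.
// commitDataFile is a hypothetical name.
def commitDataFile(dataFile: File, dataTmp: File): Unit = {
  if (dataFile.exists()) {
    dataFile.delete() // last-attempt-wins for the data file
  }
  if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) {
    throw new IOException("fail to rename data file " + dataTmp + " to " + dataFile)
  }
}
```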
I don't think there is a particular flaw here, but it's a bit hard to follow since it's a mix of first-attempt-wins and last-attempt-wins: first attempt wins if there are both a data file and an index file; last attempt wins if there is only an index file. The problem with last-attempt-wins is that this delete will fail on Windows if the file is open for reading, I believe. We can't entirely get around that, because SPARK-4085 always requires us to delete some files that might be open, in which case we hope that we don't run into this race again on the next retry. It would be nice to minimize that case, though. We'd be closer to first-attempt-wins if we always wrote a dataFile, even an empty one when dataTmp == null.
There is also an issue with mapStatus and non-deterministic data. It might not matter which output you get, but the map status should be consistent with the data that is read. If attempt 1 writes non-empty outputs a, b, c, and attempt 2 writes non-empty outputs d, e, f (which are not committed), the reduce tasks might get the map status for attempt 2, look for outputs d, e, f, and get nothing but empty blocks. Matei had suggested writing the map status to a file, so that subsequent attempts always return the map status corresponding to the first successful attempt.
You can take this test case if you like: https://github.com/squito/spark/blob/SPARK-8029_first_wins/core/src/test/scala/org/apache/spark/ShuffleSuite.scala#L351 I'd also add more test cases to cover the various paths through the output commit code. And I think that
Test build #2053 has started for PR 9610 at commit
@squito Thanks for reviewing this. I have included your regression test here and also added tests for the resolver.
    val dataFile = getDataFile(shuffleId, mapId)
    // Note: there is only one IndexShuffleBlockResolver per executor
    synchronized {
      val existedLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length)
existingLengths
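The checkIndexAndDataFile call above is what makes the first successful attempt win. A hypothetical sketch of such a check, assuming the index file stores (blocks + 1) cumulative offsets as longs (the function name and exact layout here are assumptions of this sketch):

```scala
import java.io.{DataInputStream, File, FileInputStream}

// Hypothetical sketch of the first-attempt-wins check discussed above:
// an existing index+data pair is trusted only if the offsets are
// well-formed and the last offset equals the data file's length.
// Returns the block lengths on a match, or null when the existing
// files are unusable and this attempt's output should be committed.
def checkIndexAndData(index: File, data: File, blocks: Int): Array[Long] = {
  if (index.length() != (blocks + 1) * 8L) return null
  val lengths = new Array[Long](blocks)
  val in = new DataInputStream(new FileInputStream(index))
  try {
    var offset = in.readLong()   // first offset must be 0
    if (offset != 0L) return null
    var i = 0
    while (i < blocks) {
      val next = in.readLong()
      lengths(i) = next - offset // block i spans [offset, next)
      offset = next
      i += 1
    }
  } finally {
    in.close()
  }
  // the data file must be exactly as long as the sum of block lengths
  if (data.length() == lengths.sum) lengths else null
}
```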
@@ -65,11 +66,11 @@ private[spark] class SortShuffleWriter[K, V, C](
    // Don't bother including the time to open the merged output file in the shuffle write time,
    // because it just opens a single file, so is typically too fast to measure accurately
    // (see SPARK-3570).
    val outputFile = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val tmp = Utils.tempFileWith(output)
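A plausible sketch of a Utils.tempFileWith-style helper: it derives the temp path from the target path so both live in the same directory and a later renameTo does not cross filesystems (the exact `.<uuid>` suffix is an assumption of this sketch):

```scala
import java.io.File
import java.util.UUID

// Plausible sketch of a tempFileWith helper: the temp path shares the
// target's directory, keeping a later renameTo within one filesystem.
def tempFileWith(path: File): File =
  new File(path.getAbsolutePath + "." + UUID.randomUUID())
```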
Can you call these outputTmp or something so it's slightly easier to follow? (here and other places)
There is only one output file here, so I think it's obvious.
@davies I took a pass and I find the approach correct and simple. I did a close review and confirmed that all four
Test build #45800 has finished for PR 9610 at commit
@andrewor14 Thanks, this looks much better now.
Test build #45824 has finished for PR 9610 at commit
The failed test is not related. I'm merging this into master and will create another PR for the other branches.
Currently, all the shuffle writers write to the target path directly, so the file could be corrupted by another attempt of the same partition on the same executor. They should write to a temporary file and then rename it to the target path, as we do in the output committer. To make the rename atomic, the temporary file should be created in the same local directory (FileSystem). This PR is based on #9214 , thanks to squito . Closes #9214 Author: Davies Liu <davies@databricks.com> Closes #9610 from davies/safe_shuffle.
Test build #2054 has finished for PR 9610 at commit
@davies : If we want to merge two data files (one from the first action of an RDD and another from a second action of the same RDD), how can I do that? And do I need to do anything with the index file?
SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file. This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed. Author: Josh Rosen <joshrosen@databricks.com> Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer. (cherry picked from commit 5b8f737) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
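The fix described above boils down to wrapping the write-and-rename in a `finally` that deletes the temp file whenever it was never renamed into place. A self-contained sketch under that assumption (`commitBytes` is a hypothetical name, not the patch's actual code):

```scala
import java.io.{File, IOException}
import java.nio.file.Files

// Sketch of the leak fix described above: the finally block deletes the
// temp file if it still exists, so a failed write or rename cannot leak
// disk space, while a successful rename makes the delete a no-op.
def commitBytes(target: File, bytes: Array[Byte]): Unit = {
  val tmp = File.createTempFile(target.getName, ".tmp", target.getParentFile)
  try {
    Files.write(tmp.toPath, bytes)
    if (!tmp.renameTo(target)) {
      throw new IOException(s"fail to rename $tmp to $target")
    }
  } finally {
    if (tmp.exists()) tmp.delete() // cleans up after any failure path
  }
}
```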