Skip to content

Commit

Permalink
[ISSUE-392] Fix the bug in the shuffle data cleanup checker that caus…
Browse files Browse the repository at this point in the history
…es false reports of disk corruption (#393)

### What changes were proposed in this pull request?

[ISSUE-392] Fix the bug in the shuffle data cleanup checker that causes false reports of disk corruption #393


### Why are the changes needed?

Fix the bug in the shuffle data checker that causes false reports of disk corruption during cleanup.

```
[INFO] 2022-12-07 16:27:51,411 leakShuffleDataChecker ShuffleTaskManager checkLeakShuffleData - Start check leak shuffle data
[INFO] 2022-12-07 16:27:51,416 leakShuffleDataChecker LocalFileDeleteHandler delete - Delete shuffle data for appId[check] with /data1/uniffle/data/check cost 0 ms
[INFO] 2022-12-07 16:27:51,420 leakShuffleDataChecker LocalFileDeleteHandler delete - Delete shuffle data for appId[check] with /data2/uniffle/data/check cost 0 ms
[INFO] 2022-12-07 16:27:51,420 leakShuffleDataChecker LocalFileDeleteHandler delete - Delete shuffle data for appId[check] with /data3/uniffle/data/check cost 0 ms
[INFO] 2022-12-07 16:27:51,420 leakShuffleDataChecker LocalFileDeleteHandler delete - Delete shuffle data for appId[check] with /data4/uniffle/data/check cost 0 ms
[INFO] 2022-12-07 16:27:51,420 leakShuffleDataChecker ShuffleTaskManager checkLeakShuffleData - Finish check leak shuffle data
[ERROR] 2022-12-07 16:27:51,685 HealthCheckService LocalStorageChecker checkStorageReadAndWrite - Storage read and write error
java.io.FileNotFoundException: /data4/uniffle/data/check/test (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at org.apache.uniffle.server.LocalStorageChecker$StorageInfo.checkStorageReadAndWrite(LocalStorageChecker.java:180)
        at org.apache.uniffle.server.LocalStorageChecker.checkIsHealthy(LocalStorageChecker.java:73)
        at org.apache.uniffle.server.HealthCheck.check(HealthCheck.java:84)
        at org.apache.uniffle.server.HealthCheck.lambda$new$0(HealthCheck.java:70)
        at java.lang.Thread.run(Thread.java:745)
[INFO] 2022-12-07 16:27:51,685 HealthCheckService LocalStorageChecker checkIsHealthy - shuffle server become unhealthy
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

1. UTs
  • Loading branch information
zuston authored Dec 8, 2022
1 parent 8847ece commit 3ec3f41
Show file tree
Hide file tree
Showing 3 changed files with 13 additions and 4 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@
public class LocalStorageChecker extends Checker {

private static final Logger LOG = LoggerFactory.getLogger(LocalStorageChecker.class);
public static final String CHECKER_DIR_NAME = ".check";

private final double diskMaxUsagePercentage;
private final double diskRecoveryUsagePercentage;
Expand Down Expand Up @@ -164,7 +165,8 @@ boolean checkStorageReadAndWrite() {
if (storage.isCorrupted()) {
return false;
}
File checkDir = new File(storageDir, "check");
// Use the hidden file to avoid being cleanup
File checkDir = new File(storageDir, CHECKER_DIR_NAME);
try {
if (!checkDir.mkdirs()) {
return false;
Expand Down Expand Up @@ -196,13 +198,13 @@ boolean checkStorageReadAndWrite() {
} while (readBytes != -1);
}
} catch (Exception e) {
LOG.error("Storage read and write error ", e);
LOG.error("Storage read and write error. Storage dir: {}", storageDir, e);
return false;
} finally {
try {
FileUtils.deleteDirectory(checkDir);
} catch (IOException ioe) {
LOG.error("delete directory fail", ioe);
LOG.error("delete directory fail. Storage dir: {}", storageDir, ioe);
return false;
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -705,8 +705,14 @@ public void checkAndClearLeakShuffleDataTest(@TempDir File tempDir) throws Excep
// make sure heartbeat timeout and resources are removed
Thread.sleep(5000);

// Create the hidden dir to simulate LocalStorageChecker's check
String storageDir = tempDir.getAbsolutePath();
File hiddenFile = new File(storageDir + "/" + LocalStorageChecker.CHECKER_DIR_NAME);
hiddenFile.mkdir();

appIdsOnDisk = getAppIdsOnDisk(localStorageManager);
assertFalse(appIdsOnDisk.contains(appId));
assertFalse(appIdsOnDisk.contains(LocalStorageChecker.CHECKER_DIR_NAME));

// mock leak shuffle data
File file = new File(tempDir, appId);
Expand All @@ -717,6 +723,7 @@ public void checkAndClearLeakShuffleDataTest(@TempDir File tempDir) throws Excep
// execute checkLeakShuffleData
shuffleTaskManager.checkLeakShuffleData();
assertFalse(file.exists());
assertTrue(hiddenFile.exists());
}

private Set<String> getAppIdsOnDisk(LocalStorageManager localStorageManager) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -323,7 +323,7 @@ public Set<String> getAppIds() {
File[] files = baseFolder.listFiles();
if (files != null) {
for (File file : files) {
if (file.isDirectory()) {
if (file.isDirectory() && !file.isHidden()) {
appIds.add(file.getName());
}
}
Expand Down

0 comments on commit 3ec3f41

Please sign in to comment.