Introduce FS Health HEALTHY threshold to fail stuck node #1167

Bukhtawar · 2021-08-28T14:40:40Z

Signed-off-by: Bukhtawar Khan bukhtawa@amazon.com

Description

FS health should fail on stuck IO. This would cause a node that is stuck on IO to eventually be removed from the cluster.

Issues Resolved

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

opensearch-ci-bot · 2021-08-28T14:45:56Z

✅ Gradle Wrapper Validation success 88f885b

opensearch-ci-bot · 2021-08-28T14:45:56Z

✅ DCO Check Passed 88f885b

opensearch-ci-bot · 2021-08-28T14:54:12Z

✅ Gradle Precommit success 88f885b

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

opensearch-ci-bot · 2021-08-30T05:51:12Z

✅ DCO Check Passed d1789a5

opensearch-ci-bot · 2021-08-30T05:52:02Z

✅ Gradle Wrapper Validation success d1789a5

opensearch-ci-bot · 2021-08-30T06:00:14Z

✅ Gradle Precommit success d1789a5

itiyamas · 2021-08-30T10:03:59Z

server/src/test/java/org/opensearch/monitor/fs/FsHealthServiceTests.java

@@ -347,7 +403,7 @@ public void force(boolean metaData) throws IOException {

    private static class FileSystemFsyncHungProvider extends FilterFileSystemProvider {

-        AtomicBoolean injectIOException = new AtomicBoolean();
+        AtomicBoolean injectIODelay = new AtomicBoolean();


The delay is achieved by using Thread.sleep. You could use latching to make this more deterministic in tests.

The test actually does use a latch; we are mimicking a stuck IO operation by making the fsync operation go to sleep. I have, however, improved the mock to make it more deterministic based on your suggestion.
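A minimal sketch of the latching idea discussed above, assuming a hypothetical `LatchedFsync` stand-in rather than the PR's actual `FileSystemFsyncHungProvider`: the fake fsync signals when it has been entered and then blocks until the test releases it, so no sleep duration has to be tuned.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an fsync call that a test can hold "stuck".
class LatchedFsync {
    // Counted down once the fake fsync has been entered.
    final CountDownLatch entered = new CountDownLatch(1);
    // Counted down by the test to let the fake fsync return.
    final CountDownLatch release = new CountDownLatch(1);

    void force() throws InterruptedException {
        entered.countDown();
        release.await(); // blocks until the test decides the IO "completes"
    }
}

class LatchedFsyncDemo {
    // Returns true if the fsync was observed blocked and finished only after release.
    static boolean run() throws Exception {
        LatchedFsync fsync = new LatchedFsync();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> pending = executor.submit(() -> {
                fsync.force();
                return null;
            });
            // Wait until the IO thread is inside force(); no sleeping or guessing.
            boolean enteredInTime = fsync.entered.await(5, TimeUnit.SECONDS);
            // force() cannot return before release is counted down.
            boolean stuck = !pending.isDone();
            fsync.release.countDown();
            pending.get(5, TimeUnit.SECONDS);
            return enteredInTime && stuck && pending.isDone();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("stuck then released: " + run());
    }
}
```

In a real test the assertion on the health status would go between observing `entered` and counting down `release`.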

itiyamas · 2021-08-30T10:07:48Z

server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java

            statusInfo = new StatusInfo(HEALTHY, "health check disabled");
        } else if (brokenLock) {
            statusInfo = new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");
+        } else if (lastRunTimeMillis.get() > Long.MIN_VALUE && currentTimeMillisSupplier.getAsLong() -


This won't catch the case where the first-ever run of the health check does not complete.

Good point. Done!

getsaurabh02 · 2021-08-30T08:08:43Z

server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java


    @Nullable
    private volatile Set<Path> unhealthyPaths;

    public static final Setting<Boolean> ENABLED_SETTING =
        Setting.boolSetting("monitor.fs.health.enabled", true, Setting.Property.NodeScope, Setting.Property.Dynamic);
    public static final Setting<TimeValue> REFRESH_INTERVAL_SETTING =
-        Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(120), TimeValue.timeValueMillis(1),
+        Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(60), TimeValue.timeValueMillis(1),


Why do we need to decrease the refresh interval here?

We want the checks to be more frequent so that we can catch issues faster.

getsaurabh02 · 2021-08-30T14:45:01Z

server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java

+        } else if (lastRunTimeMillis.get() > Long.MIN_VALUE && currentTimeMillisSupplier.getAsLong() -
+            lastRunTimeMillis.get() > refreshInterval.millis() + healthyTimeoutThreshold.millis()) {
+            statusInfo = new StatusInfo(UNHEALTHY, "healthy threshold breached");


Can we simplify this condition by having current run start time, instead of relying on the last run time?

Makes sense. Good point
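One way the suggestion above could look, as a sketch only: the check records when the current run started and clears that marker when it finishes, so a run that never completes (including the first one) is detected. `StuckIoDetector` and its method names are hypothetical and are not taken from the merged `FsHealthService` code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper illustrating "current run start time" tracking.
class StuckIoDetector {
    // Sentinel meaning no health check is currently in progress.
    private static final long NOT_RUNNING = Long.MIN_VALUE;

    private final AtomicLong currentRunStartMillis = new AtomicLong(NOT_RUNNING);
    private final LongSupplier clock;
    private final long healthyTimeoutMillis;

    StuckIoDetector(LongSupplier clock, long healthyTimeoutMillis) {
        this.clock = clock;
        this.healthyTimeoutMillis = healthyTimeoutMillis;
    }

    // Called just before the IO-heavy part of a health check begins.
    void onCheckStarted() {
        currentRunStartMillis.set(clock.getAsLong());
    }

    // Called when the health check returns, successfully or not.
    void onCheckFinished() {
        currentRunStartMillis.set(NOT_RUNNING);
    }

    // True when a run is in progress and has exceeded the healthy timeout.
    boolean isStuck() {
        long start = currentRunStartMillis.get();
        return start != NOT_RUNNING && clock.getAsLong() - start > healthyTimeoutMillis;
    }
}
```

With this shape the condition no longer involves the refresh interval, since only the duration of the in-flight run matters.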

getsaurabh02 · 2021-08-30T14:45:05Z

server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java

+    public static final Setting<TimeValue> HEALTHY_TIMEOUT_SETTING =
+        Setting.timeSetting("monitor.fs.health.healthy_timeout_threshold", TimeValue.timeValueSeconds(60), TimeValue.timeValueMillis(1),
+            Setting.Property.NodeScope, Setting.Property.Dynamic);


Given that the FS health check has deeper impacts, would having retries of, let's say, 3 make sense to eliminate any transient issues?

That would either delay the detection or increase the chances of false positives. Maybe we can extend support in the future. Not expanding the scope of this PR.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

opensearch-ci-bot · 2021-08-30T17:37:51Z

✅ Gradle Wrapper Validation success c7c909c

opensearch-ci-bot · 2021-08-30T17:38:49Z

✅ DCO Check Passed c7c909c

opensearch-ci-bot · 2021-08-30T17:48:12Z

✅ Gradle Precommit success c7c909c

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

opensearch-ci-bot · 2021-08-30T18:36:19Z

✅ Gradle Wrapper Validation success bec0b33

opensearch-ci-bot · 2021-08-30T18:36:49Z

✅ DCO Check Passed bec0b33

opensearch-ci-bot · 2021-08-30T18:39:49Z

✅ Gradle Precommit success bec0b33

itiyamas · 2021-08-30T18:58:18Z

start gradle check

opensearch-ci-bot · 2021-08-30T19:30:00Z

❌ Gradle Check failure bec0b33
Log 446

Reports 446

Bukhtawar · 2021-08-31T05:28:22Z

Tests with failures:
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=search/160_exists_query/Test exists query on half_float field in empty index}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=suggest/30_context/Category suggest context from path should work}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=search/issue9606/Test search_type=dfs_query_and_fetch not supported from REST layer}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=exists/61_realtime_refresh_with_types/Realtime Refresh}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=search.aggregation/40_range/Date Range Missing}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=mget/16_basic_with_types/Basic multi-get}

Tests with failures:
 - org.opensearch.plugins.InstallPluginCommandTests.testOfficialPluginStaging {p0=sun.nio.fs.LinuxFileSystem@f978b40 p1=org.opensearch.plugins.InstallPluginCommandTests$$Lambda$212/0x0000000800d73ba0@24fc089f}
 - org.opensearch.plugins.InstallPluginCommandTests.classMethod

Tests with failures:
 - org.opensearch.http.nio.NioHttpServerTransportTests.testLargeCompressedResponse

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

opensearch-ci-bot · 2021-08-31T06:09:10Z

✅ DCO Check Passed fc17516

opensearch-ci-bot · 2021-08-31T06:09:55Z

✅ Gradle Wrapper Validation success fc17516

opensearch-ci-bot · 2021-08-31T06:19:06Z

✅ Gradle Precommit success fc17516

itiyamas · 2021-08-31T06:36:40Z

start gradle check

opensearch-ci-bot · 2021-08-31T07:02:12Z

❌ Gradle Check failure fc17516
Log 447

Reports 447

server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java

opensearch-ci-bot · 2021-09-05T10:14:15Z

✅ Gradle Wrapper Validation success 0e4c57a

opensearch-ci-bot · 2021-09-05T10:14:48Z

✅ DCO Check Passed 0e4c57a

opensearch-ci-bot · 2021-09-05T10:21:51Z

✅ Gradle Precommit success 0e4c57a

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

opensearch-ci-bot · 2021-09-11T17:25:53Z

✅ Gradle Wrapper Validation success a908188

opensearch-ci-bot · 2021-09-11T17:26:48Z

✅ DCO Check Passed a908188

opensearch-ci-bot · 2021-09-11T17:34:56Z

✅ Gradle Precommit success a908188

adnapibar · 2021-09-14T17:04:06Z

start gradle check

opensearch-ci-bot · 2021-09-14T17:24:43Z

❌ Gradle Check failure a908188
Log 523

Reports 523

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

opensearch-ci-bot · 2021-09-15T17:29:20Z

✅ DCO Check Passed 7848646

opensearch-ci-bot · 2021-09-15T17:30:07Z

✅ Gradle Wrapper Validation success 7848646

opensearch-ci-bot · 2021-09-15T17:39:41Z

✅ Gradle Precommit success 7848646

adnapibar · 2021-09-15T17:44:16Z

start gradle check

opensearch-ci-bot · 2021-09-15T18:28:23Z

✅ Gradle Check success 7848646
Log 527

Reports 527

…project#1167)

This will cause the leader stuck on IO during publication to step down and eventually trigger a leader election.

Issue Description
---
The publication of cluster state is time bound to 30s by the cluster.publish.timeout setting. If this time is reached before the new cluster state is committed, then the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in the leader: when it tries to publish the new cluster state, it first tries to acquire a lock to flush the new state to disk under a mutex. The same lock is used to cancel the publication on timeout. Below is the state of the timeout scheduler meant to cancel the publication.

So essentially, if the flushing of cluster state is stuck on IO, so will be the cancellation of the publication, since both of them share the same mutex. So the leader will not step down and will effectively block the cluster from making progress.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
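The shared-mutex problem described in this commit message can be illustrated with a small, self-contained sketch. It does not use any OpenSearch classes; `SharedMutexDemo`, the latches, and the 200 ms wait are all hypothetical and only model a flush that holds a lock while stuck on IO and a timeout handler that needs the same lock to cancel.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a publication flush and its cancellation sharing one mutex.
class SharedMutexDemo {
    // Returns true if the cancellation could not get the mutex while the flush was stuck.
    static boolean cancellationBlockedWhileFlushStuck() throws InterruptedException {
        ReentrantLock mutex = new ReentrantLock();
        CountDownLatch flushHoldsLock = new CountDownLatch(1);
        CountDownLatch ioCompletes = new CountDownLatch(1);

        Thread publisher = new Thread(() -> {
            mutex.lock();
            try {
                flushHoldsLock.countDown();
                ioCompletes.await(); // models the flush being stuck on IO
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                mutex.unlock();
            }
        });
        publisher.start();
        flushHoldsLock.await();

        // The timeout handler needs the same mutex to cancel the publication.
        boolean acquired = mutex.tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            mutex.unlock();
        }

        ioCompletes.countDown(); // let the "IO" finish so the demo terminates
        publisher.join();
        return !acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cancellation blocked: " + cancellationBlockedWhileFlushStuck());
    }
}
```

Because the cancellation path cannot make progress in this situation, the change in this PR detects the stuck IO from a separate health check instead of relying on the publication timeout.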

dblock · 2021-12-10T18:08:16Z

Do we want this in 1.3? @Bukhtawar

Bukhtawar · 2022-01-27T10:52:14Z

yes, will raise backporting for the same

…1269)

* Introduce FS Health HEALTHY threshold to fail stuck node (#1167)

This will cause the leader stuck on IO during publication to step down and eventually trigger a leader election.

Issue Description
---
The publication of cluster state is time bound to 30s by the cluster.publish.timeout setting. If this time is reached before the new cluster state is committed, then the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in the leader: when it tries to publish the new cluster state, it first tries to acquire a lock to flush the new state to disk under a mutex. The same lock is used to cancel the publication on timeout. Below is the state of the timeout scheduler meant to cancel the publication.

So essentially, if the flushing of cluster state is stuck on IO, so will be the cancellation of the publication, since both of them share the same mutex. So the leader will not step down and will effectively block the cluster from making progress.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up settings

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up tests

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up tests

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up tests

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

Introduce FS Health HEALTHY threshold to fail stuck node

88f885b

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

Minor fix up

d1789a5

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

itiyamas suggested changes Aug 30, 2021

getsaurabh02 reviewed Aug 30, 2021

Review comments

c7c909c

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

Bukhtawar requested a review from itiyamas August 30, 2021 17:43

Review comments

bec0b33

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

itiyamas approved these changes Aug 30, 2021

Increasing refresh interval

fc17516

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

getsaurabh02 reviewed Aug 31, 2021

Merge branch 'main' into fs

a908188

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

Merge branch 'main' into fs

7848646

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

adnapibar added v1.2.0 Issues related to version 1.2.0 bug Something isn't working labels Sep 16, 2021

adnapibar merged commit f7e2984 into opensearch-project:main Sep 17, 2021

anasalkouz mentioned this pull request Oct 4, 2021

[BUG] IO Freeze on Leader cause cluster publication to get stuck #1165

Open

nknize mentioned this pull request Oct 27, 2021

[CI] FsHealthServiceTests.testFailsHealthOnHungIOBeyondHealthyTimeout #1450

Closed

tlfeng mentioned this pull request Dec 10, 2021

Fix unit test testFailsHealthOnHungIOBeyondHealthyTimeout() by incresing the max waiting time before assertion #1692

Merged

5 tasks

dblock mentioned this pull request Feb 7, 2022

[Backport] Introduce FS Health HEALTHY threshold to fail stuck node #1269

Merged

5 tasks

dblock mentioned this pull request May 6, 2022

Adding @Bukhtawar to OpenSearch maintainers. #3231

Merged

1 task
