Introduce FS Health HEALTHY threshold to fail stuck node #1167

Merged
merged 13 commits into from
Sep 17, 2021

Conversation

Bukhtawar
Collaborator

@Bukhtawar Bukhtawar commented Aug 28, 2021

Signed-off-by: Bukhtawar Khan bukhtawa@amazon.com

Description

FS health should fail on stuck IO. This causes a node that is stuck on IO to eventually be removed from the cluster.

Issues Resolved

#1165

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success 88f885b

@opensearch-ci-bot
Collaborator

✅   DCO Check Passed 88f885b

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success 88f885b

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
Collaborator

✅   DCO Check Passed d1789a5

@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success d1789a5

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success d1789a5

@@ -347,7 +403,7 @@ public void force(boolean metaData) throws IOException {

private static class FileSystemFsyncHungProvider extends FilterFileSystemProvider {

AtomicBoolean injectIOException = new AtomicBoolean();
AtomicBoolean injectIODelay = new AtomicBoolean();
Contributor

The delay is achieved by using Thread.sleep. You could use a latch to make this more deterministic in tests.

Collaborator Author

The tests actually do use a latch. We are mimicking a stuck IO operation, which is achieved by making the fsync operation sleep. I have, however, improved the mock to make it more deterministic based on your suggestion.
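
For illustration only, here is a minimal sketch of the latch-based approach discussed above. It is not the mock from this PR: `StuckFsyncSketch`, `BlockingFsync`, `entered`, and `release` are hypothetical names, and only `injectIODelay` mirrors a field shown in the diff. The idea is that the simulated fsync blocks on a latch instead of sleeping, so the test controls exactly when the IO gets stuck and when it is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class StuckFsyncSketch {

    /** Hypothetical stand-in for a mocked fsync that can be made to hang. */
    static final class BlockingFsync {
        final AtomicBoolean injectIODelay = new AtomicBoolean();
        final CountDownLatch entered = new CountDownLatch(1); // counted down once fsync is stuck
        final CountDownLatch release = new CountDownLatch(1); // the test counts this down to unstick it

        void force() {
            if (injectIODelay.get()) {
                entered.countDown();
                try {
                    release.await(); // blocks until the test releases it; no timing guesswork
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

    /** Returns true if the fsync was observed stuck and then completed after release. */
    static boolean simulate() {
        BlockingFsync fsync = new BlockingFsync();
        fsync.injectIODelay.set(true);
        AtomicBoolean completed = new AtomicBoolean();

        Thread io = new Thread(() -> {
            fsync.force();
            completed.set(true);
        });
        io.start();
        try {
            // Wait until the IO thread has definitely entered the stuck section.
            if (!fsync.entered.await(5, TimeUnit.SECONDS)) {
                return false;
            }
            boolean stuck = !completed.get(); // cannot have completed before release
            fsync.release.countDown();
            io.join(5_000);
            return stuck && completed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("deterministic stuck-IO simulation ok: " + simulate());
    }
}
```

Compared with `Thread.sleep`, the test does not depend on a sleep duration being long enough on a slow CI host.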

statusInfo = new StatusInfo(HEALTHY, "health check disabled");
} else if (brokenLock) {
statusInfo = new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");
} else if (lastRunTimeMillis.get() > Long.MIN_VALUE && currentTimeMillisSupplier.getAsLong() -
Contributor

This won't catch the case where the first-ever run of the health check does not complete.

Collaborator Author

Good point. Done!


@Nullable
private volatile Set<Path> unhealthyPaths;

public static final Setting<Boolean> ENABLED_SETTING =
Setting.boolSetting("monitor.fs.health.enabled", true, Setting.Property.NodeScope, Setting.Property.Dynamic);
public static final Setting<TimeValue> REFRESH_INTERVAL_SETTING =
- Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(120), TimeValue.timeValueMillis(1),
+ Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(60), TimeValue.timeValueMillis(1),
Member

Why do we need to decrease the refresh interval here?

Collaborator Author

@Bukhtawar Bukhtawar Aug 30, 2021

We want the checks to be more frequent so that we can catch issues faster.

Comment on lines 149 to 151
} else if (lastRunTimeMillis.get() > Long.MIN_VALUE && currentTimeMillisSupplier.getAsLong() -
lastRunTimeMillis.get() > refreshInterval.millis() + healthyTimeoutThreshold.millis()) {
statusInfo = new StatusInfo(UNHEALTHY, "healthy threshold breached");
Member

Can we simplify this condition by having current run start time, instead of relying on the last run time?

Collaborator Author

@Bukhtawar Bukhtawar Aug 30, 2021

Makes sense. Good point
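
A rough sketch of what the suggestion could look like, assuming the check records the start time of the run currently in progress. This is not the code that was merged: `HealthyThresholdSketch`, `runStarted`, `runFinished`, and `NOT_RUNNING` are hypothetical names. Tracking the current run's start time also covers the earlier comment about a first-ever run that never completes, because no previously completed run is required.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

class HealthyThresholdSketch {

    enum Status { HEALTHY, UNHEALTHY }

    // Sentinel meaning "no health check run is currently in progress" (hypothetical).
    static final long NOT_RUNNING = Long.MIN_VALUE;

    private final LongSupplier currentTimeMillisSupplier;
    private final long healthyTimeoutThresholdMillis;
    private final AtomicLong currentRunStartMillis = new AtomicLong(NOT_RUNNING);

    HealthyThresholdSketch(LongSupplier currentTimeMillisSupplier, long healthyTimeoutThresholdMillis) {
        this.currentTimeMillisSupplier = currentTimeMillisSupplier;
        this.healthyTimeoutThresholdMillis = healthyTimeoutThresholdMillis;
    }

    /** Called when a health check run begins. */
    void runStarted() {
        currentRunStartMillis.set(currentTimeMillisSupplier.getAsLong());
    }

    /** Called when a health check run completes, successfully or not. */
    void runFinished() {
        currentRunStartMillis.set(NOT_RUNNING);
    }

    /** UNHEALTHY only if a run is in progress and has exceeded the threshold. */
    Status status() {
        long start = currentRunStartMillis.get();
        if (start != NOT_RUNNING
            && currentTimeMillisSupplier.getAsLong() - start > healthyTimeoutThresholdMillis) {
            return Status.UNHEALTHY;
        }
        return Status.HEALTHY;
    }

    public static void main(String[] args) {
        AtomicLong now = new AtomicLong(0); // fake clock so the example is deterministic
        HealthyThresholdSketch health = new HealthyThresholdSketch(now::get, 60_000);

        health.runStarted();
        now.set(30_000);
        System.out.println(health.status()); // HEALTHY: run is within the threshold

        now.set(60_001);
        System.out.println(health.status()); // UNHEALTHY: run is stuck past the threshold

        health.runFinished();
        System.out.println(health.status()); // HEALTHY: no run in progress
    }
}
```

With this shape the condition no longer needs to add `refreshInterval` to the threshold, since it measures the duration of the run itself rather than the gap since the last completed run.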

Comment on lines +96 to +98
public static final Setting<TimeValue> HEALTHY_TIMEOUT_SETTING =
Setting.timeSetting("monitor.fs.health.healthy_timeout_threshold", TimeValue.timeValueSeconds(60), TimeValue.timeValueMillis(1),
Setting.Property.NodeScope, Setting.Property.Dynamic);
Member

Given that the FS health check has deeper impacts, would it make sense to have retries (say, 3) to rule out transient issues?

Collaborator Author

That would either delay detection or increase the chances of false positives. Maybe we can extend support in the future. I am not expanding the scope of this PR.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success c7c909c

@opensearch-ci-bot
Collaborator

✅   DCO Check Passed c7c909c

@Bukhtawar Bukhtawar requested a review from itiyamas August 30, 2021 17:43
@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success c7c909c

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success bec0b33

@opensearch-ci-bot
Collaborator

✅   DCO Check Passed bec0b33

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success bec0b33

@itiyamas
Contributor

start gradle check

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure bec0b33
Log 446

Reports 446

@Bukhtawar
Collaborator Author

Tests with failures:
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=search/160_exists_query/Test exists query on half_float field in empty index}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=suggest/30_context/Category suggest context from path should work}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=search/issue9606/Test search_type=dfs_query_and_fetch not supported from REST layer}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=exists/61_realtime_refresh_with_types/Realtime Refresh}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=search.aggregation/40_range/Date Range Missing}
 - org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=mget/16_basic_with_types/Basic multi-get}
Tests with failures:
 - org.opensearch.plugins.InstallPluginCommandTests.testOfficialPluginStaging {p0=sun.nio.fs.LinuxFileSystem@f978b40 p1=org.opensearch.plugins.InstallPluginCommandTests$$Lambda$212/0x0000000800d73ba0@24fc089f}
 - org.opensearch.plugins.InstallPluginCommandTests.classMethod
Tests with failures:
 - org.opensearch.http.nio.NioHttpServerTransportTests.testLargeCompressedResponse

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
Collaborator

✅   DCO Check Passed fc17516

@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success fc17516

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success fc17516

@itiyamas
Contributor

start gradle check

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure fc17516
Log 447

Reports 447

@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success 0e4c57a

@opensearch-ci-bot
Collaborator

✅   DCO Check Passed 0e4c57a

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success 0e4c57a

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success a908188

@opensearch-ci-bot
Collaborator

✅   DCO Check Passed a908188

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success a908188

@adnapibar
Contributor

start gradle check

@opensearch-ci-bot
Collaborator

❌   Gradle Check failure a908188
Log 523

Reports 523

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
@opensearch-ci-bot
Collaborator

✅   DCO Check Passed 7848646

@opensearch-ci-bot
Collaborator

✅   Gradle Wrapper Validation success 7848646

@opensearch-ci-bot
Collaborator

✅   Gradle Precommit success 7848646

@adnapibar
Contributor

start gradle check

@opensearch-ci-bot
Collaborator

✅   Gradle Check success 7848646
Log 527

Reports 527

@adnapibar added the v1.2.0 and bug labels Sep 16, 2021
@adnapibar adnapibar merged commit f7e2984 into opensearch-project:main Sep 17, 2021
Bukhtawar added a commit to Bukhtawar/OpenSearch that referenced this pull request Sep 22, 2021
…project#1167)

This will cause the leader stuck on IO during publication to step down and eventually trigger a leader election.

Issue Description
---
Publication of the cluster state is time-bound to 30s by the cluster.publish.timeout setting. If this time is reached before the new cluster state is committed, the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in the leader: when it publishes a new cluster state, it first acquires a mutex and flushes the new state to disk while holding it. The same mutex is used to cancel the publication on timeout. Below is the state of the timeout scheduler meant to cancel the publication. If the flush of the cluster state is stuck on IO, the cancellation of the publication is stuck as well, since both share the same mutex. The leader therefore does not step down and effectively blocks the cluster from making progress.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
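
The shared-mutex problem described in the commit message can be reproduced in miniature. The sketch below is only an illustration and does not use any OpenSearch classes: `SharedMutexSketch` and its members are hypothetical, and a `ReentrantLock` with a bounded `tryLock` stands in for the real mutex so the demo terminates instead of hanging the way the real cancellation would.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class SharedMutexSketch {

    /** Returns true if the cancellation could take the mutex while the flush is stuck. */
    static boolean cancelCanRunWhileFlushIsStuck() {
        ReentrantLock mutex = new ReentrantLock();            // shared by flush and cancel
        CountDownLatch flushHoldsMutex = new CountDownLatch(1);
        CountDownLatch unstickIO = new CountDownLatch(1);

        // Simulated publication: takes the mutex, then gets stuck "flushing to disk".
        Thread publisher = new Thread(() -> {
            mutex.lock();
            try {
                flushHoldsMutex.countDown();
                unstickIO.await(); // stands in for an fsync that never returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                mutex.unlock();
            }
        });
        publisher.start();

        try {
            flushHoldsMutex.await();
            // Simulated timeout handler: it needs the same mutex to cancel the publication.
            boolean acquired = mutex.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                mutex.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            unstickIO.countDown(); // let the demo finish cleanly
            try {
                publisher.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        // prints "cancel ran while flush stuck: false"
        System.out.println("cancel ran while flush stuck: " + cancelCanRunWhileFlushIsStuck());
    }
}
```

Because the timeout handler cannot make progress while the flush holds the mutex, a separate signal such as the FS health check is needed to mark the node unhealthy.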
@dblock
Member

dblock commented Dec 10, 2021

Do we want this in 1.3? @Bukhtawar

@Bukhtawar
Collaborator Author

Yes, I will raise a backport for this.

dblock pushed a commit that referenced this pull request Feb 7, 2022
…1269)

* Introduce FS Health HEALTHY threshold to fail stuck node (#1167)

This will cause the leader stuck on IO during publication to step down and eventually trigger a leader election.

Issue Description
---
Publication of the cluster state is time-bound to 30s by the cluster.publish.timeout setting. If this time is reached before the new cluster state is committed, the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in the leader: when it publishes a new cluster state, it first acquires a mutex and flushes the new state to disk while holding it. The same mutex is used to cancel the publication on timeout. Below is the state of the timeout scheduler meant to cancel the publication. If the flush of the cluster state is stuck on IO, the cancellation of the publication is stuck as well, since both share the same mutex. The leader therefore does not step down and effectively blocks the cluster from making progress.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up settings

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up tests

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up tests

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>

* Fix up tests

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Labels
bug Something isn't working v1.2.0 Issues related to version 1.2.0
7 participants