Introduce FS Health HEALTHY threshold to fail stuck node #1167
Conversation
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ Gradle Wrapper Validation success 88f885b
✅ DCO Check Passed 88f885b
✅ Gradle Precommit success 88f885b
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ DCO Check Passed d1789a5
✅ Gradle Wrapper Validation success d1789a5
✅ Gradle Precommit success d1789a5
@@ -347,7 +403,7 @@ public void force(boolean metaData) throws IOException {

    private static class FileSystemFsyncHungProvider extends FilterFileSystemProvider {

        AtomicBoolean injectIOException = new AtomicBoolean();
        AtomicBoolean injectIODelay = new AtomicBoolean();
The delay is achieved by using Thread.sleep. You could use latching to make this more deterministic in tests.
The test actually does use a latch; we are mimicking a stuck IO operation, which is achieved by making the fsync operation go to sleep. I have, however, improved the mock to make it more deterministic based on your suggestion.
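For illustration only, a standalone sketch (not the PR's actual FileSystemFsyncHungProvider; names and structure are hypothetical) of how a latch can make an injected fsync hang deterministic, with the test releasing the latch once it has asserted the expected status:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the fsync hook of a FilterFileSystemProvider-style mock.
class HungFsyncSimulator {
    final AtomicBoolean injectIODelay = new AtomicBoolean();
    // Released by the test once it has observed the expected UNHEALTHY status,
    // so the "stuck" fsync unblocks deterministically instead of relying on a timed sleep.
    final CountDownLatch delayRelease = new CountDownLatch(1);

    void force(boolean metaData) throws InterruptedException {
        if (injectIODelay.get()) {
            delayRelease.await(); // simulates an fsync stuck on IO
        }
        // the real delegate fsync would run here
    }
}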
            statusInfo = new StatusInfo(HEALTHY, "health check disabled");
        } else if (brokenLock) {
            statusInfo = new StatusInfo(UNHEALTHY, "health check failed due to broken node lock");
        } else if (lastRunTimeMillis.get() > Long.MIN_VALUE && currentTimeMillisSupplier.getAsLong() -
This won't catch the case where the first ever run of the health check does not complete.
Good point. Done!
    @Nullable
    private volatile Set<Path> unhealthyPaths;

    public static final Setting<Boolean> ENABLED_SETTING =
        Setting.boolSetting("monitor.fs.health.enabled", true, Setting.Property.NodeScope, Setting.Property.Dynamic);
    public static final Setting<TimeValue> REFRESH_INTERVAL_SETTING =
-       Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(120), TimeValue.timeValueMillis(1),
+       Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(60), TimeValue.timeValueMillis(1),
Why do we need to decrease the refresh interval here?
We want the checks to be more frequent so that issues are caught faster.
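As a point of reference, a minimal sketch (assuming the standard OpenSearch Settings builder API; the values are arbitrary examples, not recommendations) of how these node settings could be tightened further:

import org.opensearch.common.settings.Settings;

class FsHealthSettingsExample {
    // Illustrative only: node settings that tighten the FS health check cadence.
    // The setting keys come from the diff above; the values are arbitrary examples.
    static Settings fasterFsHealthChecks() {
        return Settings.builder()
            .put("monitor.fs.health.enabled", true)
            .put("monitor.fs.health.refresh_interval", "30s")
            .put("monitor.fs.health.healthy_timeout_threshold", "60s")
            .build();
    }
}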
        } else if (lastRunTimeMillis.get() > Long.MIN_VALUE && currentTimeMillisSupplier.getAsLong() -
            lastRunTimeMillis.get() > refreshInterval.millis() + healthyTimeoutThreshold.millis()) {
            statusInfo = new StatusInfo(UNHEALTHY, "healthy threshold breached");
Can we simplify this condition by using the current run's start time, instead of relying on the last run time?
Makes sense. Good point
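A minimal sketch of the suggested approach (field and method names are hypothetical and may differ from the final FsHealthService code), which also covers the earlier concern about a first run that never completes:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Illustrative only: track when the current run started rather than when the last run finished.
class StalenessCheckSketch {
    // Recorded just before each health-check run starts, so even a first run
    // that never completes will eventually breach the threshold.
    final AtomicLong currentRunStartTimeMillis = new AtomicLong(Long.MIN_VALUE);
    final AtomicBoolean checkInProgress = new AtomicBoolean();

    boolean healthyThresholdBreached(LongSupplier nowMillis, long healthyTimeoutMillis) {
        long started = currentRunStartTimeMillis.get();
        return checkInProgress.get()
            && started > Long.MIN_VALUE
            && nowMillis.getAsLong() - started > healthyTimeoutMillis;
    }
}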
    public static final Setting<TimeValue> HEALTHY_TIMEOUT_SETTING =
        Setting.timeSetting("monitor.fs.health.healthy_timeout_threshold", TimeValue.timeValueSeconds(60), TimeValue.timeValueMillis(1),
            Setting.Property.NodeScope, Setting.Property.Dynamic);
Given that the FS health check has deeper impact, wouldn't having retries (say, 3) make sense, to eliminate any transient issues?
That would either delay detection or increase the chances of false positives. Maybe we can extend support in the future; I am not expanding the scope of this PR.
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ Gradle Wrapper Validation success c7c909c
✅ DCO Check Passed c7c909c
✅ Gradle Precommit success c7c909c
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ Gradle Wrapper Validation success bec0b33
✅ DCO Check Passed bec0b33
✅ Gradle Precommit success bec0b33
start gradle check
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ DCO Check Passed fc17516
✅ Gradle Wrapper Validation success fc17516
✅ Gradle Precommit success fc17516
start gradle check
server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java (outdated review comment, resolved)
✅ Gradle Wrapper Validation success 0e4c57a
✅ DCO Check Passed 0e4c57a
✅ Gradle Precommit success 0e4c57a
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ Gradle Wrapper Validation success a908188
✅ DCO Check Passed a908188
✅ Gradle Precommit success a908188
start gradle check
Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
✅ DCO Check Passed 7848646
✅ Gradle Wrapper Validation success 7848646
✅ Gradle Precommit success 7848646
start gradle check
…project#1167) This will cause a leader stuck on IO during publication to step down and eventually trigger a leader election.

Issue Description
---
The publication of cluster state is time-bound to 30s by the cluster.publish.timeout setting. If this time is reached before the new cluster state is committed, the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in the leader: when it tries to publish the new cluster state, it first acquires a mutex in order to flush the new state to disk. The same mutex is used to cancel the publication on timeout. Below is the state of the timeout scheduler that is meant to cancel the publication. So essentially, if the flushing of the cluster state is stuck on IO, so is the cancellation of the publication, since both share the same mutex. The leader therefore does not step down and effectively blocks the cluster from making progress.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
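For illustration only, a deliberately simplified sketch of the locking pattern described above (not the actual OpenSearch coordinator code): the timeout cancellation blocks behind a state flush that is stuck on IO because both paths take the same mutex.

import java.util.concurrent.CountDownLatch;

// Simplified model of the shared-mutex problem; names are hypothetical.
class PublicationMutexIllustration {
    private final Object mutex = new Object();
    private final CountDownLatch stuckDisk = new CountDownLatch(1); // never released: models IO that hangs

    void publish() throws InterruptedException {
        synchronized (mutex) {
            // Flushing the new cluster state to disk happens while holding the mutex.
            stuckDisk.await(); // fsync hangs, so the mutex is never released
        }
    }

    void cancelOnTimeout() {
        synchronized (mutex) { // blocks forever behind publish()
            // The cancellation (and the leader stepping down) never runs,
            // which is what the FS health HEALTHY threshold now guards against.
        }
    }
}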
Do we want this in 1.3? @Bukhtawar
Yes, will raise a backport for the same.
…1269) Backport of "Introduce FS Health HEALTHY threshold to fail stuck node" (#1167), described above, together with follow-up commits: Fix up settings, and Fix up tests (x3). Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
Signed-off-by: Bukhtawar Khan bukhtawa@amazon.com
Description
The FS health check should fail on stuck IO. This causes a node that is stuck on IO to eventually be removed from the cluster.
Issues Resolved
#1165
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.