-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: If startup hangs (esp on preconditions), shutdown server correctly #4889
fix: If startup hangs (esp on preconditions), shutdown server correctly #4889
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
How'd you test it out? Were you able to reproduce locally? |
I introduced a precondition that never completes. Then I ran the server and would then issue a kill command, which would call the shutdown method. I pretty much verified that the startup method completes immediately and that the jvm quits within some reasonable window, e.g. 5-10 seconds after that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review.
Nothing major in here, just wanted to flag up some potential issues and some suggests.
@@ -391,6 +404,13 @@ private KsqlFailedPrecondition(final String message) { | |||
} | |||
} | |||
|
|||
static final class AbortApplicationStartException extends Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using a checked exception here? We don't really use checked exceptions any more. They're a PITA! Why not extend KsqlServerException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Extended KsqlServerException
.
@@ -53,6 +55,7 @@ public static void retryWithBackoff( | |||
final int initialWaitMs, | |||
final int maxWaitMs, | |||
final Runnable runnable, | |||
final AtomicBoolean stopRetrying, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rather than taking a AtomicBoolean
, why not take a Supplier<Boolean>
? The latter is more general and flexible, allowing method that returns a boolean to be used.
You can then wire this up by passing stopRetrying::get
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion.
try { | ||
RetryUtil.retryWithBackoff(3, 1, 100, runnable, sleep, stopRetrying, Collections.emptyList()); | ||
fail("retry should have thrown"); | ||
} catch (final RuntimeException e) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this with less code with:
assertThrows(
RuntimeException.class,
() -> RetryUtil.retryWithBackoff(3, 1, 100, runnable, sleep, stopRetrying, Collections.emptyList())
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
predicates | ||
); | ||
} catch (KsqlFailedPrecondition e) { | ||
log.error("Finished Precondition retrying", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message seems a little confusing. It's an error, but the text sounds more like a success... maybe
log.error("Failed to meet preconditions. Existing...", e);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. Changed.
// startup thread if it's been hanging. | ||
shuttingDown.set(true); | ||
if (startAsyncThread != null) { | ||
startAsyncThread.interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're interrupting the start up thread, then there's a high chance we'll see interrupt exceptions being thrown. Yet I don't see anywhere that we catch this, now expected, exception. This may lead to the exception being logged out, which isn't ideal if its expected.
Maybe the code in startAsync
need to also catch interrupted exceptions and log out the Aborting application start
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, it's caught in RetryUtil
:
duration -> {
try {
Thread.sleep(duration);
} catch (final InterruptedException e) {
log.debug("retryWithBackoff interrupted while sleeping");
}
},
I believe it would only throw this exception if someone did something that could be interrupted (like sleeping), and since it's a checked exception, they would have to catch it explicitly (or let it bubble up explicitly). In this case, I would want them to catch it and then throw a AbortApplicationStartException
which is already handled. So I don't think catching InterruptedException
s at the startAsync
level would be required. Is that reasonable?
@@ -415,7 +421,7 @@ public void shouldNotInitializeUntilPreconditionsChecked() { | |||
} | |||
|
|||
@Test | |||
public void shouldConfigureRocksDBConfigSetter() { | |||
public void shouldConfigureRocksDBConfigSetter() throws AbortApplicationStartException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make AbortApplicationStartException
a non-checked exception we can drop all these throws
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@big-andy-coates Replied to all of your comments:
@@ -391,6 +404,13 @@ private KsqlFailedPrecondition(final String message) { | |||
} | |||
} | |||
|
|||
static final class AbortApplicationStartException extends Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Extended KsqlServerException
.
predicates | ||
); | ||
} catch (KsqlFailedPrecondition e) { | ||
log.error("Finished Precondition retrying", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. Changed.
// startup thread if it's been hanging. | ||
shuttingDown.set(true); | ||
if (startAsyncThread != null) { | ||
startAsyncThread.interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, it's caught in RetryUtil
:
duration -> {
try {
Thread.sleep(duration);
} catch (final InterruptedException e) {
log.debug("retryWithBackoff interrupted while sleeping");
}
},
I believe it would only throw this exception if someone did something that could be interrupted (like sleeping), and since it's a checked exception, they would have to catch it explicitly (or let it bubble up explicitly). In this case, I would want them to catch it and then throw a AbortApplicationStartException
which is already handled. So I don't think catching InterruptedException
s at the startAsync
level would be required. Is that reasonable?
@@ -415,7 +421,7 @@ public void shouldNotInitializeUntilPreconditionsChecked() { | |||
} | |||
|
|||
@Test | |||
public void shouldConfigureRocksDBConfigSetter() { | |||
public void shouldConfigureRocksDBConfigSetter() throws AbortApplicationStartException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -53,6 +55,7 @@ public static void retryWithBackoff( | |||
final int initialWaitMs, | |||
final int maxWaitMs, | |||
final Runnable runnable, | |||
final AtomicBoolean stopRetrying, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion.
try { | ||
RetryUtil.retryWithBackoff(3, 1, 100, runnable, sleep, stopRetrying, Collections.emptyList()); | ||
fail("retry should have thrown"); | ||
} catch (final RuntimeException e) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Description
Often when the server is starting up, it waits for preconditions to be met. If the preconditions can't be met, it tries indefinitely. If the server shuts down, the JVM will not stop due to this unending preconditions effort.
This change allows the preconditions to be interrupted during shutdown so that the application and JVM exit properly.
Fixes #4779
Testing done
mvn package
Reviewer checklist