Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: If startup hangs (esp on preconditions), shutdown server correctly #4889

Conversation

AlanConfluent
Copy link
Member

Description

Often when the server is starting up, it waits for preconditions to be met. If the preconditions can't be met, it tries indefinitely. If the server shuts down, the JVM will not stop due to this unending preconditions effort.

This change allows the preconditions to be interrupted during shutdown so that the application and JVM exit properly.

Fixes #4779

Testing done

mvn package

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@AlanConfluent AlanConfluent requested a review from rodesai March 24, 2020 23:08
@AlanConfluent AlanConfluent requested a review from a team as a code owner March 24, 2020 23:08
Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rodesai
Copy link
Contributor

rodesai commented Mar 25, 2020

How'd you test it out? Were you able to reproduce locally?

@AlanConfluent
Copy link
Member Author

How'd you test it out? Were you able to reproduce locally?

I introduced a precondition that never completes. Then I ran the server and would then issue a kill command, which would call the shutdown method. I pretty much verified that the startup method completes immediately and that the jvm quits within some reasonable window, e.g. 5-10 seconds after that.

@AlanConfluent AlanConfluent merged commit 20c4b59 into confluentinc:master Mar 25, 2020
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review.

Nothing major in here, just wanted to flag up some potential issues and some suggests.

@@ -391,6 +404,13 @@ private KsqlFailedPrecondition(final String message) {
}
}

static final class AbortApplicationStartException extends Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using a checked exception here? We don't really use checked exceptions any more. They're a PITA! Why not extend KsqlServerException?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Extended KsqlServerException.

@@ -53,6 +55,7 @@ public static void retryWithBackoff(
final int initialWaitMs,
final int maxWaitMs,
final Runnable runnable,
final AtomicBoolean stopRetrying,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rather than taking a AtomicBoolean, why not take a Supplier<Boolean>? The latter is more general and flexible, allowing method that returns a boolean to be used.

You can then wire this up by passing stopRetrying::get.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion.

Comment on lines +100 to +104
try {
RetryUtil.retryWithBackoff(3, 1, 100, runnable, sleep, stopRetrying, Collections.emptyList());
fail("retry should have thrown");
} catch (final RuntimeException e) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this with less code with:

assertThrows(
   RuntimeException.class,
   () -> RetryUtil.retryWithBackoff(3, 1, 100, runnable, sleep, stopRetrying, Collections.emptyList())
);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

predicates
);
} catch (KsqlFailedPrecondition e) {
log.error("Finished Precondition retrying", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message seems a little confusing. It's an error, but the text sounds more like a success... maybe

log.error("Failed to meet preconditions. Existing...", e);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. Changed.

// startup thread if it's been hanging.
shuttingDown.set(true);
if (startAsyncThread != null) {
startAsyncThread.interrupt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're interrupting the start up thread, then there's a high chance we'll see interrupt exceptions being thrown. Yet I don't see anywhere that we catch this, now expected, exception. This may lead to the exception being logged out, which isn't ideal if its expected.

Maybe the code in startAsync need to also catch interrupted exceptions and log out the Aborting application start?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, it's caught in RetryUtil:

duration -> {
          try {
            Thread.sleep(duration);
          } catch (final InterruptedException e) {
            log.debug("retryWithBackoff interrupted while sleeping");
          }
        },

I believe it would only throw this exception if someone did something that could be interrupted (like sleeping), and since it's a checked exception, they would have to catch it explicitly (or let it bubble up explicitly). In this case, I would want them to catch it and then throw a AbortApplicationStartException which is already handled. So I don't think catching InterruptedExceptions at the startAsync level would be required. Is that reasonable?

@@ -415,7 +421,7 @@ public void shouldNotInitializeUntilPreconditionsChecked() {
}

@Test
public void shouldConfigureRocksDBConfigSetter() {
public void shouldConfigureRocksDBConfigSetter() throws AbortApplicationStartException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make AbortApplicationStartException a non-checked exception we can drop all these throws...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member Author

@AlanConfluent AlanConfluent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@big-andy-coates Replied to all of your comments:

#4907

@@ -391,6 +404,13 @@ private KsqlFailedPrecondition(final String message) {
}
}

static final class AbortApplicationStartException extends Exception {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Extended KsqlServerException.

predicates
);
} catch (KsqlFailedPrecondition e) {
log.error("Finished Precondition retrying", e);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. Changed.

// startup thread if it's been hanging.
shuttingDown.set(true);
if (startAsyncThread != null) {
startAsyncThread.interrupt();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, it's caught in RetryUtil:

duration -> {
          try {
            Thread.sleep(duration);
          } catch (final InterruptedException e) {
            log.debug("retryWithBackoff interrupted while sleeping");
          }
        },

I believe it would only throw this exception if someone did something that could be interrupted (like sleeping), and since it's a checked exception, they would have to catch it explicitly (or let it bubble up explicitly). In this case, I would want them to catch it and then throw a AbortApplicationStartException which is already handled. So I don't think catching InterruptedExceptions at the startAsync level would be required. Is that reasonable?

@@ -415,7 +421,7 @@ public void shouldNotInitializeUntilPreconditionsChecked() {
}

@Test
public void shouldConfigureRocksDBConfigSetter() {
public void shouldConfigureRocksDBConfigSetter() throws AbortApplicationStartException {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -53,6 +55,7 @@ public static void retryWithBackoff(
final int initialWaitMs,
final int maxWaitMs,
final Runnable runnable,
final AtomicBoolean stopRetrying,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion.

Comment on lines +100 to +104
try {
RetryUtil.retryWithBackoff(3, 1, 100, runnable, sleep, stopRetrying, Collections.emptyList());
fail("retry should have thrown");
} catch (final RuntimeException e) {
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KSQL may get stuck shutting down if it's stuck in tryStart
3 participants