Clean up Node#close. #39317
Conversation
`Node#close` is pretty hard to rely on today:
- it might swallow exceptions
- it waits for 10 seconds for threads to terminate but doesn't signal anything if threads are still not terminated after 10 seconds

This commit makes `IOException`s propagate and splits `Node#close` into `Node#close` and `Node#awaitClose`, so that the decision about what to do if a node takes too long to close can be made on top of `Node#close`. It also adds synchronization to lifecycle transitions to make them atomic. I don't think this is a source of problems today, but it makes things easier to reason about.
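The split described above can be sketched roughly as follows. This is a hypothetical simplification (the class and method bodies are placeholders, not the actual `Node` implementation): `close()` initiates shutdown and lets `IOException`s propagate, while the boolean result of `awaitClose()` leaves the timeout policy to the caller.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the close()/awaitClose() split.
class SketchNode implements Closeable {
    private final CountDownLatch closed = new CountDownLatch(1);

    @Override
    public void close() throws IOException {
        // ... release resources here; IOExceptions propagate to the caller ...
        closed.countDown(); // signal that shutdown has completed
    }

    // Returns true if the node terminated within the timeout, false otherwise.
    public boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException {
        return closed.await(timeout, unit);
    }

    // Caller-side policy, layered on top of close() rather than baked into it.
    static void shutdown(SketchNode node) throws IOException, InterruptedException {
        node.close();
        if (node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IllegalStateException("Node didn't stop within 10 seconds.");
        }
    }
}
```

The key design point is that `close()` no longer decides on its own what a timeout means; different callers (bootstrap, tests, embedded nodes) can layer different policies on top.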
Pinging @elastic/es-core-infra

@elasticmachine run elasticsearch-ci/packaging-sample
I left a couple of comments.
} catch (IOException ex) {
    throw new ElasticsearchException("failed to stop node", ex);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    LogManager.getLogger(Bootstrap.class).warn("Thread got interrupted while waiting for the node to shutdown.");
How about we reverse the order here (i.e. first log the message, then restore the interrupt status)?
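The suggested ordering can be sketched like this (a minimal self-contained example; the real code logs via `LogManager`, here replaced with `System.err` as an assumption to keep the sketch runnable). Logging first means the warning is emitted before the thread runs with its interrupt flag set again.

```java
// Sketch of the reviewer's suggested ordering: log first, then restore
// the interrupt status, so subsequent code (including the logging call
// itself) does not execute with the interrupt flag already raised.
class InterruptOrder {
    static boolean handle(InterruptedException e) {
        // 1) log the warning (stand-in for the real logger call)
        System.err.println("Thread got interrupted while waiting for the node to shutdown.");
        // 2) restore the interrupt status last
        Thread.currentThread().interrupt();
        return Thread.currentThread().isInterrupted();
    }
}
```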
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    LogManager.getLogger(Bootstrap.class).warn("Thread got interrupted while waiting for the node to shutdown.");
Here too I suggest reversing the order.
lifecycle.moveToStarted();
for (LifecycleListener listener : listeners) {
    listener.afterStart();
synchronized (lifecycle) {
I think it can become tricky to synchronize on an object that can also be directly accessed by subclasses. The state of `Lifecycle` has only one field, which is declared `volatile`, and as far as I could see all usages in subclasses only query the state and only the base class modifies it. From that perspective we are safe, but it is easy to introduce subtle bugs. I wonder whether in the future it would make sense to think about encapsulating `Lifecycle` in `AbstractLifecycleComponent` to make this a bit more robust.
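The encapsulation being suggested might look roughly like this (hypothetical names and states, not the actual Elasticsearch classes): the lock object and the mutable state are private, subclasses can only read the state, and transitions are atomic under an internal mutex.

```java
// Hypothetical sketch of an encapsulated lifecycle: only the owning
// component can mutate the state; subclasses query it via an accessor.
class LifecycleComponentSketch {
    enum State { INITIALIZED, STARTED, STOPPED }

    private final Object mutex = new Object(); // private lock, unreachable by subclasses
    private volatile State state = State.INITIALIZED;

    public final State state() {
        return state; // subclasses may read but never write
    }

    public final void start() {
        synchronized (mutex) { // atomic transition
            if (state == State.STARTED) {
                throw new IllegalStateException("cannot start from " + state);
            }
            state = State.STARTED;
        }
    }

    public final void stop() {
        synchronized (mutex) {
            if (state != State.STARTED) {
                throw new IllegalStateException("cannot stop from " + state);
            }
            state = State.STOPPED;
        }
    }
}
```

Because the mutex never escapes the class, no subclass (or external caller) can accidentally hold the lock and deadlock or skew a transition.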
@@ -308,6 +312,15 @@ protected void doClose() throws IOException {
    indicesRefCount.decRef();
}

/**
 * Wait for this {@link IndicesService} to be effectively closed. When this returns, all shard and shard stores are closed and all
I think the comment is not entirely accurate? The guarantees described here only hold if this method returns `true`. If it returns `false`, it timed out waiting for the condition to happen.
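The contract being pointed out can be sketched with a plain latch (a hypothetical stand-in for `IndicesService`, not the real class): the "fully closed" guarantee is conditional on the boolean result.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the conditional guarantee: true means everything closed
// within the timeout; false means the wait timed out and no guarantee holds.
class CloseAwaitSketch {
    private final CountDownLatch closeLatch = new CountDownLatch(1);

    void markClosed() {
        closeLatch.countDown();
    }

    boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException {
        return closeLatch.await(timeout, unit);
    }
}
```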
@@ -183,8 +184,15 @@ public void run() {
    IOUtils.close(node, spawner);
    LoggerContext context = (LoggerContext) LogManager.getContext(false);
    Configurator.shutdown(context);
    if (node != null && node.awaitClose(10, TimeUnit.SECONDS) == false) {
        throw new IOException("Node didn't stop within 10 seconds. " +
I think `IllegalStateException` might be more appropriate?
@@ -267,6 +275,12 @@ private void start() throws NodeValidationException {
static void stop() throws IOException {
    try {
        IOUtils.close(INSTANCE.node, INSTANCE.spawner);
        if (INSTANCE.node != null && INSTANCE.node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IOException("Node didn't stop within 10 seconds. Any outstanding requests or tasks might get killed.");
I think `IllegalStateException` might be more appropriate?
ThreadPool threadPool = injector.getInstance(ThreadPool.class);
final boolean terminated = ThreadPool.terminate(threadPool, timeout, timeUnit);
I find it a bit odd to call a mutating method in an `await`-style method, but I do see the reason why it is needed here.
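The pattern under discussion resembles the standard `ExecutorService` idiom, sketched below with JDK types (this is an analogy, not the actual `ThreadPool.terminate` implementation): the await-style call must first mutate the pool, because a pool cannot terminate until it has been told to shut down.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of an await-style method that necessarily mutates its target:
// shutdown() stops new tasks from being accepted, then awaitTermination()
// waits (up to the timeout) for running tasks to drain.
class AwaitTerminateSketch {
    static boolean terminate(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // mutating step: required before termination can complete
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```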
@@ -1049,6 +1049,13 @@ public void close() throws IOException {
    closed.set(true);
    markNodeDataDirsAsPendingForWipe(node);
    node.close();
    try {
        if (node.awaitClose(10, TimeUnit.SECONDS) == false) {
            throw new IOException("Node didn't close within 10 seconds.");
I think `IllegalStateException` might be more appropriate?
Thanks @danielmitterdorfer for looking. I wonder whether you have thoughts on the overall approach as well?

The overall approach makes sense to me.
Thanks for iterating! LGTM
Thanks @danielmitterdorfer!
The changes in #39317 brought to light some concurrency issues in the close method of Recyclers, as we do not wait for threads running in the threadpool to finish prior to the closing of the PageCacheRecycler and the Recyclers that are used internally. #41695 was opened to address the concurrent close issues but, upon review, the closing of these classes is not really needed, as the instances should become available for garbage collection once there is no longer a reference to the closed node. Closes #41683