-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Transport Stopped Exception #48930
Fix Transport Stopped Exception #48930
Conversation
When a node shutdowns, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates elastic#42612
Pinging @elastic/es-distributed (:Distributed/CRUD) |
Turns out new exception was unnecessary. The `Transport is stopped` messages were hardcoded in a few more places, fixed those too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few random comments :)
server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java
Outdated
Show resolved
Hide resolved
final boolean nodeIsClosing = | ||
cause instanceof NodeClosedException | ||
|| ExceptionsHelper.isTransportStoppedForAction(cause, "internal:cluster/shard/failure"); | ||
final boolean nodeIsClosing = cause instanceof NodeClosedException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use the ExceptionsHelper.unwrap(e, NodeClosedException.class) != null
pattern here too to be a little more resilient to future changes (and present unexpected paths that get us here ...?)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this too, but ended up keeping this as is, since the unwrap's are different in that unwrapCause
only unwraps ElasticsearchWrapperException
, whereas unwrap
unwraps all causes. It could be important in case we end up unwrapping a deep exception from another host? Do you think we should try to follow your suggestion anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah let's not do that. I def. doesn't look impossible for that exception from another node making it here (since it's not trivial to tell it the suggestion is pointless in the first place :)).
.../elasticsearch/action/support/replication/TransportReplicationActionRetryOnClosedNodeIT.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/indices/cluster/IndicesClusterStateServiceTests.java
Outdated
Show resolved
Hide resolved
.../elasticsearch/action/support/replication/TransportReplicationActionRetryOnClosedNodeIT.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :)
final boolean nodeIsClosing = | ||
cause instanceof NodeClosedException | ||
|| ExceptionsHelper.isTransportStoppedForAction(cause, "internal:cluster/shard/failure"); | ||
final boolean nodeIsClosing = cause instanceof NodeClosedException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah let's not do that. I def. doesn't look impossible for that exception from another node making it here (since it's not trivial to tell it the suggestion is pointless in the first place :)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simplification of exception handling in this PR a lot. I've left one comment, looking good o.w.
AFAICS, this exception is only bubbled up locally on a node, so we don't need to care about BWC (i.e. be able to detect the old TransportException bubbling up from other nodes).
server/src/main/java/org/elasticsearch/transport/TransportService.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine run elasticsearch-ci/packaging-sample-matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When a node shuts down, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates elastic#42612
When a node shuts down, `TransportService` moves to stopped state and then closes connections. If a request is done in between, an exception was thrown that was not retried in replication actions. Now throw a wrapped `NodeClosedException` exception instead, which is correctly handled in replication action. Fixed other usages too. Relates #42612
When a node shuts down,
TransportService
moves to stopped state andthen closes connections. If a request is done in between, an exception
was thrown that was not retried in replication actions. Now throw a
wrapped
NodeClosedException
exception instead, which is correctlyhandled in replication action. Fixed other usages too.
Relates #42612