Simple mechanism to catch a runaway container that is not making progress #9243
Conversation
packages/test/test-end-to-end-tests/src/test/payloadSize.spec.ts
⯅ @fluid-example/bundle-size-tests: +3.75 KB
Baseline commit: 85bf932
…ost event when we're halfway through
@@ -1024,6 +1031,9 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents>
 		(this.mc.config.getBoolean(useDataStoreAliasingKey) ?? false) ||
 		(runtimeOptions.useDataStoreAliasing ?? false);
 
+		this.maxConsecutiveReconnects =
+			this.mc.config.getNumber(maxConsecutiveReconnectsKey) ?? this.defaultMaxConsecutiveReconnects;
Side note -- This config stuff is so nice!
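For readers outside the codebase, the pattern being praised here is: read an optional number from the config provider (the feature gate) and fall back to a hard-coded default. A minimal sketch, assuming a simplified config-provider shape; the key string and default value below are placeholders, not the real ones:

```ts
// Simplified stand-in for the monitoring context's config provider;
// the real interface in the codebase is richer than this.
interface IConfigProvider {
	getNumber(key: string): number | undefined;
}

// Placeholder key and default; see the diff above for the real identifiers.
const maxConsecutiveReconnectsKey = "Fluid.ContainerRuntime.MaxConsecutiveReconnects";
const defaultMaxConsecutiveReconnects = 7;

function readMaxConsecutiveReconnects(config: IConfigProvider): number {
	// Prefer the externally supplied gate; otherwise use the baked-in default.
	return config.getNumber(maxConsecutiveReconnectsKey) ?? defaultMaxConsecutiveReconnects;
}
```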
 		}
 
+		if (!this.pendingStateManager.hasPendingMessages()) {
+			// If there are no pending messages, we can always reconnect
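Based on the diff and the PR description, the surrounding logic amounts to: reconnecting is always allowed while there are no pending local messages; otherwise, count consecutive attempts that make no progress and give up once the limit is reached. A hedged sketch of that shape; apart from `hasPendingMessages`, the names here are invented for illustration:

```ts
// Illustrative model of the progress-tracking gate discussed in this
// thread; not the literal ContainerRuntime code.
class ReconnectProgressTracker {
	private consecutiveReconnects = 0;

	constructor(
		private readonly maxConsecutiveReconnects: number,
		private readonly pendingStateManager: { hasPendingMessages(): boolean },
	) {}

	public shouldContinueReconnecting(): boolean {
		if (!this.pendingStateManager.hasPendingMessages()) {
			// If there are no pending messages, we can always reconnect:
			// the failing connection cannot be caused by our local ops.
			return true;
		}
		if (this.maxConsecutiveReconnects < 0) {
			// A negative limit disables the feature (per the PR description).
			return true;
		}
		this.consecutiveReconnects += 1;
		return this.consecutiveReconnects < this.maxConsecutiveReconnects;
	}

	public onLocalOpProcessed(): void {
		// A processed local op means progress, so the runaway counter resets.
		this.consecutiveReconnects = 0;
	}
}
```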
I'm trying to compare this runtime-driven approach vs. a loader-driven approach (which would be my default, since I know that code better, ha).
As it is, if there are no local changes but the websocket is doomed for some other reason (e.g. the service is unhealthy and fails immediately on every connection attempt), the container will stay stuck. In other words, this fix only addresses the case where the websocket connection is faulting because of the local messages (e.g. ops over 1 MB).
I am tracking cases where the initial websocket connection never succeeds, which feels more likely to be unrelated to local ops, but maybe I'm wrong. I suppose we can move forward with this change for the sake of the 1 MB op problem, but I'll be very curious to see whether there are other classes of failures that are not related to the runtime layer.
I mentioned somewhere that it will also address #7137, so yes, it addresses a class of issues.
I'd also likely have started with the loader layer, but I'm fine with the runtime as well. Historically we have seen benefits from having less code in the loader layer, as it's the slowest layer in terms of propagating changes.
I think the best outcome is to do it through adapters, i.e. an implementation that is not part of either layer but can be included (or excluded) as we see fit. For example, it could be a proxy object that implements IRuntime (i.e. sits between the loader and the runtime), or maybe a driver (sits between the real driver and the loader). An example to consider is BlobAggregationStorage, though it still has smallish integration points in the runtime.
I like this direction because the layers continue to have a small number of responsibilities - the fewer, the better!
This is mostly food for thought, for the future :)
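To make the adapter idea concrete: a hypothetical proxy that forwards everything to the real runtime and intercepts only connection-state changes. The two-member interface below is a drastic reduction of the real IRuntime, purely for illustration:

```ts
// Hypothetical loader/runtime adapter, per the comment above. This is
// food-for-thought code; the real IRuntime surface is much larger.
interface IMinimalRuntime {
	setConnectionState(connected: boolean, clientId?: string): void;
	dispose(error?: Error): void;
}

class RunawayReconnectAdapter implements IMinimalRuntime {
	private disconnectsWithNoProgress = 0;

	constructor(
		private readonly inner: IMinimalRuntime,
		private readonly maxDisconnects: number,
	) {}

	public setConnectionState(connected: boolean, clientId?: string): void {
		if (!connected) {
			// Count disconnects; resetting on processed ops is omitted here.
			this.disconnectsWithNoProgress += 1;
			if (this.maxDisconnects >= 0 && this.disconnectsWithNoProgress >= this.maxDisconnects) {
				this.inner.dispose(new Error("Runaway reconnect loop detected"));
				return;
			}
		}
		this.inner.setConnectionState(connected, clientId);
	}

	public dispose(error?: Error): void {
		this.inner.dispose(error);
	}
}
```

Because the adapter only wraps the runtime, it can be included or excluded at container-load time without either layer knowing about it, which is the small-responsibilities property the comment above argues for.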
I was also thinking about this again - we should not hesitate to work with the server folks on problems like this. I'm expecting that will be one of the first follow-ups as I finish my analysis of hung websocket connections. This PR is a reasonable client-side mitigation of a class of problems that originate on the client.
Meanwhile, if the server is having trouble getting the websocket off the ground for an otherwise healthy client, they should be in a better position to detect and react than the client would be.
+			// to better identify false positives, if any. If the rate of this event
+			// matches `MaxReconnectsWithNoProgress`, we can safely cut down
Nice
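The diff comment above pairs with the earlier commit ("…ost event when we're halfway through"): a telemetry event fired at the halfway mark lets dashboards compare its rate against actual closures and flag false positives before the limit is ever lowered. A sketch of that idea, with an invented logger shape and event name:

```ts
// Placeholder logger interface and event name; not the real Fluid
// telemetry API.
interface ITelemetryLogger {
	sendTelemetryEvent(event: { eventName: string; attempts: number }): void;
}

function trackReconnectAttempt(
	attempts: number,
	maxAttempts: number,
	logger: ITelemetryLogger,
): void {
	// Fire exactly once, halfway to the limit, so the event rate can be
	// compared against the rate of containers actually being closed.
	if (attempts === Math.floor(maxAttempts / 2)) {
		logger.sendTelemetryEvent({
			eventName: "ReconnectsWithNoProgress",
			attempts,
		});
	}
}
```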
…ress (microsoft#9243) Part of microsoft#9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
Part of #9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
In a nutshell, we track how many times the runtime attempts to reconnect consecutively without processing any local ops. If we hit the limit, we close the container, as this is a strong indicator that ops are not going through and we are stuck in an endless loop of recovery attempts.
The limit is configurable via a feature gate; to disable the feature, set the limit to a negative value.
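To illustrate the disable switch: any config provider that returns a negative number for the gate turns the feature off, since the limit check never trips. The key name and provider shape here are assumptions carried over from the earlier sketch:

```ts
// Hypothetical settings object that disables the runaway-container check
// by driving the feature gate negative. The key string is a placeholder.
const disablingConfig: IConfigProvider = {
	getNumber: (key: string) =>
		key === maxConsecutiveReconnectsKey ? -1 : undefined,
};

// readMaxConsecutiveReconnects(disablingConfig) now returns -1, and the
// runtime treats a negative limit as "never close the container".
```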