
Simple mechanism to catch a runaway container that is not making progress #9243

Merged: 13 commits into microsoft:main on Mar 1, 2022

Conversation

@andre4i (Contributor) commented Feb 25, 2022

Part of #9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.

In a nutshell, we track how many times the runtime attempts to reconnect consecutively without processing any local ops. If we hit the limit, we close the container, as this is a strong indicator that ops are not going through and we are stuck in an endless loop of recovery attempts.

The limit is configurable via a feature gate; setting it to a negative value disables the feature.
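For illustration, here is a minimal sketch of the guard described above. All names (`ReconnectGuard`, `onReconnectAttempt`, etc.) are illustrative stand-ins, not the actual ContainerRuntime implementation from this PR:

```ts
// Minimal sketch of the runaway-reconnect guard; names are illustrative only.
class ReconnectGuard {
    private consecutiveReconnects = 0;

    constructor(
        // Limit taken from the feature gate; a negative value disables the check.
        private readonly maxConsecutiveReconnects: number,
        // Reports whether local ops are still waiting to be sequenced.
        private readonly hasPendingMessages: () => boolean,
        // Closes the container with a reason describing why progress stopped.
        private readonly closeContainer: (reason: string) => void,
    ) {}

    /**
     * Called on every reconnect attempt. Returns false if the container was
     * closed because the limit was reached.
     */
    public onReconnectAttempt(): boolean {
        if (this.maxConsecutiveReconnects < 0 || !this.hasPendingMessages()) {
            // Feature disabled, or no local ops pending: reconnecting is always fine.
            return true;
        }
        this.consecutiveReconnects++;
        if (this.consecutiveReconnects >= this.maxConsecutiveReconnects) {
            // Many reconnects without a single local op being processed is a strong
            // signal that ops are not going through; give up instead of looping.
            this.closeContainer("Too many consecutive reconnects without progress");
            return false;
        }
        return true;
    }

    /** Called whenever a local op is processed; any progress resets the counter. */
    public onLocalOpProcessed(): void {
        this.consecutiveReconnects = 0;
    }
}
```

A real integration would call `onReconnectAttempt` from the connection-state path and `onLocalOpProcessed` whenever a local op is acked, so that any progress resets the counter.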

@andre4i andre4i requested review from a team as code owners February 25, 2022 20:10
@github-actions github-actions bot added the "area: runtime" (Runtime related issues) and "area: tests" (Tests to add, test infrastructure improvements, etc) labels Feb 25, 2022
@github-actions github-actions bot removed the "area: tests" label Feb 25, 2022
@andre4i andre4i requested a review from vladsud February 25, 2022 22:43
@msfluid-bot (Collaborator) commented Feb 25, 2022

@fluid-example/bundle-size-tests: +3.75 KB
| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| aqueduct.js | 387.63 KB | 388.38 KB | +769 Bytes |
| containerRuntime.js | 188.19 KB | 188.94 KB | +769 Bytes |
| loader.js | 159.13 KB | 159.13 KB | No change |
| map.js | 203.27 KB | 204.03 KB | +769 Bytes |
| matrix.js | 297.8 KB | 298.55 KB | +769 Bytes |
| odspDriver.js | 158.9 KB | 158.9 KB | No change |
| odspPrefetchSnapshot.js | 46.3 KB | 46.3 KB | No change |
| sharedString.js | 317.96 KB | 318.71 KB | +769 Bytes |
| Total Size | 1.75 MB | 1.75 MB | +3.75 KB |

Baseline commit: 85bf932

Generated by 🚫 dangerJS against f0ada75

@@ -1024,6 +1031,9 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents>
(this.mc.config.getBoolean(useDataStoreAliasingKey) ?? false) ||
(runtimeOptions.useDataStoreAliasing ?? false);

this.maxConsecutiveReconnects =
this.mc.config.getNumber(maxConsecutiveReconnectsKey) ?? this.defaultMaxConsecutiveReconnects;
Member commented:
Side note -- This config stuff is so nice!
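As a usage illustration of the gate read in the diff above, a host could feed the setting through a config provider. The key string and the provider shape below are assumptions made for this sketch; the actual value behind `maxConsecutiveReconnectsKey` is defined in the runtime source, not shown in this excerpt.

```ts
// Hypothetical host-side settings; the real key string behind
// maxConsecutiveReconnectsKey lives in the runtime source and may differ.
const settings: Record<string, number> = {
    "Fluid.ContainerRuntime.MaxConsecutiveReconnects": 7, // custom limit
    // Setting a negative value instead would disable the check entirely.
};

// Loose shape of a config provider that getNumber() could read from
// (an assumption for this sketch, not the exact Fluid interface).
const configProvider = {
    getRawConfig: (name: string): number | undefined => settings[name],
};
```

If the gate is unset, the `?? this.defaultMaxConsecutiveReconnects` fallback in the diff above applies.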

}

if (!this.pendingStateManager.hasPendingMessages()) {
// If there are no pending messages, we can always reconnect
Member commented:
I'm trying to compare this runtime-driven approach v. a loader-driven approach (which would be my default since I know that code better, ha).

As it is, if there are no local changes but the websocket is doomed for some reason (e.g. service is unhealthy and failing immediately on every connection attempt), it will stay stuck. In other words, this fix only addresses the case where the websocket connection is faulting due to the local messages (e.g. over 1MB).

I am tracking cases where the initial websocket connection never succeeds, which feels more likely to be unrelated to local ops, but maybe I'm wrong. I suppose we can move forward with this change for the sake of the 1MB op problem, but I'll be very curious to see if there are other classes of failures that are not related to the runtime layer.

Contributor commented:

I mentioned somewhere that it will also address #7137, so yes, it addresses a class of issues.
I'd also likely start with the loader layer, but I'm fine with the runtime as well. Historically we see benefits of having less code in the loader layer, as it's the slowest layer in terms of propagating changes.

I think the best outcome is to do it through adapters - i.e., an implementation that is not part of either layer, but can be included (or excluded) as we see fit. For example, it could be a proxy object that implements IRuntime (i.e. sits between the loader and the runtime) or maybe a driver (sits between the real driver and the loader). An example to consider is BlobAggregationStorage, though it still has smallish integration points in the runtime.

I like this direction because the layers continue to have a small number of responsibilities - the fewer, the better!

This is mostly food for thought, for the future :)
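As a rough illustration of the adapter idea above, the sketch below shows a proxy that wraps a runtime and owns the guard, so neither layer gains a new responsibility. The `IMinimalRuntime` interface is a made-up, simplified stand-in for illustration, not the real IRuntime, and the reset-on-progress wiring is hypothetical:

```ts
// Hypothetical, simplified stand-in for IRuntime; the real interface is much larger.
interface IMinimalRuntime {
    setConnectionState(connected: boolean, clientId?: string): void;
    dispose(error?: Error): void;
}

// Sketch of the adapter idea: a proxy between loader and runtime that owns the
// runaway-reconnect guard without adding responsibilities to either layer.
class ReconnectGuardProxy implements IMinimalRuntime {
    private consecutiveDisconnects = 0;

    constructor(
        private readonly inner: IMinimalRuntime,
        private readonly maxConsecutiveDisconnects: number, // negative disables the guard
    ) {}

    /** The host calls this when local ops are acked, i.e. when real progress happens. */
    public notifyProgress(): void {
        this.consecutiveDisconnects = 0;
    }

    public setConnectionState(connected: boolean, clientId?: string): void {
        if (!connected &&
            this.maxConsecutiveDisconnects >= 0 &&
            ++this.consecutiveDisconnects >= this.maxConsecutiveDisconnects) {
            // Too many disconnects without any progress: stop the loop.
            this.inner.dispose(new Error("Too many reconnects with no progress"));
            return;
        }
        this.inner.setConnectionState(connected, clientId);
    }

    public dispose(error?: Error): void {
        this.inner.dispose(error);
    }
}
```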

@markfields (Member) commented Mar 1, 2022

I was also thinking about this again - we should not hesitate to work with the server folks on problems like this. I'm expecting that will be one of the first follow-ups as I finish my analysis of hung websocket connections. This PR is a reasonable client-side mitigation of a class of problems that originate on the client.

Meanwhile, if the server is having trouble getting the websocket off the ground for an otherwise healthy client, they should be in a better position to detect and react than the client would be.

Comment on lines +1484 to +1485
// to better identify false positives, if any. If the rate of this event
// matches `MaxReconnectsWithNoProgress`, we can safely cut down
Member commented:

Nice
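The code comment in the excerpt above refers to a telemetry event emitted before the container is closed, so the event rate can be compared against the configured limit to spot false positives. A rough sketch of that pattern, using a simplified logger shape and a hypothetical event name rather than the actual Fluid telemetry interface:

```ts
// Simplified logger shape for illustration; not the actual Fluid telemetry API.
interface ITelemetryLoggerLike {
    sendTelemetryEvent(event: { eventName: string; attempts: number }): void;
}

// Hypothetical helper: emit the event first, then close, so telemetry can reveal
// false positives before anyone considers lowering the threshold.
function closeWithNoProgressTelemetry(
    logger: ITelemetryLoggerLike,
    attempts: number,
    closeContainer: (reason: string) => void,
): void {
    logger.sendTelemetryEvent({ eventName: "ReconnectsWithNoProgress", attempts });
    closeContainer("Too many consecutive reconnects without processing local ops");
}
```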

@andre4i andre4i merged commit df53d66 into microsoft:main Mar 1, 2022
andre4i added a commit to andre4i/FluidFramework that referenced this pull request Mar 3, 2022
…ress (microsoft#9243)

Part of microsoft#9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
andre4i added a commit to andre4i/FluidFramework that referenced this pull request Mar 3, 2022
…ress (microsoft#9243)

Part of microsoft#9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
andre4i added a commit that referenced this pull request Mar 3, 2022
…ress (#9243) (#9321)

Part of #9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
Labels: area: runtime (Runtime related issues)
4 participants