
Simple mechanism to catch a runaway container that is not making progress #9243

Merged: 13 commits into microsoft:main on Mar 1, 2022

Conversation

@andre4i (Contributor) commented Feb 25, 2022

Part of #9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.

In a nutshell, we track how many times the runtime attempts to reconnect consecutively without processing any local ops. If we hit the limit, we close the container, as this is a strong indicator that ops are not going through and we are stuck in an endless loop of recovery attempts.

The limit is configurable via a feature gate; setting it to a negative value disables the feature.
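For illustration, here is a minimal sketch of the guard described above. All names (`ReconnectGuard`, `onReconnectAttempt`, etc.) are illustrative stand-ins, not the actual ContainerRuntime implementation from this PR:

```ts
// Minimal sketch of the runaway-reconnect guard; names are illustrative only.
class ReconnectGuard {
    private consecutiveReconnects = 0;

    constructor(
        // Limit taken from the feature gate; a negative value disables the check.
        private readonly maxConsecutiveReconnects: number,
        // Reports whether local ops are still waiting to be sequenced.
        private readonly hasPendingMessages: () => boolean,
        // Closes the container with a reason describing why progress stopped.
        private readonly closeContainer: (reason: string) => void,
    ) {}

    /**
     * Called on every reconnect attempt. Returns false if the container was
     * closed because the limit was reached.
     */
    public onReconnectAttempt(): boolean {
        if (this.maxConsecutiveReconnects < 0 || !this.hasPendingMessages()) {
            // Feature disabled, or no local ops pending: reconnecting is always fine.
            return true;
        }
        this.consecutiveReconnects++;
        if (this.consecutiveReconnects >= this.maxConsecutiveReconnects) {
            // Many reconnects without a single local op being processed is a strong
            // signal that ops are not going through; give up instead of looping.
            this.closeContainer("Too many consecutive reconnects without progress");
            return false;
        }
        return true;
    }

    /** Called whenever a local op is processed; any progress resets the counter. */
    public onLocalOpProcessed(): void {
        this.consecutiveReconnects = 0;
    }
}
```

A real integration would call `onReconnectAttempt` from the connection-state path and `onLocalOpProcessed` whenever a local op is acked, so that any progress resets the counter.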

@andre4i andre4i requested review from a team as code owners February 25, 2022 20:10
@github-actions github-actions bot added the "area: runtime" (Runtime related issues) and "area: tests" (Tests to add, test infrastructure improvements, etc) labels Feb 25, 2022
@github-actions github-actions bot removed the "area: tests" label Feb 25, 2022
@andre4i andre4i requested a review from vladsud February 25, 2022 22:43
@msfluid-bot (Collaborator) commented Feb 25, 2022

@fluid-example/bundle-size-tests: +3.75 KB
| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| aqueduct.js | 387.63 KB | 388.38 KB | +769 Bytes |
| containerRuntime.js | 188.19 KB | 188.94 KB | +769 Bytes |
| loader.js | 159.13 KB | 159.13 KB | No change |
| map.js | 203.27 KB | 204.03 KB | +769 Bytes |
| matrix.js | 297.8 KB | 298.55 KB | +769 Bytes |
| odspDriver.js | 158.9 KB | 158.9 KB | No change |
| odspPrefetchSnapshot.js | 46.3 KB | 46.3 KB | No change |
| sharedString.js | 317.96 KB | 318.71 KB | +769 Bytes |
| Total Size | 1.75 MB | 1.75 MB | +3.75 KB |

Baseline commit: 85bf932

Generated by 🚫 dangerJS against f0ada75

@@ -1024,6 +1031,9 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents>
(this.mc.config.getBoolean(useDataStoreAliasingKey) ?? false) ||
(runtimeOptions.useDataStoreAliasing ?? false);

this.maxConsecutiveReconnects =
this.mc.config.getNumber(maxConsecutiveReconnectsKey) ?? this.defaultMaxConsecutiveReconnects;
Member commented:
Side note -- This config stuff is so nice!
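As a usage illustration of the gate read in the diff above, a host could feed the setting through a config provider. The key string and the provider shape below are assumptions made for this sketch; the actual value behind `maxConsecutiveReconnectsKey` is defined in the runtime source, not shown in this excerpt.

```ts
// Hypothetical host-side settings; the real key string behind
// maxConsecutiveReconnectsKey lives in the runtime source and may differ.
const settings: Record<string, number> = {
    "Fluid.ContainerRuntime.MaxConsecutiveReconnects": 7, // custom limit
    // Setting a negative value instead would disable the check entirely.
};

// Loose shape of a config provider that getNumber() could read from
// (an assumption for this sketch, not the exact Fluid interface).
const configProvider = {
    getRawConfig: (name: string): number | undefined => settings[name],
};
```

If the gate is unset, the `?? this.defaultMaxConsecutiveReconnects` fallback in the diff above applies.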

}

if (!this.pendingStateManager.hasPendingMessages()) {
// If there are no pending messages, we can always reconnect
Member commented:
I'm trying to compare this runtime-driven approach v. a loader-driven approach (which would be my default since I know that code better, ha).

As it is, if there are no local changes but the websocket is doomed for some reason (e.g. service is unhealthy and failing immediately on every connection attempt), it will stay stuck. In other words, this fix only addresses the case where the websocket connection is faulting due to the local messages (e.g. over 1MB).

I am tracking cases where the initial websocket connection never succeeds, which feels more likely to be unrelated to local ops, but maybe I'm wrong. I suppose we can move forward with this change for the sake of the 1MB op problem, but I'll be very curious to see if there are other classes of failures that are not related to the runtime layer.

Contributor commented:

I mentioned somewhere that it will also address #7137, so yes, it addresses a class of issues.
I'd also likely start with the loader layer, but I'm fine with the runtime as well. Historically we see benefits of having less code in the loader layer, as it's the slowest layer in terms of propagating changes.

I think the best outcome is to do it through adapters - i.e., an implementation that is not part of either layer, but can be included (or excluded) as we see fit. For example, it could be a proxy object that implements IRuntime (i.e. sits between the loader and the runtime) or maybe a driver (sits between the real driver and the loader). An example to consider is BlobAggregationStorage, though it still has smallish integration points in the runtime.

I like this direction because the layers continue to have a small number of responsibilities - the fewer, the better!

This is mostly food for thought, for the future :)
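As a rough illustration of the adapter idea above, the sketch below shows a proxy that wraps a runtime and owns the guard, so neither layer gains a new responsibility. The `IMinimalRuntime` interface is a made-up, simplified stand-in for illustration, not the real IRuntime, and the reset-on-progress wiring is hypothetical:

```ts
// Hypothetical, simplified stand-in for IRuntime; the real interface is much larger.
interface IMinimalRuntime {
    setConnectionState(connected: boolean, clientId?: string): void;
    dispose(error?: Error): void;
}

// Sketch of the adapter idea: a proxy between loader and runtime that owns the
// runaway-reconnect guard without adding responsibilities to either layer.
class ReconnectGuardProxy implements IMinimalRuntime {
    private consecutiveDisconnects = 0;

    constructor(
        private readonly inner: IMinimalRuntime,
        private readonly maxConsecutiveDisconnects: number, // negative disables the guard
    ) {}

    /** The host calls this when local ops are acked, i.e. when real progress happens. */
    public notifyProgress(): void {
        this.consecutiveDisconnects = 0;
    }

    public setConnectionState(connected: boolean, clientId?: string): void {
        if (!connected &&
            this.maxConsecutiveDisconnects >= 0 &&
            ++this.consecutiveDisconnects >= this.maxConsecutiveDisconnects) {
            // Too many disconnects without any progress: stop the loop.
            this.inner.dispose(new Error("Too many reconnects with no progress"));
            return;
        }
        this.inner.setConnectionState(connected, clientId);
    }

    public dispose(error?: Error): void {
        this.inner.dispose(error);
    }
}
```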

@markfields (Member) commented Mar 1, 2022

I was also thinking about this again - we should not hesitate to work with the server folks on problems like this. I'm expecting that will be one of the first follow-ups as I finish my analysis of hung websocket connections. This PR is a reasonable client-side mitigation of a class of problems that originate on the client.

Meanwhile, if the server is having trouble getting the websocket off the ground for an otherwise healthy client, they should be in a better position to detect and react than the client would be.

Comment on lines +1484 to +1485
// to better identify false positives, if any. If the rate of this event
// matches `MaxReconnectsWithNoProgress`, we can safely cut down
Member commented:

Nice
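The code comment in the excerpt above refers to a telemetry event emitted before the container is closed, so the event rate can be compared against the configured limit to spot false positives. A rough sketch of that pattern, using a simplified logger shape and a hypothetical event name rather than the actual Fluid telemetry interface:

```ts
// Simplified logger shape for illustration; not the actual Fluid telemetry API.
interface ITelemetryLoggerLike {
    sendTelemetryEvent(event: { eventName: string; attempts: number }): void;
}

// Hypothetical helper: emit the event first, then close, so telemetry can reveal
// false positives before anyone considers lowering the threshold.
function closeWithNoProgressTelemetry(
    logger: ITelemetryLoggerLike,
    attempts: number,
    closeContainer: (reason: string) => void,
): void {
    logger.sendTelemetryEvent({ eventName: "ReconnectsWithNoProgress", attempts });
    closeContainer("Too many consecutive reconnects without processing local ops");
}
```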

@andre4i andre4i merged commit df53d66 into microsoft:main Mar 1, 2022
andre4i added a commit to andre4i/FluidFramework that referenced this pull request Mar 3, 2022
…ress (microsoft#9243)

Part of microsoft#9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
andre4i added a commit to andre4i/FluidFramework that referenced this pull request Mar 3, 2022
…ress (microsoft#9243)

Part of microsoft#9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
andre4i added a commit that referenced this pull request Mar 3, 2022
…ress (#9243) (#9321)

Part of #9023 as this behavior is always observed when we hit the socket.io payload size limit and the container enters an endless reconnect loop.
Labels: area: runtime (Runtime related issues)
4 participants