
1 MB op causes infinite reconnect loop: chunking / op packing is very inefficient, violating 16K op size requirements #7545

Closed
Tracked by #7912
CraigMacomber opened this issue Sep 21, 2021 · 6 comments

@CraigMacomber
Contributor

We have a scenario that sends a 1196984-byte op (measured from the websocket message size).

The op is never received back from the server; instead the client disconnects and reconnects in a loop forever, with no backoff. I believe on each reconnect it tries to send the op again, which causes another disconnect and thus repeats the cycle.

I'm not seeing any errors logged when I reproduce this scenario until eventually some summary-related errors are printed, but I suspect that's more a symptom of the endless reconnects than of the large op itself.

One of my coworkers received a fluid:telemetry:DeltaManager error (DeltaConnectionFailureToConnect) in this scenario, but I was unable to reproduce that; I suspect it was just one of the reconnects randomly failing.

I think this shows two bugs:

  1. My understanding is that something client-side is supposed to handle large ops, preventing ops over 1 MB from being sent.
  2. I observed ops over 10 KB getting chunked; however, all the chunks are sent in the same websocket message (as strings inside JSON, so with extra escaping), so the actual websocket message size is increased by this chunking, not lowered.

I'm not sure what this chunking is supposed to accomplish. It seems like it just adds string-escaping overhead and complicates parsing the op: maybe it's working as intended (and is doing something unrelated that I don't understand), or maybe it's the cause of this bug.

I haven't tested with ops much larger than 1 MB (I don't currently have a larger scenario), so it's possible this issue only occurs for ops of almost exactly 1 MB. In our case the op would likely be under 1 MB looking only at its string form, but the extra level of string escaping pushes it over (chunked ops are escaped twice where normal ops are escaped once, making about 20% of the failing op backslashes), so it could be a mismatch between client and server over which form is measured.
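To make the escaping overhead concrete, here is a minimal sketch (not Fluid code; the envelope fields are made up). Stringifying an already-stringified payload is what roughly doubles the quote/backslash count:

```typescript
// Minimal sketch of double-escaping overhead; the envelope fields are made up.
const innerOp = JSON.stringify({
    type: "op",
    contents: { text: '"quoted text" '.repeat(10000) },
});

// A chunk carries the already-serialized op as a plain string field, so the
// outer stringify escapes every quote (and any backslash) a second time.
const chunkMessage = JSON.stringify({ type: "chunkedOp", chunkId: 1, contents: innerOp });

console.log(innerOp.length);      // size if the op were sent directly
console.log(chunkMessage.length); // noticeably larger: '"' -> '\"' and '\' -> '\\'
```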

Using fluid-framework 0.47.0 in Edge, with the ODSP backend.

@ghost ghost added the triage label Sep 21, 2021
@DLehenbauer
Contributor

FYI - @curtisman recalled that:

  • 1 MB is the maximum Kafka message size
  • The feature to upload large ops out-of-band as blobs was removed

@DLehenbauer DLehenbauer self-assigned this Sep 22, 2021
@DLehenbauer
Contributor

From the network payload plus code inspection, I believe what is happening is that 'BatchManager' is boxcarring the chunked messages into a single Socket.io message that exceeds an ODSP threshold.
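A rough sketch of the suspected behavior (this is not the actual BatchManager code; the endpoint, event name, and payloads are illustrative):

```typescript
import { io } from "socket.io-client";

// Illustrative only: endpoint, event name, and payloads are placeholders.
const socket = io("https://example.invalid");
const chunks: string[] = ["<chunk 1 of a large op>", "<chunk 2>", "<chunk 3>"];

// Suspected behavior: the chunks are boxcarred into one emit, so a single
// websocket message carries the sum of all chunk sizes and still exceeds the
// server-side limit that chunking was presumably meant to avoid.
socket.emit("submitOp", chunks);

// What chunking would need to do to actually help: one emit per chunk, so no
// individual websocket message exceeds the threshold.
for (const chunk of chunks) {
    socket.emit("submitOp", chunk);
}
```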

@andre4i - Is there a way Whiteboard can enable your feature flag to confirm?

@vladsud
Contributor

vladsud commented Sep 27, 2021

I had just opened #7599 before reading this issue :)
Yes, 1 MB is a hard limit, and both client and server should behave better (see that issue for more details).

Chunked messages could be submitted as separate socket.io messages, but not if the chunked message is part of a batch: the whole batch payload needs to be sent as one socket.io message to preserve batching semantics (all ops in a batch are sequenced in one go, ensuring no gaps in sequence numbers). This is true today, and we will hit it even more with Andrei's changes making batches implicit at JS-turn granularity.

I think the only real fix here (with current limitations) is to ensure messages are smaller, i.e. a JS turn never produces 1 MB+ of content.
This might be hard to achieve, as a single user action (copy/paste of a large payload) may be over the limit.

I do not see a general solution here. It feels like the solution needs to be similar to group ops in Sequence: a collection of ops grouped together with a single sequence number per group. If such a thing were supported by the runtime, then we would be able to break any > 1 MB batch of ops into chunks and submit the chunks individually on the socket, not relying on today's server support that is based on sending the whole payload at once. It would not matter that the chunks would be sequenced with gaps, as only the sequence number of the last chunk would determine the end of the "op" (and the sequence number for all ops within the batch), and all content ops would be processed at once.

If we add to this the ability of a client to request that the ordering service issue N sequence numbers (in a row) for an op (the last chunked op), then clients could assign those sequence numbers themselves to ops within a batch, and thus fully own the batching implementation.
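A hypothetical shape for such grouped ops (these types do not exist in Fluid; they are only meant to make the proposal concrete):

```typescript
// Hypothetical wire format for the proposal above; not an existing Fluid API.

// Each chunk is submitted as its own socket.io message, so it can be
// sequenced with gaps relative to the other chunks of the same group.
interface GroupChunk {
    groupId: string;   // identifies the group the chunk belongs to
    index: number;     // position of the chunk within the group
    isFinal: boolean;  // marks the last chunk of the group
    contents: string;  // a slice of the serialized batch payload
}

// Processing happens only when the final chunk arrives: its sequence number
// (or a server-granted run of N sequence numbers) is applied to every op
// reassembled from the group, so the whole batch is still processed at once.
interface ReassembledGroup {
    groupSequenceNumber: number;
    ops: unknown[];
}
```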

To sum it up, we need some substantial redesign here, and it will likely take months to implement end-to-end, so any existing scenario that hits that limit needs to have a back-up plan.

@vladsud
Contributor

vladsud commented Sep 29, 2021

Some key updates here:

  1. Per discussion (captured in 1M Kafka message limit #7599), the runtime can easily avoid the 1 MB limit on the server while keeping batching semantics. All it takes is ensuring the final messages (after chunking / stringify) are under the 16K limit, rather than measuring the payload before these operations are applied (see the sketch after this list).
  2. This bug should continue to track the reduction of that overhead.
  3. socket.io also has a 1 MB limit (maxHttpBufferSize): https://socket.io/docs/v4/server-options/#maxhttpbuffersize
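A minimal sketch of what point 1 means in practice, assuming a made-up chunk envelope (this is not the runtime's actual chunking code): size each chunk by the fully serialized message that goes on the wire, not by the raw payload.

```typescript
// Sketch only: size each chunk by what actually goes on the wire
// (envelope + escaping), not by the raw payload length.
const MAX_MESSAGE_SIZE = 16 * 1024; // 16K target per final message

// Serialized size of a chunk message, using a made-up envelope shape.
function messageSize(contents: string): number {
    return JSON.stringify({ type: "chunkedOp", contents }).length;
}

function chunkSerializedOp(serialized: string): string[] {
    const chunks: string[] = [];
    let start = 0;
    while (start < serialized.length) {
        // Start optimistic, then shrink until the *final* message fits.
        let size = Math.min(MAX_MESSAGE_SIZE, serialized.length - start);
        while (size > 1 && messageSize(serialized.slice(start, start + size)) > MAX_MESSAGE_SIZE) {
            size = Math.floor(size / 2);
        }
        chunks.push(serialized.slice(start, start + size));
        start += size;
    }
    return chunks;
}
```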

@vladsud changed the title from "1 MB op causes infinite reconnect loop" to "1 MB op causes infinite reconnect loop: chunking / op packing is very inefficient, violating 16K op size requirements" Sep 29, 2021
@CraigMacomber
Contributor Author

In the short term, we would benefit from a better failure mode (e.g. the client detects the situation and crashes, losing the data: well-communicated data loss with telemetry is better than infinite reconnects with hidden data loss and excessive bandwidth use).
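A minimal sketch of the kind of guard this suggests, with made-up names (this is not an existing Fluid runtime check; the logger shape only mimics Fluid's telemetry style):

```typescript
// Hypothetical guard: names and shapes are illustrative, not Fluid APIs.
const MAX_OP_SIZE = 1024 * 1024; // the 1 MB service limit discussed above

interface MinimalLogger {
    sendErrorEvent(event: { eventName: string; opSize: number }): void;
}

function submitWithSizeGuard(
    submit: (serialized: string) => void,
    logger: MinimalLogger,
    op: unknown,
): void {
    const serialized = JSON.stringify(op);
    if (serialized.length > MAX_OP_SIZE) {
        // Fail fast and loudly: telemetry plus a thrown error surfaces the
        // data loss instead of hiding it behind an endless reconnect loop.
        logger.sendErrorEvent({ eventName: "OpTooLarge", opSize: serialized.length });
        throw new Error(`Op of ${serialized.length} characters exceeds the 1 MB limit`);
    }
    submit(serialized);
}
```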

@andre4i
Contributor

andre4i commented Apr 11, 2022

#9243

@andre4i andre4i closed this as completed Apr 11, 2022