
1 MB op causes infinite reconnect loop: chunking / op packing is very inefficient, violating 16K op size requirements #7545

Closed
Tracked by #7912
CraigMacomber opened this issue Sep 21, 2021 · 6 comments

@CraigMacomber
Contributor

We have a scenario that sends a 1196984-byte op (measured from the websocket message size).

The op is never received back from the server; instead the client disconnects and reconnects in a loop forever, with no backoff. I believe on each reconnect it tries to send the op again, which causes another disconnect and thus repeats the cycle.

I'm not seeing any errors logged when I reproduce this scenario until eventually some summary-related errors are printed, but I suspect that's more a symptom of the endless reconnects than of the large op itself.

One of my coworkers received a fluid:telemetry:DeltaManager error (DeltaConnectionFailureToConnect) in this scenario, but I was unable to reproduce that; I suspect it was just one of the reconnects randomly failing.

I think this shows two bugs:

  1. My understanding is that something client-side is supposed to handle large ops, preventing ops over 1 MB from being sent.
  2. I observed ops over 10 KB getting chunked; however, all the chunks are sent in the same websocket message (as strings inside JSON, so with extra escaping), so the actual websocket message size is increased by this chunking, not lowered.

I'm not sure what this chunking is supposed to accomplish. It seems like it just adds string-escaping overhead and complicates parsing the op: maybe it's working as intended (and is doing something unrelated that I don't understand), or maybe it's the cause of this bug.

I haven't tested with ops much larger than 1 MB (I don't currently have a larger scenario), so it's possible this issue only occurs for ops of almost exactly 1 MB. In our case the op would likely be under 1 MB looking only at its string form, but the extra level of string escaping pushes it over (chunked ops are escaped twice where normal ops are escaped once, making about 20% of the failing op backslashes), so it could be a mismatch between client and server over which form is measured.
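To make the escaping overhead concrete, here is a minimal sketch (not Fluid code; the envelope fields are made up). Stringifying an already-stringified payload is what roughly doubles the quote/backslash count:

```typescript
// Minimal sketch of double-escaping overhead; the envelope fields are made up.
const innerOp = JSON.stringify({
    type: "op",
    contents: { text: '"quoted text" '.repeat(10000) },
});

// A chunk carries the already-serialized op as a plain string field, so the
// outer stringify escapes every quote (and any backslash) a second time.
const chunkMessage = JSON.stringify({ type: "chunkedOp", chunkId: 1, contents: innerOp });

console.log(innerOp.length);      // size if the op were sent directly
console.log(chunkMessage.length); // noticeably larger: '"' -> '\"' and '\' -> '\\'
```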

Using fluid-framework 0.47.0 in Edge, with the ODSP backend.

@ghost ghost added the triage label Sep 21, 2021
@DLehenbauer
Contributor

FYI - @curtisman recalled that:

  • 1 MB is the maximum Kafka message size
  • The feature to upload large ops out-of-band as blobs was removed

@DLehenbauer DLehenbauer self-assigned this Sep 22, 2021
@DLehenbauer
Contributor

From the network payload plus code inspection, I believe what is happening is that 'BatchManager' is boxcarring the chunked messages into a single Socket.io message that exceeds an ODSP threshold.
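A rough sketch of the suspected behavior (this is not the actual BatchManager code; the endpoint, event name, and payloads are illustrative):

```typescript
import { io } from "socket.io-client";

// Illustrative only: endpoint, event name, and payloads are placeholders.
const socket = io("https://example.invalid");
const chunks: string[] = ["<chunk 1 of a large op>", "<chunk 2>", "<chunk 3>"];

// Suspected behavior: the chunks are boxcarred into one emit, so a single
// websocket message carries the sum of all chunk sizes and still exceeds the
// server-side limit that chunking was presumably meant to avoid.
socket.emit("submitOp", chunks);

// What chunking would need to do to actually help: one emit per chunk, so no
// individual websocket message exceeds the threshold.
for (const chunk of chunks) {
    socket.emit("submitOp", chunk);
}
```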

@andre4i - Is there a way Whiteboard can enable your feature flag to confirm?

@vladsud
Contributor

vladsud commented Sep 27, 2021

I had just opened #7599 before reading this issue :)
Yes, 1 MB is a hard limit, and both client and server should behave better (see that issue for more details).

Chunked messages could be submitted as separate socket.io messages, but not if the chunked message is part of a batch: the whole batch payload needs to be sent as one socket.io message to preserve batching semantics (all ops in a batch are sequenced in one go, ensuring no gaps in sequence numbers). This is true today, and we will hit it even more with Andrei's changes making batches implicit at JS-turn granularity.

I think the only real fix here (with current limitations) is to ensure messages are smaller, i.e. a JS turn never produces 1 MB+ of content.
This might be hard to achieve, as a single user action (copy/paste of a large payload) may be over the limit.

I do not see a general solution here. It feels like the solution needs to be similar to group ops in Sequence: a collection of ops grouped together with a single sequence number per group. If such a thing were supported by the runtime, then we would be able to break any > 1 MB batch of ops into chunks and submit the chunks individually on the socket, not relying on today's server support that is based on sending the whole payload at once. It would not matter that the chunks would be sequenced with gaps, as only the sequence number of the last chunk would determine the end of the "op" (and the sequence number for all ops within the batch), and all content ops would be processed at once.

If we add to this the ability of a client to request that the ordering service issue N sequence numbers (in a row) for an op (the last chunked op), then clients could assign those sequence numbers themselves to ops within a batch, and thus fully own the batching implementation.
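A hypothetical shape for such grouped ops (these types do not exist in Fluid; they are only meant to make the proposal concrete):

```typescript
// Hypothetical wire format for the proposal above; not an existing Fluid API.

// Each chunk is submitted as its own socket.io message, so it can be
// sequenced with gaps relative to the other chunks of the same group.
interface GroupChunk {
    groupId: string;   // identifies the group the chunk belongs to
    index: number;     // position of the chunk within the group
    isFinal: boolean;  // marks the last chunk of the group
    contents: string;  // a slice of the serialized batch payload
}

// Processing happens only when the final chunk arrives: its sequence number
// (or a server-granted run of N sequence numbers) is applied to every op
// reassembled from the group, so the whole batch is still processed at once.
interface ReassembledGroup {
    groupSequenceNumber: number;
    ops: unknown[];
}
```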

To sum it up, we need some substantial redesign here, and it will likely take months to implement end-to-end, so any existing scenario that hits that limit needs to have a back-up plan.

@vladsud
Contributor

vladsud commented Sep 29, 2021

Some key updates here:

  1. Per discussion (captured in 1M Kafka message limit #7599), the runtime can easily avoid the 1 MB limit on the server while keeping batching semantics. All it takes is ensuring the final messages (after chunking / stringify) are under the 16K limit, rather than measuring the payload before these operations are applied (see the sketch after this list).
  2. This bug should continue to track the reduction of that overhead.
  3. socket.io also has a 1 MB limit (maxHttpBufferSize): https://socket.io/docs/v4/server-options/#maxhttpbuffersize
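A minimal sketch of what point 1 means in practice, assuming a made-up chunk envelope (this is not the runtime's actual chunking code): size each chunk by the fully serialized message that goes on the wire, not by the raw payload.

```typescript
// Sketch only: size each chunk by what actually goes on the wire
// (envelope + escaping), not by the raw payload length.
const MAX_MESSAGE_SIZE = 16 * 1024; // 16K target per final message

// Serialized size of a chunk message, using a made-up envelope shape.
function messageSize(contents: string): number {
    return JSON.stringify({ type: "chunkedOp", contents }).length;
}

function chunkSerializedOp(serialized: string): string[] {
    const chunks: string[] = [];
    let start = 0;
    while (start < serialized.length) {
        // Start optimistic, then shrink until the *final* message fits.
        let size = Math.min(MAX_MESSAGE_SIZE, serialized.length - start);
        while (size > 1 && messageSize(serialized.slice(start, start + size)) > MAX_MESSAGE_SIZE) {
            size = Math.floor(size / 2);
        }
        chunks.push(serialized.slice(start, start + size));
        start += size;
    }
    return chunks;
}
```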

@vladsud changed the title from "1 MB op causes infinite reconnect loop" to "1 MB op causes infinite reconnect loop: chunking / op packing is very inefficient, violating 16K op size requirements" Sep 29, 2021
@CraigMacomber
Contributor Author

In the short term, we would benefit from a better failure mode (e.g. the client detects the situation and crashes, losing the data: well-communicated data loss with telemetry is better than infinite reconnects with hidden data loss and excessive bandwidth use).
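A minimal sketch of the kind of guard this suggests, with made-up names (this is not an existing Fluid runtime check; the logger shape only mimics Fluid's telemetry style):

```typescript
// Hypothetical guard: names and shapes are illustrative, not Fluid APIs.
const MAX_OP_SIZE = 1024 * 1024; // the 1 MB service limit discussed above

interface MinimalLogger {
    sendErrorEvent(event: { eventName: string; opSize: number }): void;
}

function submitWithSizeGuard(
    submit: (serialized: string) => void,
    logger: MinimalLogger,
    op: unknown,
): void {
    const serialized = JSON.stringify(op);
    if (serialized.length > MAX_OP_SIZE) {
        // Fail fast and loudly: telemetry plus a thrown error surfaces the
        // data loss instead of hiding it behind an endless reconnect loop.
        logger.sendErrorEvent({ eventName: "OpTooLarge", opSize: serialized.length });
        throw new Error(`Op of ${serialized.length} characters exceeds the 1 MB limit`);
    }
    submit(serialized);
}
```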

@andre4i
Contributor

andre4i commented Apr 11, 2022

#9243

@andre4i andre4i closed this as completed Apr 11, 2022