1 MB op causes infinite reconnect loop: chunking / op packing is very inefficient, violating 16K op size requirements #7545
Comments
FYI - @curtisman recalled that:
From the network payload + code inspection, I believe what is happening is that 'BatchManager' is boxcarring the chunked messages into a single socket.io message that exceeds an ODSP threshold. @andre4i - Is there a way Whiteboard can enable your feature flag to confirm?
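For illustration, here is a minimal TypeScript sketch (hypothetical types and names, not the actual BatchManager code) of why packing the chunks back into one socket.io message defeats the point of chunking:

```typescript
// Hypothetical illustration only -- not the actual BatchManager code.
// If chunks produced to stay under a per-op limit are then packed back
// into one socket.io message, the wire payload is as large as (or larger
// than) the original op.

interface OpChunk {
    // Assumed shape: each chunk carries a slice of the original op contents.
    chunkId: number;
    totalChunks: number;
    contents: string;
}

function splitIntoChunks(serializedOp: string, chunkSize: number): OpChunk[] {
    const totalChunks = Math.ceil(serializedOp.length / chunkSize);
    const chunks: OpChunk[] = [];
    for (let i = 0; i < totalChunks; i++) {
        chunks.push({
            chunkId: i + 1,
            totalChunks,
            contents: serializedOp.slice(i * chunkSize, (i + 1) * chunkSize),
        });
    }
    return chunks;
}

// "Boxcarring": all chunks go out as one socket.io message, so the payload
// seen by the service is the sum of all chunks plus envelope overhead --
// larger than the un-chunked op, not smaller.
function boxcarPayloadSize(chunks: OpChunk[]): number {
    return JSON.stringify(chunks).length;
}
```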
I've just opened #7599 before reading this issue :) Chunked messages could be submitted as separate socket.io messages, but not if a chunked message is part of a batch - the whole batch payload needs to be sent as one socket.io message to preserve batching semantics (all ops in a batch are sequenced in one go, ensuring no gaps in sequence numbers). This is true today, and we will hit it even more with Andrei's changes making batches implicit at JS-turn granularity.

I think the only real fix here (with current limitations) is to ensure messages are smaller in size, i.e. a JS turn never produces 1 MB+ of content. I do not see a general solution here.

It feels like the solution needs to be similar to group ops in Sequence - a collection of ops grouped together that has a single sequence number per group. If such a thing were supported by the runtime, then we would be able to break any > 1 MB batch of ops into chunks and submit the chunks individually on the socket, not relying on today's server support that is based on sending the whole payload at once. It would not matter that the chunks were sequenced with gaps, as only the sequence number of the last chunk would determine the end of the "op" (and the sequence number for all ops within the batch), and all content ops would be processed at once. If we add to this the ability of a client to request that the ordering service issue N sequence numbers (in a row) for an op (the last chunked op), then clients could assign those sequence numbers themselves to ops within a batch, and thus fully own the batching implementation.

To sum it up, we need some substantial redesign here, and it will likely take months to implement end-to-end, so any existing scenario that hits this limit needs to have a back-up plan.
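As a rough sketch of the grouped-op idea described above (all names hypothetical, assuming the server may sequence chunks with gaps and only the final chunk's sequence number matters):

```typescript
// Rough sketch of the "grouped ops" idea (hypothetical types and names).
// Chunks are submitted as separate socket.io messages and may be sequenced
// with gaps; the group is only processed when the final chunk arrives, and
// the final chunk's sequence number stands in for the whole group.

interface GroupedChunk {
    groupId: string;
    isFinal: boolean;
    sequenceNumber: number;
    contents: string;
}

class GroupedOpAssembler {
    private readonly pending = new Map<string, string[]>();

    // Returns the reassembled group when the final chunk is seen,
    // otherwise undefined (the chunk is buffered).
    public process(chunk: GroupedChunk): { sequenceNumber: number; contents: string } | undefined {
        const parts = this.pending.get(chunk.groupId) ?? [];
        parts.push(chunk.contents);
        if (!chunk.isFinal) {
            this.pending.set(chunk.groupId, parts);
            return undefined;
        }
        this.pending.delete(chunk.groupId);
        return { sequenceNumber: chunk.sequenceNumber, contents: parts.join("") };
    }
}
```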
Some key updates here:
In the short term, we would benefit from a better failure mode (e.g. the client detects the situation and crashes, losing the data: well-communicated data loss with telemetry is better than an infinite reconnect loop with hidden data loss and excessive bandwidth use).
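A minimal sketch of that short-term failure mode, assuming a ~1 MB service limit and hypothetical error/telemetry plumbing:

```typescript
// Hypothetical pre-send guard; the real limit and error/telemetry plumbing
// live in the runtime and driver layers.
const MAX_PAYLOAD_BYTES = 1024 * 1024; // assumed ~1 MB service limit

function submitBatch(
    serializedBatch: string,
    send: (payload: string) => void,
    logger: { sendErrorEvent: (event: { eventName: string; size: number }) => void },
    closeContainer: (error: Error) => void,
): void {
    if (serializedBatch.length > MAX_PAYLOAD_BYTES) {
        // Fail loudly and once: telemetry plus a container close is far easier
        // to diagnose than an endless reconnect loop with hidden data loss.
        logger.sendErrorEvent({ eventName: "BatchTooLarge", size: serializedBatch.length });
        closeContainer(new Error(`Batch of ${serializedBatch.length} bytes exceeds ${MAX_PAYLOAD_BYTES}`));
        return;
    }
    send(serializedBatch);
}
```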
We have a scenario which sends a 1196984-byte op (measured based on websocket message size).
The op is never received back from the server; instead the client disconnects and reconnects in a loop forever, with no backoff. I believe on each reconnect it tries to send the op again, which causes another disconnect, and so the cycle repeats.
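For comparison, a generic reconnect-with-exponential-backoff sketch (not the framework's actual reconnect logic) that would at least bound the bandwidth cost of a loop like this:

```typescript
// Generic exponential-backoff sketch, shown only to illustrate the missing backoff.
async function reconnectWithBackoff(
    connect: () => Promise<void>,
    maxDelayMs = 30_000,
): Promise<void> {
    let delayMs = 500;
    while (true) {
        try {
            await connect();
            return;
        } catch {
            // Wait before retrying, doubling the delay up to a cap.
            await new Promise((resolve) => setTimeout(resolve, delayMs));
            delayMs = Math.min(delayMs * 2, maxDelayMs);
        }
    }
}
```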
I'm not seeing any errors logged when I reproduce this scenario, until eventually some summary-related errors are printed, but I suspect that's more a symptom of the endless reconnects than of the large op.
One of my coworkers received a fluid:telemetry:DeltaManager error (DeltaConnectionFailureToConnect) in this scenario, but I was unable to reproduce that; I suspect it was just one of the reconnects randomly failing.
I think this shows two bugs:
My understanding is that something client-side is supposed to handle large ops, preventing ops over 1 MB from being sent.
I observed ops over 10 KB getting chunked; however, all the chunks are sent in the same websocket message (as strings inside JSON, so with extra escaping), so the actual websocket message size is increased by this chunking, not lowered.
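A small self-contained demonstration of that escaping overhead (assuming chunks are carried as string fields inside a JSON envelope, which is what the doubled escaping suggests):

```typescript
// Demonstrates how embedding an already-serialized op as a string inside
// another JSON envelope adds escaping overhead instead of reducing size.
const opContents = JSON.stringify({ value: "\"quoted\" text and \\ backslashes ".repeat(1000) });

// Chunk-style wrapping: the serialized op becomes a JSON *string*, so every
// quote and backslash in it gets escaped again.
const wrapped = JSON.stringify({ chunkId: 1, totalChunks: 1, contents: opContents });

console.log(opContents.length); // size as a plain op
console.log(wrapped.length);    // larger: original size + one escape character per " and \
```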
I'm not sure what this chunking is supposed to accomplish. It seems like it just adds string-escaping overhead and complicates parsing the op: maybe it's working as intended (and doing something unrelated that I don't understand), or maybe it's the cause of this bug.
I haven't tested with ops much larger than 1 MB (I don't currently have a larger scenario); it's possible this issue only occurs for ops of almost exactly 1 MB. In our case the op would likely be under 1 MB if just looking at its string form, but the extra level of string escaping pushes it over (chunked ops are escaped twice compared to normal ops being escaped once, making about 20% of the failing op backslashes), so it could be a mismatch between client and server over which form to measure.
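As a rough sanity check under the ~20% backslash assumption above: stripping one escape level from the observed 1196984-byte payload would leave roughly 1196984 × 0.8 ≈ 957,600 bytes, which is back under 1 MiB (1,048,576 bytes) - consistent with the op only crossing the limit because of the extra escaping.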
Using fluid-framework 0.47.0 in Edge, with the ODSP backend.