-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor WebSockets to fix window of duplicate delivery after reconect #123
Conversation
…nect Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #123 +/- ##
===========================================
+ Coverage 99.98% 100.00% +0.01%
===========================================
Files 78 79 +1
Lines 6435 6473 +38
===========================================
+ Hits 6434 6473 +39
+ Misses 1 0 -1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing stuff!
type streamState struct { | ||
wlmCounter int64 | ||
inflight *roundTrip | ||
conns []*webSocketConnection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be useful making this a map with the id being the key as we iterate over this a few times to delete and add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this actually changed away from a map in this PR, because the performance sensitive part is a copy of the list that has to happen on every broadcast. That can be a simple mem-copy now.
The other side of the coin is as you point out, the need to iterate on the lifecycle action of close.
Let me know what you think after considering, as it was a big thinking point as I worked on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - so we copy so that we don't retain a mutex.Lock in case a new connection is added or deleted whilst we are in the middle of a broadcast. Interesting to me that even with a copy()
the channels in each ws connection still work, tested in Go Playground. Happy for this to be a copy instead of a map 😃
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
payload.GetBatchHeader().BatchNumber = ss.wlmCounter | ||
payload.GetBatchHeader().Stream = stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: seems odd to me that it's a Get and we update in place. I guess my java side is twitching a little but it might be fine for go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because Go doesn't have an object model to allow something like class inheritance (which would be available in Java), so when combining with Generics this is the best patterned I've found for mandating embedding of a sub-structure into a parent.
Really GetBatchHeader()
is GetBatchHeaderPtr()
- but that felt unnecessary.
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@@ -0,0 +1,40 @@ | |||
// Copyright © 2023 Kaleido, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the linter doesn't check test files 🙂
c.server.connectionClosed(c) | ||
log.L(c.ctx).Infof("Disconnected") | ||
} | ||
|
||
func (c *webSocketConnection) sender() { | ||
defer c.close() | ||
buildCases := func() []reflect.SelectCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am so happy to see this go! 👋
The
WebSocketChannels
interface has been around a long time, all the way back to the early days of EthConnect.It provides a protocol for:
streams
onto a single WebSocketBut there is a problem with the complex way that the channels are used as a code interface, per this function (which is derived from similar code from EthConnect, FFTM/EVMConnect and FireFly core):
firefly-common/pkg/eventstreams/websockets.go
Lines 78 to 112 in badf065
Because after a disconnect when messages arrive that are candidates for delivery, a
send
wiil be queued into the channel when there are no WebSockets connected to receive it. Then when a WebSocket connects, that websocket will receive the payload/batch queued into the channel for it. However, theAttemptDispatch
will immediately fail inwaitForAck
(without waiting for an ack at all) because the WebSocket ID has changed, even thought the disconnection happened some time ago.This does not affect the resilience of the idempotent delivery interface between FFTM and FireFly, but with the new
eventstreams
interface in microservices it is highly likely to occur and be frustrating to applications consuming the stream.So this PR proposes to fix the problem by simplifying the API interface of the
wsserver
package (and resulting in a re-write of the stream selection and dispatch logic ofwsserver
).The code used to be a complex pull model, where each webscoket connection tracks the streams it's started on, and pulls from a separate channel for each stream. This required uncommon code constructs, like dynamic
select
against an array of potential channels.This PR proposes a much more straight forward model where a
RoundTrip
object is defined and the parent server-level code keeps track of the connections available for each stream - so it can pick a different one for each round-trip. It then pushes that send into a unique send channel for that webscoket. The parent is responsible for making sure each stream can have just one round-trip in-flight at any given time, which is cancelled if the in-bound WebSocket processing it disconnects.After the change, the consumer code that uses the
wsserver
API looks like:firefly-common/pkg/eventstreams/websockets.go
Lines 77 to 95 in 07fafee
The problem being solve is illustrated in the below log snippet.
Additional change - extensibility for JSON based query
Sorry @awrichar, I meant to split this out to a separate PR.
I've added an extra interface to the
ffapi.QueryJSON
interface, to allow extensibility for a use case where some of the fields are names that need to be resolved to ID before the query runs.The interface gives you access to the whole query tree at the level you are at, which in my use case is because you might need to look at (/require) extra
equal: [...]
options at the same level in the tree to scope the name lookup.This extensibility is only added for the complex JSON format of query, because the problem is much simpler in the path + query-field case and I'm not sure there's an obvious way to join the logic (without a big restructure).