clientv3: Canceling Watch() doesn't send &pb.WatchCancelRequest #9416
Comments
I think this is a valid use case for multiplexed watch stream requests + gRPC proxy. Right now, only way to cancel mvcc watchers is closing the whole watch stream. What do you think? @heyitsanthony @xiang90 |
@yudai that's how it works today. A server-side mechanism to LRU evict watches in the v3rpc layer might be easier/cover more cases than reworking the client's cancel path. |
Thanks for the comments. Please correct me if I misunderstand something below. I think it's fine and working as expected that a The issue I wanted to mention was that In the v3rpc layer, I therefore think we already have a proper implementation for substream canceling on the server side, and the only missing part is on the client side. The client library simply doesn't send I assume this issue could be critical for gRPC proxy as well. I actually noticed this issue while processing watch requests in our private gRPC proxy. Since clients don't send substream cancel requests, our proxy cannot release its internal resources for watch substreams as long as the watch streams are alive. I think the same thing happens with the gRPC proxy. The mvcc layer also keeps unnecessary In my opinion, it would be better to make the client send cancel requests when users cancel contexts. It's pretty confusing for users that canceling a context does not always cancel the watch. Sending cancel requests on context cancellation would be more straightforward and would comply better with the defined protocol. |
Created reproduction code:

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Start")

	// Keep one watch open so the underlying watch stream stays alive.
	cli.Watch(context.Background(), "/")

	// Create and immediately cancel watches on the same stream, forever.
	// The loop never exits; stop the process manually.
	for {
		ctx, cancel := context.WithCancel(context.Background())
		cli.Watch(ctx, "/")
		cancel()
	}
}

You can confirm |
@yudai @heyitsanthony I will run some benchmarks to measure the overhead of canceling each watcher on closing channel receive, from client-side. Or investigate if it can be easily handled in storage layer. |
We seem to be experiencing a similar issue with our current production setup running on etcd v3.2.24. The number of open watches on the server continues to climb until we hit between ~23k and ~32k open watchers. Once the number of open watches on etcd flatlines, etcd fails to send watch events out to clients and we see a sharp rise in Slow Watchers.
While diving into the code, we noticed that the watch request on the etcd server was not closed until the client got an event from the server after the client context was canceled. As an attempted workaround, we configured the clients to use
Has there been any progress on determining whether there is significant overhead in canceling a watch request when the context gets canceled on the client side?
Code executed when the client context is canceled: Lines 513 to 520 in e06761e
Code executed when an event is received from the server after the context is closed: Lines 489 to 496 in e06761e
|
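One way to watch for the growth described above is to read the watcher gauge from the server's Prometheus-format /metrics output. The sketch below is a minimal, standard-library-only parser for an unlabeled sample. The metric name etcd_debugging_mvcc_watcher_total is an assumption that should be checked against the metrics your etcd version actually exposes, and the sample text in main is invented for illustration, not taken from a real server.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// metricValue scans Prometheus text-format output for an unlabeled sample
// called name and returns its value. It reports false if the sample is
// missing or its value cannot be parsed.
func metricValue(body, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, false
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Invented sample text; a real check would fetch /metrics from the server.
	// The metric name is an assumption, see the paragraph above.
	sample := "# TYPE etcd_debugging_mvcc_watcher_total gauge\n" +
		"etcd_debugging_mvcc_watcher_total 7\n"
	v, ok := metricValue(sample, "etcd_debugging_mvcc_watcher_total")
	fmt.Println(v, ok)
}
```

Sampling this value periodically from a long-running deployment would show whether the watcher count keeps climbing while the number of live client watches stays flat.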
@xiang90 Any chance this is actively being worked on? This has caused us production issues several times before and we'll invest the time to fix it if no one else is working on it. |
Can there be an overhead? Once a watch is closed on a client, we will eventually send a cancel request when the next message arrives. So, there are two cases by cancelling proactively:
I'm working on a PR to do what others above have suggested. |
Currently, watch cancel requests are only sent to the server after a message comes through on a watch where the client has cancelled. This means that cancelled watches that don't receive any new messages are never cancelled; they persist for the lifetime of the client stream. This has negative consequences for locking applications, where a watch may observe a key that never changes again after cancellation, leading to many watches accumulating on the server. By cancelling proactively, in most cases we simply move the cancel request to happen earlier, and additionally we solve the case where the cancel request would never be sent. Fixes etcd-io#9416. Heavy inspiration drawn from the solutions proposed there.
The new fork at github.com/cloudwan/etcd_for_gohan contains a fix for etcd-io/etcd#9416 backported to v3.3.18. We use watches heavily, and with that fix the memory usage of the etcd server no longer grows uncontrollably. Regenerated go.sum while at it.
Currently, watch cancel requests are only sent to the server after a message comes through on a watch where the client has cancelled. This means that cancelled watches that don't receive any new messages are never cancelled; they persist for the lifetime of the client stream. This has negative consequences for locking applications, where a watch may observe a key that never changes again after cancellation, leading to many watches accumulating on the server. By cancelling proactively, in most cases we simply move the cancel request to happen earlier, and additionally we solve the case where the cancel request would never be sent. Fixes etcd-io#9416. Heavy inspiration drawn from the solutions proposed there. This is a port to v3.3.8 of a change made by @jackkleeman in 87aa5a9.
Summary
Canceling a context that was used to invoke Watch() is supposed to stop the subscription to events for that Watch(), meaning that canceling should release resources on the server side.
However, clientv3 doesn't send a &pb.WatchCancelRequest message to the server when a substream is canceled, which leads to a resource leak on the server side until the parent watch stream itself is closed.
Issue
Tested with the master branch (00b8423).
watchGrpcStream serves watch substreams in serveSubstream and also handles context cancellation for them there (using ws.initReq.ctx).
When ws.initReq.ctx is canceled, it eventually sends the ws to w.closingc, and the ws is handled by watchGrpcStream.run(). It then calls w.closeSubstream(ws), and the substream is removed from the internal w.substreams. (The client side is fine; it releases the resources for substreams.)
However, this canceling procedure for substreams doesn't send any cancel message to the server. The server therefore doesn't know at this point that the client has canceled a substream.
When the watched key of a substream is actively updated, an event for the key eventually resolves this inconsistency.
On the other hand, when the key is not updated at all, the server-side resources for the substream are kept until the watch stream itself is canceled or the client is dropped.
This behavior can be an issue when long-running clients dynamically watch and unwatch different keys that are not actively updated after unwatching.
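The mismatch between client and server state can be illustrated with a toy model. This is only a sketch: toyClient and toyServer are hypothetical stand-ins that track watch IDs in maps, not the real clientv3 or server types. It models context cancellation as releasing only the client-side substream, and models the lazy cleanup that happens when an event arrives for a watch the client no longer knows.

```go
package main

import "fmt"

// toyServer is a hypothetical stand-in for the etcd server. It only tracks
// the watch IDs it still holds resources for.
type toyServer struct {
	watchers map[int64]bool
}

// toyClient is a hypothetical stand-in for one client watch stream and its
// substreams.
type toyClient struct {
	srv        *toyServer
	substreams map[int64]bool
	nextID     int64
}

func newToyClient() *toyClient {
	return &toyClient{
		srv:        &toyServer{watchers: map[int64]bool{}},
		substreams: map[int64]bool{},
	}
}

// watch registers a substream on both sides and returns its ID.
func (c *toyClient) watch() int64 {
	id := c.nextID
	c.nextID++
	c.substreams[id] = true
	c.srv.watchers[id] = true
	return id
}

// cancelCtx models context cancellation as described in the issue: the
// client releases its substream, but nothing is sent to the server.
func (c *toyClient) cancelCtx(id int64) {
	delete(c.substreams, id)
}

// deliverEvent models the server pushing an event for a watch ID. If the
// client no longer knows the ID, it answers with a cancel, and only then
// does the server release the watcher.
func (c *toyClient) deliverEvent(id int64) {
	if !c.srv.watchers[id] {
		return // the server has no such watcher, so no event is sent
	}
	if !c.substreams[id] {
		delete(c.srv.watchers, id)
	}
}

func main() {
	c := newToyClient()
	active := c.watch() // key that keeps receiving updates
	idle := c.watch()   // key that is never updated again
	c.cancelCtx(active)
	c.cancelCtx(idle)
	fmt.Println("after cancel:", len(c.substreams), "client,", len(c.srv.watchers), "server")

	c.deliverEvent(active) // only the active key gets cleaned up
	fmt.Println("after event: ", len(c.substreams), "client,", len(c.srv.watchers), "server")
}
```

In this model the watcher for the idle key stays on the server indefinitely, which corresponds to the leak reported here.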
Reproducing
You can reproduce the situation with the following code.
Proposal
I think we can simply send a cancel request when w.closingc receives a closing stream.
https://github.com/coreos/etcd/blob/release-3.2/clientv3/watch.go#L513
Adding something like this?
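As a minimal sketch of the general idea (this is not the snippet from the issue and not the actual clientv3 code), the loop below emits a cancel request for every substream that arrives on a closing channel, in addition to releasing the client-side state. The types cancelRequest and substream, the send callback, and the helper closeAll are hypothetical stand-ins introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// cancelRequest is a hypothetical stand-in for pb.WatchCancelRequest.
type cancelRequest struct {
	WatchID int64
}

// substream is a hypothetical stand-in for the client's per-watch state.
type substream struct {
	id int64
}

// run mimics the shape of a stream loop: for each substream received on
// closingc it releases the client-side entry and also hands a cancel
// request to send, so the server learns about the cancellation right away.
func run(closingc <-chan *substream, substreams map[int64]*substream, send func(cancelRequest)) {
	for ws := range closingc {
		delete(substreams, ws.id)
		send(cancelRequest{WatchID: ws.id})
	}
}

// closeAll registers n substreams, closes each of them through the loop,
// and returns how many remain on the client plus the requests that were sent.
func closeAll(n int) (int, []cancelRequest) {
	substreams := make(map[int64]*substream)
	all := make([]*substream, 0, n)
	for i := 0; i < n; i++ {
		ws := &substream{id: int64(i)}
		substreams[ws.id] = ws
		all = append(all, ws)
	}

	closingc := make(chan *substream)
	var sent []cancelRequest
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		run(closingc, substreams, func(r cancelRequest) { sent = append(sent, r) })
	}()

	for _, ws := range all {
		closingc <- ws
	}
	close(closingc)
	wg.Wait() // after this, only the caller touches substreams and sent
	return len(substreams), sent
}

func main() {
	remaining, sent := closeAll(3)
	fmt.Println("client substreams left:", remaining, "cancel requests sent:", len(sent))
}
```

Each closed substream produces exactly one cancel request, so in this model proactive cancellation sends no more messages than a cancel triggered later by an incoming event would. A real client would presumably also have to skip substreams for which the server has not yet assigned a watch ID; that detail is an assumption and is not modeled here.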