transport: fix race sending RPC status that could lead to a panic #1687
Conversation
@menghanl Could you please review this? This is affecting etcd production users. Thanks!
I think `Write` should not be called after `WriteStatus`.

Do you have any clue why `Write` was called after `WriteStatus`? We should probably look into fixing the problem in the caller.
transport/handler_server.go (Outdated)

```diff
@@ -253,6 +254,13 @@ func (ht *serverHandlerTransport) writeCommonHeaders(s *Stream) {
 }
 
 func (ht *serverHandlerTransport) Write(s *Stream, hdr []byte, data []byte, opts *Options) error {
+	ht.mu.Lock()
+	done := ht.streamDone
+	ht.mu.Unlock()
```
If `WriteStatus` is called after this unlock, before `ht.do`, the panic will still happen.
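To make that interleaving concrete, here is a minimal, self-contained sketch (toy names, not the actual transport code) that reproduces the same panic deterministically: the done flag is read under the lock, but the send happens after the unlock, so a close that slips in between makes the send blow up.

```go
package main

import "sync"

func main() {
	var (
		mu     sync.Mutex
		done   bool
		writes = make(chan func(), 1)
	)

	// Writer side, analogous to Write: check the flag under the lock, then unlock.
	mu.Lock()
	finished := done
	mu.Unlock()

	// Interleaving point: WriteStatus runs here, marks the stream done and
	// closes the channel before the writer reaches its send below.
	mu.Lock()
	done = true
	mu.Unlock()
	close(writes)

	if !finished {
		// finished is stale by now; this send panics: "send on closed channel".
		writes <- func() {}
	}
}
```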
Aren't `WriteStatus` calls serialized on the gRPC side? `WriteStatus` is called once, and subsequent calls would exit on `ht.streamDone==true`. Not sure how this triggers panics in `WriteStatus`.
`WriteStatus` is guaranteed to only execute once, but if this line of code is reached and THEN `WriteStatus` is called and gets all the way to closing `ht.writes` before the `ht.do()` here, then we would end up with the same panic.

Should `WriteStatus` call `ht.Close` before closing `ht.writes`? I think that would also fix the problem. Then the code inside `do` will avoid writing to the closed channel.
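As a sketch of that suggestion (field and method names here are illustrative, not the actual grpc-go ones): the closed flag is set under the lock before the channel is closed, and the writer re-checks the flag and performs the send inside the same critical section, so the close can no longer slip in between the check and the send.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fakeTransport is a toy stand-in for the handler transport.
type fakeTransport struct {
	mu     sync.Mutex
	closed bool
	writes chan func()
}

// do re-checks the closed flag and performs the send inside one critical
// section, so writeStatus cannot close the channel between the check and the send.
func (t *fakeTransport) do(fn func()) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.closed {
		return errors.New("transport closed")
	}
	t.writes <- fn
	return nil
}

// writeStatus marks the transport closed first, then closes the channel;
// only the first caller performs the close.
func (t *fakeTransport) writeStatus() {
	t.mu.Lock()
	if t.closed {
		t.mu.Unlock()
		return
	}
	t.closed = true
	t.mu.Unlock()
	close(t.writes)
}

func main() {
	t := &fakeTransport{writes: make(chan func(), 8)}

	consumed := make(chan struct{})
	go func() {
		for fn := range t.writes {
			fn()
		}
		close(consumed)
	}()

	_ = t.do(func() { fmt.Println("write delivered") })
	t.writeStatus()
	fmt.Println("second write rejected:", t.do(func() {}))
	<-consumed
}
```

Note that the send now happens while holding the mutex, which assumes a consumer is always draining `writes`; this is only meant to show the ordering, not the exact shape of the real fix.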
The server stream is done, but the underlying stream layer in the server application is not closed (or stopped) yet. The server stream then keeps receiving client writes until the server application rejects them. etcd minimizes this small time window, rejecting requests when an error is returned from … I think gRPC should handle it by ignoring …
Thanks for the investigation! In step 3, when the previous … I'm not sure if I understand this sentence correctly.
OK, I double-checked our code and confirm that we serialize all … (lines 63 to 74 in a62701e).

Basically, we are doing this: …

Now it's possible that either … (lines 649 to 652 in a62701e) or … (lines 689 to 691 in a62701e), followed by … What do you think?
This appears critical for Kubernetes users as well. Once this is available and we can get a new etcd release incorporating it, I'd like to fast-track it into Kubernetes.

PTAL; #1687 (comment) is addressed. Thanks.
WriteStatus can be called concurrently: one by SendMsg, the other by RecvMsg. Then, closing the writes channel becomes racy without proper locking. Make transport closing synchronous in that case.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
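To make the scenario in the commit message concrete, here is a toy sketch (illustrative names, not grpc-go internals): two goroutines, standing in for the `SendMsg` and `RecvMsg` paths, both try to finish the stream. Without the done guard, the second `close(writes)` would panic with "close of closed channel"; with it, only the first caller performs the close.

```go
package main

import (
	"fmt"
	"sync"
)

// stream is a toy stand-in for the server-side stream state.
type stream struct {
	mu     sync.Mutex
	done   bool
	writes chan struct{}
}

// finish plays the role of WriteStatus: the first call wins, later calls return early.
func (s *stream) finish(caller string) {
	s.mu.Lock()
	if s.done {
		s.mu.Unlock()
		fmt.Println(caller, "skipped: status already sent")
		return
	}
	s.done = true
	s.mu.Unlock()
	close(s.writes)
	fmt.Println(caller, "closed the writes channel")
}

func main() {
	s := &stream{writes: make(chan struct{})}

	var wg sync.WaitGroup
	for _, caller := range []string{"SendMsg path", "RecvMsg path"} {
		wg.Add(1)
		go func(c string) {
			defer wg.Done()
			s.finish(c)
		}(caller)
	}
	wg.Wait()
}
```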
Thanks for the fix! We still need to do something similar for our server implementation in http2_server.go for the same kind of scenario.
Yeah, I will add similar locks around … Thanks!
I wouldn't worry about …
@menghanl Can this be backported to a v1.7.x release, sometime next week? We've tried this with v1.8.x and tip, but both fail lots of etcd tests; we will eventually update to v1.8 or v1.9+ in the next few weeks.
@gyuho Just did …
Awesome. Thanks for the quick release! |
Fix etcd-io/etcd#8904.
To reproduce, run these tests without this patch to `transport/handler_server.go`: …

UPDATE (2017-11-30):

`WriteStatus` can be called concurrently: one by `SendMsg`, the other by `RecvMsg`. Then, closing the `writes` channel becomes racy without proper locking. Make transport closing synchronous in such a case.