Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport: fix race sending RPC status that could lead to a panic #1687

Merged
merged 1 commit into from
Nov 30, 2017
Merged

transport: fix race sending RPC status that could lead to a panic #1687

merged 1 commit into from
Nov 30, 2017

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Nov 24, 2017

Fix etcd-io/etcd#8904.

To reproduce, run these tests without this patch to transport/handler_server.go:

go test -v -run TestHandlerTransport_HandleStreams_WriteStatusWrite

=== RUN   TestHandlerTransport_HandleStreams_WriteStatusWrite
--- PASS: TestHandlerTransport_HandleStreams_WriteStatusWrite (0.00s)
panic: send on closed channel

goroutine 8 [running]:
google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc420164230, 0xc42007e410, 0x1012a08, 0x10)
	/Users/gyuho/go/src/google.golang.org/grpc/transport/handler_server.go:170 +0x115
google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc420164230, 0xc420098640, 0xc420016760, 0x3, 0x8, 0xc420016768, 0x4, 0x8, 0xc420016770, 0x0, ...)
	/Users/gyuho/go/src/google.golang.org/grpc/transport/handler_server.go:256 +0xe6
google.golang.org/grpc/transport.TestHandlerTransport_HandleStreams_WriteStatusWrite.func1(0xc4200115c0, 0xc420098640)
	/Users/gyuho/go/src/google.golang.org/grpc/transport/handler_server_test.go:431 +0x2d8
created by google.golang.org/grpc/transport.testHandlerTransportHandleStreams.func1
	/Users/gyuho/go/src/google.golang.org/grpc/transport/handler_server_test.go:399 +0x45
exit status 2
FAIL	google.golang.org/grpc/transport	0.019s

UPDATE(2017-11-30):

WriteStatus can be called concurrently: one by SendMsg,
the other by RecvMsg. Then, closing writes channel
becomes racey without proper locking.

Make transport closing synchronous in such case.

@gyuho gyuho closed this Nov 24, 2017
@gyuho gyuho reopened this Nov 24, 2017
@gyuho
Copy link
Contributor Author

gyuho commented Nov 28, 2017

@menghanl Could please you review this? This is affecting etcd production users. Thanks!

Copy link
Contributor

@menghanl menghanl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Write should not be called after WriteStatus.

Do you have any clue why Write was called after WriteStatus? We should probably look into fixing the problem in the caller.

@@ -253,6 +254,13 @@ func (ht *serverHandlerTransport) writeCommonHeaders(s *Stream) {
}

func (ht *serverHandlerTransport) Write(s *Stream, hdr []byte, data []byte, opts *Options) error {
ht.mu.Lock()
done := ht.streamDone
ht.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If WriteStatus is called after this unlock, before ht.do, the panic will still happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't WriteStatus calls serialized in gRPC side? WriteStatus is called once, and subsequent calls would exit on ht.streamDone==true. Not sure how this triggers panics in WriteStatus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WriteStatus is guaranteed to only execute once, but if this line of code is reached and THEN WriteStatus is called and gets all the way to closing ht.writes before the ht.do() here, then we would end up with the same panic.

Should WriteStatus call ht.Close before closing ht.writes? I think that would also fix the problem. Then the code inside do will avoid writing to the closed channel.

@gyuho
Copy link
Contributor Author

gyuho commented Nov 29, 2017

@menghanl

(ht *serverHandlerTransport) Write call after (ht *serverHandlerTransport) WriteStatus on the same *serverHandlerTransport is possible when:

  1. (ss *serverStream==0x123) SendMsg exits from some kind of error

    grpc-go/stream.go

    Lines 658 to 667 in a62701e

    hdr, data, err := encode(ss.codec, m, ss.cp, outPayload, ss.comp)
    if err != nil {
    return err
    }
    if len(data) > ss.maxSendMessageSize {
    return Errorf(codes.ResourceExhausted, "trying to send message larger than max (%d vs. %d)", len(data), ss.maxSendMessageSize)
    }
    if err := ss.t.Write(ss.s, hdr, data, &transport.Options{Last: false}); err != nil {
    return toRPCErr(err)
    }
    Note: Currently, etcd does not log these errors from (x *watchWatchServer) Send (watch is an etcd stream API) yet, so we are not sure what the error was.
  2. (ss *serverStream==0x123) SendMsg's defer calls ss.t.WriteStatus

    grpc-go/stream.go

    Lines 649 to 652 in a62701e

    if err != nil && err != io.EOF {
    st, _ := status.FromError(toRPCErr(err))
    ss.t.WriteStatus(ss.s, st)
    }
  3. The error from (ss *serverStream==0x123) SendMsg will be returned to etcd server. We expect server be rejecting requests to this server stream, right away. However, before the server stream is closed and before (s *grpc.Server) serveStreams calls transport.ServerTransport.Close, concurrent client writes call (ss *serverStream==0x123) SendMsg.
  4. This time, no error and proceed to ss.t.Write

    grpc-go/stream.go

    Lines 658 to 667 in a62701e

    hdr, data, err := encode(ss.codec, m, ss.cp, outPayload, ss.comp)
    if err != nil {
    return err
    }
    if len(data) > ss.maxSendMessageSize {
    return Errorf(codes.ResourceExhausted, "trying to send message larger than max (%d vs. %d)", len(data), ss.maxSendMessageSize)
    }
    if err := ss.t.Write(ss.s, hdr, data, &transport.Options{Last: false}); err != nil {
    return toRPCErr(err)
    }
  5. At the moment, ss.t.WriteStatus had already been called in gRPC-side but transport.ServerTransport.Close has not been called. Now, ss.t.Write triggers panic: send on closed channel here
    select {
    case <-ht.closedCh:
    return ErrConnClosing
    default:
    select {
    case ht.writes <- fn:

Server stream is done but underlying stream layer in server application is not closed(or stopped) yet. Then server stream keeps receiving client writes, until server application rejects them.

etcd minimizes this small time window, rejecting requests when error is returned from SendMsg. But, there's still a chance that one request slips through (and crashes the whole server). In application layer, there's no easy to stop all client requests atomically, when SendMsg fails.

I think gRPC should handle it by ignoring Writes after WriteStatus.

@menghanl
Copy link
Contributor

Thanks for the investigation!

In step 3, when the previous Send failed with an error, why would Send be called again later? Is there another caller of Send?
(In the code you linked for (x *watchWatchServer) Send, the sendLoop returns when Send fails.)

concurrent client writes call (ss *serverStream==0x123) SendMsg.

I'm not sure if I understand this sentence correctly.
Concurrent SendMsg on the same stream is not safe.

@gyuho
Copy link
Contributor Author

gyuho commented Nov 29, 2017

@menghanl

Ok, I double-checked our code and confirm that we serialize all Writes to the gRPC stream, so etcd's doing the right thing. There's no other Send caller to the same stream. But there's concurrent caller to Recv.

grpc-go/stream.go

Lines 63 to 74 in a62701e

// It's safe to have a goroutine calling SendMsg and another goroutine calling
// recvMsg on the same stream at the same time.
// But it is not safe to call SendMsg on the same stream in different goroutines.
SendMsg(m interface{}) error
// RecvMsg blocks until it receives a message or the stream is
// done. On client side, it returns io.EOF when the stream is done. On
// any other error, it aborts the stream and returns an RPC status. On
// server side, it simply returns the error to the caller.
// It's safe to have a goroutine calling SendMsg and another goroutine calling
// recvMsg on the same stream at the same time.
// But it is not safe to call RecvMsg on the same stream in different goroutines.
RecvMsg(m interface{}) error

Basically, we are doing this:

It's safe to have a goroutine calling SendMsg and another goroutine calling recvMsg on the same stream at the same time. (reference: https://github.com/coreos/etcd/blob/b21180d198f6c87d711980686af388f71c665126/etcdserver/api/v3rpc/watch.go#L131-L144)

Now it's possible that either WriteStatus in SendMsg or RecvMsg gets called first:

grpc-go/stream.go

Lines 649 to 652 in a62701e

if err != nil && err != io.EOF {
st, _ := status.FromError(toRPCErr(err))
ss.t.WriteStatus(ss.s, st)
}

grpc-go/stream.go

Lines 689 to 691 in a62701e

if err != nil && err != io.EOF {
st, _ := status.FromError(toRPCErr(err))
ss.t.WriteStatus(ss.s, st)

Followed by Write in the other goroutine. In this case, WriteStatus and subsequent Write call happen on the same *serverHandlerTransport.

What do you think?

@jpbetz
Copy link
Contributor

jpbetz commented Nov 29, 2017

This appears critical for kubernetes users as well. Once this is available and we can get a new etcd release incorporating it, I'd like to fast track it into kubernetes.

@gyuho gyuho changed the title transport: fix racey stream writes following "WriteStatus" transport: make connection closing atomic in WriteStatus Nov 30, 2017
@gyuho gyuho changed the title transport: make connection closing atomic in WriteStatus transport: make connection closing synchronous in WriteStatus Nov 30, 2017
@gyuho
Copy link
Contributor Author

gyuho commented Nov 30, 2017

@menghanl @dfawley

PTAL; #1687 (comment) is addressed.

Thanks.

WriteStatus can be called concurrently: one by SendMsg,
the other by RecvMsg. Then, closing writes channel
becomes racey without proper locking.

Make transport closing synchronous in such case.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! We still need to do something similar for our server implementation in http2_server.go for the same kind of scenario.

@gyuho
Copy link
Contributor Author

gyuho commented Nov 30, 2017

@dfawley

WriteStatus is guaranteed to only execute once

We still need to do something similar for our server implementation in http2_server.go for the same kind of scenario.

Yeah, I will add similar locks around http2_server.go in a separate PR, unless someone is working on it.

Thanks!

@dfawley
Copy link
Member

dfawley commented Nov 30, 2017

I wouldn't worry about http2_server.go; we are making some bigger changes there that should actually render this problem obsolete. Thanks!

@dfawley dfawley changed the title transport: make connection closing synchronous in WriteStatus transport: fix race sending RPC status that could lead to a panic Nov 30, 2017
@dfawley dfawley merged commit c6b4608 into grpc:master Nov 30, 2017
@gyuho gyuho deleted the fix-transport branch November 30, 2017 22:00
@gyuho
Copy link
Contributor Author

gyuho commented Dec 1, 2017

@menghanl Can this be backported with v1.7.x release, sometime next week?

We've tried this with v1.8.x and tip, but both fail lots of etcd tests--we will eventually update to v1.8 or v1.9+, in the next few weeks.

menghanl pushed a commit that referenced this pull request Dec 4, 2017
)

WriteStatus can be called concurrently: one by SendMsg,
the other by RecvMsg. Then, closing writes channel
becomes racey without proper locking.

Make transport closing synchronous in such case.
menghanl pushed a commit that referenced this pull request Dec 4, 2017
)

WriteStatus can be called concurrently: one by SendMsg,
the other by RecvMsg. Then, closing writes channel
becomes racey without proper locking.

Make transport closing synchronous in such case.
@menghanl
Copy link
Contributor

menghanl commented Dec 4, 2017

@gyuho Just did v1.7.4 and v1.8.1 with this fix.

@gyuho
Copy link
Contributor Author

gyuho commented Dec 4, 2017

Awesome. Thanks for the quick release!

@dfawley dfawley added this to the 1.9 Release milestone Jan 2, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gRPC v1.7.3 transport "panic: send on closed channel" on *serverHandlerTransport
4 participants