
Stream seems to silently fall over after an indeterminate time, possibly caused by flaky networking #734

Closed
AmandaCameron opened this issue Jun 24, 2016 · 9 comments

Comments

@AmandaCameron

I've written a service that I use to funnel alert notifications around my various infra, and I'm seeing that the listeners for this service's alerts intermittently fall over and stop emitting alerts.

My best guess is that the stream.Recv() call isn't returning an error as it should in this case.

The relevant code:

```go
func dialAlerts(alertChan chan *alerts.Alert) {
    ctx, done := context.WithCancel(context.Background())
    defer done()

    cc, err := grpc.Dial(*alertsHost, grpc.WithInsecure())
    if err != nil {
        logger.Log("msg", "Failed to connect to alerts.",
            "error", err)
        os.Exit(1)
    }

    alertsCli = alerts.NewAlertsClient(cc)

    // Open the server-streaming RPC and subscribe to every tag.
    conn, err := alertsCli.ReceiveAlerts(ctx, &alerts.SubscriptionOpts{
        Tags: []string{
            "__all__",
        },
    })
    if err != nil {
        logger.Log("msg", "Failed to subscribe to alerts.",
            "error", err)
        os.Exit(1)
    }

    // Block on Recv() forever, forwarding each alert; any stream error is fatal.
    for {
        alert, err := conn.Recv()
        if err != nil {
            logger.Log("msg", "Failed to read from alerts.",
                "error", err)

            os.Exit(1)
        }

        alertChan <- alert
    }
}
```
@iamqizhao
Contributor

I am not sure I understand exactly what your question is. Do you mean "conn.Recv()" sometimes falls over with "os.Exit(1)"? If so, this is expected: streaming RPCs are NOT tolerant of network errors.

@AmandaCameron
Author

I mean the opposite: sometimes I'll just stop receiving messages from the stream without being given any kind of error. I suspect it happens during some kind of network fault.


@iamqizhao
Contributor

I am still confused. What do you mean "stop receiving messages"?

@AmandaCameron
Author

Events being published to the stream by the server are never appearing on the client, despite being dispatched from the server side.


@iamqizhao
Contributor

Do you mean "conn.Recv()" is blocking even though you know the server sends something back to the client? This should not happen. Can you send me a reproduction?

@AmandaCameron
Author

I believe the issue is that some kind of network event is invalidating the TCP stream, and the error isn't being detected / propagated through the stack.

I talked about this briefly in the gRPC IRC channel, and @ejona86 pointed me at grpc/grpc-java#1648, which I'm planning to attempt as a fix for now.

@iamqizhao
Contributor

Perhaps that is the reason. But even in that case, conn.Recv() will eventually error out once the TCP user timeout triggers (though that may take up to tens of minutes). One workaround would be to host the health checking service on your server (https://github.com/grpc/grpc-go/blob/master/health/health.go) so that your client can periodically probe the server and cancel the pending streaming RPC if the server becomes unreachable.
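For illustration only, a client-side probe along these lines might look like the sketch below. It assumes the standard grpc_health_v1 API; the probeHealth helper and the 10s/30s intervals are invented for the example.

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// probeHealth calls the health service periodically and cancels the parent
// context of the pending streaming RPC on the first failed probe, which
// unblocks the stalled Recv(). Hypothetical helper; not part of grpc-go.
func probeHealth(cc *grpc.ClientConn, cancelStream context.CancelFunc) {
	client := healthpb.NewHealthClient(cc)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		_, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		cancel()
		if err != nil {
			cancelStream() // server unreachable: abort the hung stream
			return
		}
		time.Sleep(30 * time.Second)
	}
}
```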

If this is the case, do you mind closing this issue and creating a counterpart issue like grpc/grpc-java#1648 to make your requirement crystal clear?

@ejona86
Member

ejona86 commented Jun 27, 2016

@iamqizhao, Recv() will not return an error if there is no write on the connection. This issue effectively boils down to the same problem as #536.

And note that health checking is not a full solution, because the health checking RPCs could go on a different TCP connection (say, after a GOAWAY was received), and then a NAT drops the TCP connection carrying the long-lived RPC due to inactivity but keeps the TCP connection used for health checking.

TCP keepalive actually solves the problem, although we're still avoiding it due to DoS, configuration, and waste concerns.
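For illustration only, a client can also opt into TCP keepalive on its own connections via a custom dialer. This is a sketch under that assumption, not the library's recommended approach; the dialWithTCPKeepalive helper name and 30s interval are invented, and grpc.WithDialer has since been deprecated in favor of grpc.WithContextDialer.

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
)

// dialWithTCPKeepalive is a hypothetical helper: it dials the target with
// grpc.WithDialer so the underlying TCP connection has keepalive enabled,
// letting a dead peer eventually surface as a Recv() error.
func dialWithTCPKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			d := net.Dialer{Timeout: timeout, KeepAlive: 30 * time.Second}
			return d.Dial("tcp", addr)
		}),
	)
}
```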

rhysh added a commit to rhysh/grpc-go that referenced this issue Nov 30, 2016
Remove the add and cancel methods of quotaPool. Their use is not required, and
leads to logical races when used concurrently from multiple goroutines. Rename
the reset method to add.

The typical way that a goroutine claims quota is to call the add method and
then to select on the channel returned by the acquire method. If two
goroutines are both trying to claim quota from a single quotaPool, the second
call to the add method can happen before the first attempt to read from the
channel. When that happens the second goroutine to attempt the channel read
will end up waiting for a very long time, in spite of its efforts to prepare
the channel for reading.

The quotaPool will always behave correctly when any positive quota is on the
channel rather than stored in the struct field. In the opposite case, when
positive quota is only in the struct field and not on the channel, users of
the quotaPool can fail to access available quota. Err on the side of storing
any positive quota in the channel.

This includes a reproducer for grpc#632, which fails on many runs with this
package at v1.0.4. The frequency of the test failures depends on how stressed
the server is, since it's now effectively checking for weird interleavings of
goroutines. It passes reliably with these changes to the transport package.

The behavior described in grpc#734 (an RPC with a streaming response hangs
unexpectedly) matches what I've seen in my programs, and what I see in the
test case added here. If it's a logical flow control bug, this change may well
fix it.

Updates grpc#632
Updates grpc#734
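Not the actual transport code: the following is an illustrative sketch of the channel-backed pattern the commit message describes, assuming callers only ever add positive quota.

```go
package main

import "sync"

// quotaPool keeps any positive balance on a buffered channel rather than in a
// struct field, so a goroutine selecting on acquire() always observes quota
// added concurrently by another goroutine.
type quotaPool struct {
	mu sync.Mutex
	c  chan int
}

func newQuotaPool(q int) *quotaPool {
	qp := &quotaPool{c: make(chan int, 1)}
	if q > 0 {
		qp.c <- q
	}
	return qp
}

// add folds v into whatever is already queued and puts the sum back on the
// channel, waking any goroutine blocked on acquire().
func (qp *quotaPool) add(v int) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	select {
	case q := <-qp.c:
		v += q
	default:
	}
	qp.c <- v
}

// acquire returns the channel callers select on to claim all available quota.
func (qp *quotaPool) acquire() <-chan int {
	return qp.c
}
```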
iamqizhao pushed a commit that referenced this issue Dec 1, 2016
@hsaliak

hsaliak commented May 17, 2017

@dfawley with keepalive support and #536 closed, I am closing this as well.
@AmandaCameron please re-open if you feel that this is still an issue.
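For reference, here is a minimal sketch of enabling the client-side keepalive mentioned above, assuming a grpc-go release that ships the keepalive package; the dialWithKeepalive helper and the specific durations are invented for the example.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive is a hypothetical helper: with these options the client
// pings the server on an idle connection and fails pending RPCs (including a
// blocked Recv()) if the ping is never acknowledged.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after this much idle time
			Timeout:             10 * time.Second, // wait this long for the ping ack
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}
```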

@hsaliak hsaliak closed this as completed May 17, 2017