
Stream seems to silently fall over after an indeterminate time, possibly caused by flaky networking #734

Closed
AmandaCameron opened this issue Jun 24, 2016 · 9 comments

Comments

@AmandaCameron

I've written a service that I use to funnel alert notifications around my various infra, and I'm seeing that the listeners for this service's alerts intermittently fall over and stop emitting alerts.

My best guess is that the stream.Recv() call isn't returning an error as it should in this case.

The relevant code:

```go
func dialAlerts(alertChan chan *alerts.Alert) {
    ctx, done := context.WithCancel(context.Background())
    defer done()

    cc, err := grpc.Dial(*alertsHost, grpc.WithInsecure())
    if err != nil {
        logger.Log("msg", "Failed to connect to alerts.",
            "error", err)
        os.Exit(1)
    }

    alertsCli = alerts.NewAlertsClient(cc)

    // Open the server-streaming RPC and subscribe to every tag.
    conn, err := alertsCli.ReceiveAlerts(ctx, &alerts.SubscriptionOpts{
        Tags: []string{
            "__all__",
        },
    })
    if err != nil {
        logger.Log("msg", "Failed to subscribe to alerts.",
            "error", err)
        os.Exit(1)
    }

    // Block on Recv() forever, forwarding each alert; any stream error is fatal.
    for {
        alert, err := conn.Recv()
        if err != nil {
            logger.Log("msg", "Failed to read from alerts.",
                "error", err)

            os.Exit(1)
        }

        alertChan <- alert
    }
}
```
@iamqizhao
Contributor

I am not sure I understand exactly what your question is. Do you mean "conn.Recv()" sometimes falls over with "os.Exit(1)"? If so, this is expected: streaming RPCs are NOT tolerant of network errors.

@AmandaCameron
Author

I mean the opposite: sometimes I'll just stop receiving messages from the stream without being given any kind of error. I suspect it happens during some kind of network fault.


@iamqizhao
Contributor

I am still confused. What do you mean "stop receiving messages"?

@AmandaCameron
Author

Events being published to the stream by the server are never appearing on the client, despite being dispatched from the server side.


@iamqizhao
Contributor

Do you mean "conn.Recv()" is blocking even though you know the server sends something back to the client? This should not happen. Can you send me a reproduction?

@AmandaCameron
Author

I believe the issue is that some kind of network event is invalidating the TCP stream, and the error isn't being detected / propagated through the stack.

I talked about this briefly in the gRPC IRC channel, and @ejona86 pointed me at grpc/grpc-java#1648, which I'm planning to attempt as a fix for now.

@iamqizhao
Contributor

Perhaps that is the reason. But even in that case, conn.Recv() will eventually error out once the TCP user timeout triggers (though that may take up to tens of minutes). One workaround would be to host the health checking service on your server (https://github.com/grpc/grpc-go/blob/master/health/health.go) so that your client can periodically probe the server and cancel the pending streaming RPC if the server becomes unreachable.
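For illustration only, a client-side probe along these lines might look like the sketch below. It assumes the standard grpc_health_v1 API; the probeHealth helper and the 10s/30s intervals are invented for the example.

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// probeHealth calls the health service periodically and cancels the parent
// context of the pending streaming RPC on the first failed probe, which
// unblocks the stalled Recv(). Hypothetical helper; not part of grpc-go.
func probeHealth(cc *grpc.ClientConn, cancelStream context.CancelFunc) {
	client := healthpb.NewHealthClient(cc)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		_, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		cancel()
		if err != nil {
			cancelStream() // server unreachable: abort the hung stream
			return
		}
		time.Sleep(30 * time.Second)
	}
}
```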

If this is the case, do you mind closing this issue and creating a counterpart issue like grpc/grpc-java#1648 to make your requirement crystal clear?

@ejona86
Member

ejona86 commented Jun 27, 2016

@iamqizhao, Recv() will not return an error if there is no write on the connection. This issue effectively boils down to the same problem as #536.

And note that health checking is not a full solution, because the health checking RPCs could go on a different TCP connection (say, after a GOAWAY was received), and then a NAT drops the TCP connection carrying the long-lived RPC due to inactivity but keeps the TCP connection used for health checking.

TCP keepalive actually solves the problem, although we're still avoiding it due to DoS, configuration, and waste concerns.
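For illustration only, a client can also opt into TCP keepalive on its own connections via a custom dialer. This is a sketch under that assumption, not the library's recommended approach; the dialWithTCPKeepalive helper name and 30s interval are invented, and grpc.WithDialer has since been deprecated in favor of grpc.WithContextDialer.

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
)

// dialWithTCPKeepalive is a hypothetical helper: it dials the target with
// grpc.WithDialer so the underlying TCP connection has keepalive enabled,
// letting a dead peer eventually surface as a Recv() error.
func dialWithTCPKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			d := net.Dialer{Timeout: timeout, KeepAlive: 30 * time.Second}
			return d.Dial("tcp", addr)
		}),
	)
}
```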

rhysh added a commit to rhysh/grpc-go that referenced this issue Nov 30, 2016
Remove the add and cancel methods of quotaPool. Their use is not required, and
leads to logical races when used concurrently from multiple goroutines. Rename
the reset method to add.

The typical way that a goroutine claims quota is to call the add method and
then to select on the channel returned by the acquire method. If two
goroutines are both trying to claim quota from a single quotaPool, the second
call to the add method can happen before the first attempt to read from the
channel. When that happens the second goroutine to attempt the channel read
will end up waiting for a very long time, in spite of its efforts to prepare
the channel for reading.

The quotaPool will always behave correctly when any positive quota is on the
channel rather than stored in the struct field. In the opposite case, when
positive quota is only in the struct field and not on the channel, users of
the quotaPool can fail to access available quota. Err on the side of storing
any positive quota in the channel.

This includes a reproducer for grpc#632, which fails on many runs with this
package at v1.0.4. The frequency of the test failures depends on how stressed
the server is, since it's now effectively checking for weird interleavings of
goroutines. It passes reliably with these changes to the transport package.

The behavior described in grpc#734 (an RPC with a streaming response hangs
unexpectedly) matches what I've seen in my programs, and what I see in the
test case added here. If it's a logical flow control bug, this change may well
fix it.

Updates grpc#632
Updates grpc#734
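Not the actual transport code: the following is an illustrative sketch of the channel-backed pattern the commit message describes, assuming callers only ever add positive quota.

```go
package main

import "sync"

// quotaPool keeps any positive balance on a buffered channel rather than in a
// struct field, so a goroutine selecting on acquire() always observes quota
// added concurrently by another goroutine.
type quotaPool struct {
	mu sync.Mutex
	c  chan int
}

func newQuotaPool(q int) *quotaPool {
	qp := &quotaPool{c: make(chan int, 1)}
	if q > 0 {
		qp.c <- q
	}
	return qp
}

// add folds v into whatever is already queued and puts the sum back on the
// channel, waking any goroutine blocked on acquire().
func (qp *quotaPool) add(v int) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	select {
	case q := <-qp.c:
		v += q
	default:
	}
	qp.c <- v
}

// acquire returns the channel callers select on to claim all available quota.
func (qp *quotaPool) acquire() <-chan int {
	return qp.c
}
```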
iamqizhao pushed a commit that referenced this issue Dec 1, 2016
@hsaliak

hsaliak commented May 17, 2017

@dfawley with keepalive support and #536 closed, I am closing this as well.
@AmandaCameron please re-open if you feel that this is still an issue.
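For reference, here is a minimal sketch of enabling the client-side keepalive mentioned above, assuming a grpc-go release that ships the keepalive package; the dialWithKeepalive helper and the specific durations are invented for the example.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive is a hypothetical helper: with these options the client
// pings the server on an idle connection and fails pending RPCs (including a
// blocked Recv()) if the ping is never acknowledged.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after this much idle time
			Timeout:             10 * time.Second, // wait this long for the ping ack
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
}
```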

@hsaliak hsaliak closed this as completed May 17, 2017