Stream seems to silently fall over after indeterminate time, possibly caused by flaky networking #734
Comments
I am not sure I understand what exactly your question is. Do you mean "conn.Recv()" sometimes falls over with "os.Exit(1)"? If yes, this is expected; streaming RPCs are not tolerant of network errors.
I mean the opposite: sometimes I'll just stop receiving messages from the stream.
I am still confused. What do you mean by "stop receiving messages"?
Events being published to the stream by the server are never appearing on the client.
Do you mean "conn.Recv()" is blocking even though you know the server sends something back to the client? This should not happen. Can you send me a reproduction?
I believe the issue is that there's some kind of network event happening that is causing the TCP stream to be made invalid, and the error isn't being detected / propagated through the stack. I had talked about this briefly in the gRPC IRC channel and @ejona86 pointed me at grpc/grpc-java#1648 -- which I'm planning on attempting as a fix for now.
Perhaps that is the reason. But even in that case, conn.Recv() will eventually error out when the TCP user timeout triggers (though that may take up to tens of minutes). One workaround would be hosting the health checking service on your server (https://github.com/grpc/grpc-go/blob/master/health/health.go) so that your client can periodically probe the server and cancel the pending streaming RPC if the server becomes unreachable. If this is the case, do you mind closing this issue and creating a counterpart issue like grpc/grpc-java#1648 to make your requirement crystal clear?
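A minimal sketch of that workaround, assuming the server registers the standard health service from the package linked above. The probe interval, timeout, and function names here are illustrative choices, not part of any API:

```go
// Sketch of the health-probe workaround: periodically Check() the server
// over the same ClientConn and cancel the pending streaming RPC's context
// if the probe fails, so a blocked Recv() returns an error promptly.
package streamwatch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// watchHealth probes the server every probeInterval and calls cancelStream
// (the CancelFunc of the streaming RPC's context) on the first failure.
func watchHealth(conn *grpc.ClientConn, cancelStream context.CancelFunc) {
	const probeInterval = 10 * time.Second // illustrative choice
	client := healthpb.NewHealthClient(conn)
	for {
		time.Sleep(probeInterval)
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		_, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		cancel()
		if err != nil {
			cancelStream() // unblocks the stalled Recv() with an error
			return
		}
	}
}
```

As the next comment points out, this is not airtight: the health-check RPCs may travel over a different TCP connection than the long-lived stream.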
@iamqizhao, the Recv() will not error if there is no write. This issue effectively boils down to the same as #536. And note that health checking is not a full solution, because the health checking RPCs could go on a different TCP connection (let's say GOAWAY was received) and then a NAT drops the TCP connection with the long-lived RPC due to inactivity but keeps the TCP connection for the health checking. TCP keepalive actually solves the problem, although we're still avoiding it due to DoS, configuration, and waste concerns.
Remove the add and cancel methods of quotaPool. Their use is not required, and leads to logical races when used concurrently from multiple goroutines. Rename the reset method to add.

The typical way that a goroutine claims quota is to call the add method and then to select on the channel returned by the acquire method. If two goroutines are both trying to claim quota from a single quotaPool, the second call to the add method can happen before the first attempt to read from the channel. When that happens, the second goroutine to attempt the channel read will end up waiting for a very long time, in spite of its efforts to prepare the channel for reading.

The quotaPool will always behave correctly when any positive quota is on the channel rather than stored in the struct field. In the opposite case, when positive quota is only in the struct field and not on the channel, users of the quotaPool can fail to access available quota. Err on the side of storing any positive quota in the channel.

This includes a reproducer for #632, which fails on many runs with this package at v1.0.4. The frequency of the test failures depends on how stressed the server is, since it's now effectively checking for weird interleavings of goroutines. It passes reliably with these changes to the transport package.

The behavior described in #734 (an RPC with a streaming response hangs unexpectedly) matches what I've seen in my programs, and what I see in the test case added here. If it's a logical flow control bug, this change may well fix it.

Updates #632
Updates #734
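The commit hinges on keeping available quota visible to concurrent acquirers. A condensed sketch of that pattern (not the actual transport code): any positive quota is parked on a 1-buffered channel, so a goroutine selecting on acquire() can never miss quota that add() has already published.

```go
// Condensed sketch of the pattern the commit describes; not the actual
// transport code. Invariant: any positive quota lives on the 1-buffered
// channel, never only in a struct field, so a concurrent acquirer
// selecting on the channel can always see it.
package quota

import "sync"

type quotaPool struct {
	mu sync.Mutex // serializes add; acquirers only receive from c
	c  chan int   // holds the current positive quota, if any
}

func newQuotaPool(q int) *quotaPool {
	qp := &quotaPool{c: make(chan int, 1)}
	if q > 0 {
		qp.c <- q
	}
	return qp
}

// add publishes v (assumed > 0) units of quota, merging it with any
// quota already parked on the channel.
func (qp *quotaPool) add(v int) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	select {
	case n := <-qp.c:
		v += n // fold in quota no acquirer has claimed yet
	default:
	}
	qp.c <- v // cannot block: channel is empty here and adds are serialized
}

// acquire exposes the channel; callers select on it alongside their
// cancellation or shutdown signals.
func (qp *quotaPool) acquire() <-chan int {
	return qp.c
}
```

A claimer then runs `select { case q := <-qp.acquire(): ... case <-ctx.Done(): ... }`, so a concurrently racing add can no longer strand it the way the removed add/cancel pair could.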
@dfawley with keepalive support and #536 closed, I am closing this as well.
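For reference, a minimal sketch of the client-side keepalive that addresses this class of hang; the target address and timing values below are illustrative:

```go
// Enables HTTP/2 keepalive pings on the client so a silently dead TCP
// path is detected and pending Recv() calls fail instead of hanging.
// The address and durations are illustrative choices.
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after 30s with no activity
			Timeout:             10 * time.Second, // drop the conn if no ack within 10s
			PermitWithoutStream: true,             // keep probing even when idle
		}))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// ... create stubs and open streams on conn as usual ...
}
```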
I've got a service written that I use to funnel alert notifications around my various infra, and I'm seeing that the listeners for this service's alerts are intermittently falling over and not spitting out any more alerts.
Best I can guess is that the stream.Recv() call isn't returning an error as it should be in this case.
The relevant code:
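A reconstructed sketch of the kind of receive loop being described, not the poster's original snippet; the AlertsClient stub, Subscribe RPC, and handleAlert helper are hypothetical placeholders:

```go
// Reconstructed sketch. pb.AlertsClient, Subscribe, pb.SubscribeRequest,
// and handleAlert are hypothetical placeholders standing in for the
// generated client of a server-streaming RPC.
func receiveAlerts(ctx context.Context, client pb.AlertsClient) error {
	stream, err := client.Subscribe(ctx, &pb.SubscribeRequest{})
	if err != nil {
		return err
	}
	for {
		alert, err := stream.Recv()
		if err == io.EOF {
			return nil // server closed the stream cleanly
		}
		if err != nil {
			return err // the error that, per this report, never arrives
		}
		handleAlert(alert)
	}
}
```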