netty: default grace time for RPCs can leak memory #5443
Conversation
Using Long.MAX_VALUE as the delay for Netty's NioEventLoop#schedule can cause (long) deadlines for scheduled tasks to wrap into negative values, which is unspecified behavior. Recent versions of netty are guarding against overflows, but not all versions of grpc-java are using a recent enough netty. When connections are gracefully closed, Netty's Http2ConnectionHandler sets up a timeout task forcing resources to be freed after a grace period. When their deadlines wrap into negative values, a race was observed in production systems (with grpc-java <= 1.12.1) causing Netty to never properly release buffers associated with active streams in the connection being (gracefully) closed.
Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement. After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.
Regards,

CLA signed (as an employee of Netflix).
The fix to Netty was because we filed netty/netty#7970. IIRC that was a Netty regression, which I think was caused by netty/netty#7402 (which impacted Nio, even though the problem being fixed was with Epoll).

When this code was originally written, the -1 thing wouldn't work. It seems zpencer from our team changed Netty in netty/netty#6954 to add support, but gRPC didn't swap over to using it. It looks like it would be beneficial to use -1 so that the ClosingChannelFutureListener will avoid scheduling a task. It seems that no harm should come from that task with any recent version of Netty, however.

It is weird to backport what is essentially a workaround to an unsupported release of gRPC from over a year ago for a Netty bug. We only support the two most recent releases; while we can still help with releases older than that, this is a really old release. Could you give a bit of a glimpse into why it is hard to upgrade from grpc-java v1.10?

But also, it looks like ClosingChannelFutureListener properly cleans up the scheduled timer when the channel closes. It seems like the channel must still be alive; if it was closed then the HTTP/2 code would have released the buffers. So I don't think this change alone will solve your problem. It's unclear to me why the connections are still alive. The only way I'd know for them to be alive is if RPCs were still running on them. Are you doing long-lived RPCs? maxConnectionAgeGrace defaults to infinite, but so does maxConnectionAge. Since you are setting maxConnectionAge, it seems very reasonable to set maxConnectionAgeGrace as well, although that should really only matter if you have long-lived RPCs.
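For reference, a minimal sketch of pairing maxConnectionAge with a bounded maxConnectionAgeGrace on the server builder; the port and durations are illustrative, not values recommended in this thread:

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

// Sketch: when maxConnectionAge is set, bounding maxConnectionAgeGrace as well
// keeps lingering RPCs from holding a draining connection (and its buffers)
// open indefinitely.
final class GracefulServer {
  static Server build(int port) {
    return NettyServerBuilder.forPort(port)
        .maxConnectionAge(30, TimeUnit.MINUTES)      // begin draining connections after 30 minutes
        .maxConnectionAgeGrace(5, TimeUnit.SECONDS)  // then force-close remaining RPCs after 5 seconds
        .build();
  }
}
```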
@@ -60,7 +60,7 @@
static final long MAX_CONNECTION_IDLE_NANOS_DISABLED = Long.MAX_VALUE;
static final long MAX_CONNECTION_AGE_NANOS_DISABLED = Long.MAX_VALUE;
static final long MAX_CONNECTION_AGE_GRACE_NANOS_INFINITE = Long.MAX_VALUE;
I'd much rather you leave all this as-is. Instead, only change secondGoAwayAndClose() to convert from this value to -1.
No strong opinion there. Will do.
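A minimal sketch of the conversion being discussed; the helper name and placement are assumptions for illustration, not the actual grpc-java change:

```java
import java.util.concurrent.TimeUnit;

final class GraceTime {
  // Hypothetical helper: map the Long.MAX_VALUE "infinite" sentinel to -1 before
  // handing the value to Netty. Http2ConnectionHandler#gracefulShutdownTimeoutMillis
  // accepts -1 as "wait for all streams" without scheduling a timeout task.
  static long toGracefulShutdownTimeoutMillis(long graceTimeInNanos) {
    if (graceTimeInNanos == Long.MAX_VALUE) { // MAX_CONNECTION_AGE_GRACE_NANOS_INFINITE
      return -1L;                             // no forced-close timeout task in Netty
    }
    return TimeUnit.NANOSECONDS.toMillis(graceTimeInNanos);
  }
}
```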
@@ -96,6 +96,7 @@
 * the context of the Netty Channel thread.
 */
class NettyServerHandler extends AbstractNettyHandler {
Please avoid full-file formatters; every developer can have slightly different settings and they can constantly "fight" causing lots of useless churn. Please revert these unnecessary changes.
Will do. I'd suggest updating CONTRIBUTING.md with that info; I just followed it and used the recommended Google settings for IntelliJ.
My bad, CONTRIBUTING.md already has it. I missed it somehow:
Don't fix code style and formatting unless you are already changing that line to address an issue. PRs with irrelevant changes won't be merged. If you do want to fix formatting or style, do that in a separate PR.
@ejona86 I made the changes you suggested. PTAL.
Long story short: transitive dependencies. We have tons of other internal libraries stuck at guava 19.x, so we cannot easily upgrade.
Exactly, channels never being closed is the real root cause here. What solved my problem was the workaround of always forcing a reasonable (short) grace time for RPCs, since this seems to be the only way to force these streams to be properly closed when the underlying TCP connection is terminated. This is what I tried to convey in the last two paragraphs of my "Longer report".

If there really is a problem here, it seems dangerous to have an infinite timeout (no timeout) as the default, as it will likely cause memory leaks often. TBH I'm a bit surprised this hasn't been experienced (or noticed) by others yet.
All our RPCs are documented here: https://github.com/Netflix/titus-api-definitions/blob/0.0.1-rc57/src/main/proto/netflix/titus/titus_job_api.proto#L748

None of them are particularly long; they should take at most a few seconds to complete. There are a few server streaming RPCs (…).
Keep it consistent with other timeout configurations.
Ah, that I fully understand. It's useful to know that not all of the old Guava versions are out of circulation yet. If there's something small we can do to help, we may be open to the idea. But I also understand it just sort of takes time.
Thanks for pointing that out. I think I may know what's happening. You may be bitten by half-close cleanup issues: #4202. The client-side fix was in 1.11. It seems the server-side is still not fixed: #2162. (Only one has to be fixed to avoid leaks, but we can't guarantee that the remote side will behave as we would hope.)

Are the clients in a different language, or were all tests using grpc-java 1.10 clients? The easiest way to trigger this is for the server to respond with an error before the client has half-closed. That's easiest to trigger in a streaming RPC.
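To illustrate the trigger described above, here is a hypothetical streaming-handler shape; Req and Resp stand in for generated proto message types, and none of this code is from the linked issues. The server fails the call while the client may still be sending, i.e. before the client has half-closed:

```java
import io.grpc.Status;
import io.grpc.stub.StreamObserver;

// Sketch of the "server errors before the client half-closes" pattern.
final class EarlyErrorObserver<Req, Resp> implements StreamObserver<Req> {
  private final StreamObserver<Resp> responseObserver;

  EarlyErrorObserver(StreamObserver<Resp> responseObserver) {
    this.responseObserver = responseObserver;
  }

  @Override
  public void onNext(Req request) {
    // The server terminates the RPC before the client has finished sending.
    responseObserver.onError(
        Status.INVALID_ARGUMENT.withDescription("rejected").asRuntimeException());
  }

  @Override
  public void onError(Throwable t) {
    // Client cancelled or the transport failed; nothing to clean up in this sketch.
  }

  @Override
  public void onCompleted() {
    responseObserver.onCompleted();
  }
}
```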
The problem is the TCP connection is never being closed, because there are still believed to be RPCs on it :-).
The code looks good. We'll probably want to reword the commit message to make it more clear this is more of an optimization at this point.
Oh nice! The issues you linked seem to be the root cause indeed. Our clients are in varied languages and versions; we have no control over their code. Ideally we'd prefer to protect servers against any and all potentially bad clients.
How can the server know when clients are about to cut their side of the connection?
By TCP connection being closed I mean the actual socket is in a terminal state, because the TCP connection was broken (TCP FINs, timeouts, etc.). I'm not familiar enough with the HTTP/2 spec and Netty internals to know if streams will see an …
SGTM. Feel free to do it as you squash commits for merging. Or let me know if you prefer I amend mine.
I was saying all that from the HTTP/2 stream perspective; TCP connections didn't come into play. Client half close is just that the client finished sending the request.
It's pretty usual in our environment for clients to cut the (server) streaming RPC by just doing …
If the clients die then things should be better off. If the client's kernel sends a TCP RST and it is received by the server, everything is golden; the channel will be cleaned up. If the client machine dies or there's a network breakage or the TCP RST is dropped by the network, the server may get stuck. That is why we have NettyServerBuilder.keepAliveTime. It should be defaulting to 2 hours today, so eventually the server will notice and clean things up.
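A sketch of tightening those knobs so dead clients are noticed sooner than the roughly two-hour default mentioned above; the values are illustrative, not recommendations from this thread:

```java
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

// Sketch: server-side keepalive pings detect connections whose clients
// silently disappeared, so their resources can be cleaned up.
final class KeepaliveServer {
  static Server build(int port, BindableService service) {
    return NettyServerBuilder.forPort(port)
        .addService(service)
        .keepAliveTime(5, TimeUnit.MINUTES)      // ping clients after 5 minutes of inactivity
        .keepAliveTimeout(20, TimeUnit.SECONDS)  // close the connection if the ping is not acked
        .build();
  }
}
```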
Got it. That is definitely not happening in my case then: the memory leak builds up throughout the day, and we see ~40GB of RAM leaked over a period of ~11h, at which point we kill these processes/instances.
What if …
If the keepalive times out, then it kills all RPCs: https://github.com/grpc/grpc-java/blob/v1.19.0/netty/src/main/java/io/grpc/netty/NettyServerHandler.java#L831 . And that would catch ones leaked by the half-close issue.
Hrmm, then I can't explain why the leaked half-closed ones are not being reaped by the ping timeouts. There might be another issue somewhere else. We are using the default …
LGTM
@fabiokung Merged, thanks!
tl;dr

Using Long.MAX_VALUE as the delay for Netty's NioEventLoop#schedule can cause (long) deadlines for scheduled tasks to wrap into negative values, which is unspecified behavior. Recent versions of Netty are guarding against overflows, but not all versions of grpc-java are using a recent enough Netty.

When connections are gracefully closed, Netty's Http2ConnectionHandler sets up a timeout task forcing resources to be freed after a grace period. When deadlines for these tasks wrap into negative values, a race was observed in production systems (with grpc-java <= 1.12.1) causing Netty to never properly release buffers associated with active streams in the connection being (gracefully) closed.
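A standalone sketch of the overflow in question (not Netty code, just the arithmetic): a deadline computed as "now plus Long.MAX_VALUE nanoseconds" silently wraps negative.

```java
public final class DeadlineOverflowDemo {
  public static void main(String[] args) {
    long nowNanos = System.nanoTime();          // positive in the common case
    long delayNanos = Long.MAX_VALUE;           // "infinite" delay passed to schedule()
    long deadlineNanos = nowNanos + delayNanos; // signed overflow, wraps around
    System.out.println("deadline = " + deadlineNanos);
    System.out.println("negative? " + (deadlineNanos < 0)); // true whenever nowNanos > 0
  }
}
```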
Once I explicitly forced a reasonable value (5s) for Http2ConnectionHandler#gracefulShutdownTimeoutMillis (through NettyServerBuilder#maxConnectionAgeGrace), the memory leak went away.

Recent versions of grpc-java will potentially not be impacted by this - or be less impacted - since Netty is now protecting against overflows.

Unfortunately, we are currently stuck with 1.10.x (and potentially 1.12.x in the short term), so it would be great to have the fix backported all the way back to these branches. The fix is very simple, let me know if I can help backporting it.

Longer report
We are observing a (slow) memory leak with grpc-java 1.10.1 and 1.12.1 (which seems to be present in current master as well, under some conditions). Our gRPC servers see some amount of channels being disconnected without a full HTTP/2 handshake (properly closing both sides). We have no control over this as clients can just terminate TCP connections without going through a full ManagedChannel#shutdown(). Our logs will occasionally show: (…)

That should be harmless, but this fact may be relevant later on.
Looking at heapdumps from instances with 5-30GB of memory leaked, all of the extra memory is off-heap (DirectByteBufs). Unfortunately, I cannot share the heapdumps as they have sensitive information.

The leaked DirectByteBufs are retained by grpc's netty PooledUnsafeDirectByteBuf objects.

Looking at what's retaining these buffers ("shortest path to GC roots"), we can see that Http2ConnectionHandler$ClosingChannelFutureListener objects are part of the reference graph.

To me that was a strong indication that buffers are being leaked while connections/channels/streams are being closed. We can also see that buffers are being held in the DefaultHttp2RemoteFlowController$FlowState#pendingWriteQueue.

I then tried to understand why these buffers were not being flushed, and got as far as WeightedFairQueueByteDistributor not being able to write them (possibly due to streamableBytes already being 0 because of the onStreamClosed() callback, or because the underlying window of the connection will never be > 0, since the connection is already closed). The DefaultHttp2Connection$DefaultStream holding all these pendingWrite buffers is in a HALF_CLOSED_REMOTE state, and DefaultHttp2RemoteFlowController$FlowState instances are not writeable.

My read of the Netty code (and there is a high probability I missed something) is that active channels in this state will stay stuck forever, until they are forcedly reaped by DefaultHttp2RemoteFlowController$FlowState#cancel, which only happens when both sides of the stream are closed (HALF_CLOSED is not enough).

The next thing I started to investigate is why these streams were never being forced closed. It turns out Http2ConnectionHandler$ClosingChannelFutureListener#timeoutTask had negative deadline values.

These timeout tasks were apparently never being executed by the NioEventLoop (executor for scheduled tasks in Netty), so streams were never being forced closed.

... which brings us to the fix in this PR, and to my current workaround of always explicitly setting a reasonable value for NettyServerBuilder#maxConnectionAgeGrace. We are using 5s, since we don't have any RPCs that should take longer than that after channels are closed.

Regardless of fixing negative deadlines for scheduled timeoutTasks, from what I can tell, there may still be an issue with timeouts being too large. Memory may still be leaked if streams are never forced closed and grace times for RPC calls are too high (or indefinite, the default).

A proper fix for this seems more complicated, and may need changes in Netty to make sure all active streams are forced closed (both sides) when the underlying TCP connection is gone. At that point there is not much sense in holding onto buffers; the pendingWrites will never be completed.