netty: default grace time for RPCs can leak memory #5443
Conversation
Using Long.MAX_VALUE as the delay for Netty's NioEventLoop#schedule can cause (long) deadlines for scheduled tasks to wrap into negative values, which is unspecified behavior. Recent versions of netty are guarding against overflows, but not all versions of grpc-java are using a recent enough netty. When connections are gracefully closed, Netty's Http2ConnectionHandler sets up a timeout task forcing resources to be freed after a grace period. When their deadlines wrap into negative values, a race was observed in production systems (with grpc-java <= 1.12.1) causing Netty to never properly release buffers associated with active streams in the connection being (gracefully) closed.
Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement. After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.
Regards,

CLA signed (as an employee of Netflix).
The fix to Netty was because we filed netty/netty#7970. IIRC that was a Netty regression, which I think was caused by netty/netty#7402 (which impacted Nio, even though the problem being fixed was with Epoll).

When this code was originally written, the -1 thing wouldn't work. It seems zpencer from our team changed Netty in netty/netty#6954 to add support, but gRPC didn't swap over to using it. It looks like it would be beneficial to use -1 so that the ClosingChannelFutureListener will avoid scheduling a task. It seems that no harm should come from that task with any recent version of Netty, however.

It is weird to backport what is essentially a workaround to an unsupported release of gRPC from over a year ago for a Netty bug. We only support the two most recent releases; while we can still help with releases older than that, this is a really old release. Could you give a bit of a glimpse into why it is hard to upgrade from grpc-java v1.10?

But also, it looks like ClosingChannelFutureListener properly cleans up the scheduled timer when the channel closes. It seems like the channel must still be alive; if it was closed then the HTTP/2 code would have released the buffers. So I don't think this change alone will solve your problem. It's unclear to me why the connections are still alive. The only way I'd know for them to be alive is if RPCs were still running on them. Are you doing long-lived RPCs? maxConnectionAgeGrace defaults to infinite, but so does maxConnectionAge. Since you are setting maxConnectionAge, it seems very reasonable to set maxConnectionAgeGrace as well, although that should really only matter if you have long-lived RPCs.
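For reference, a minimal sketch of pairing maxConnectionAge with a bounded maxConnectionAgeGrace on the server builder; the port and durations are illustrative, not values recommended in this thread:

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

// Sketch: when maxConnectionAge is set, bounding maxConnectionAgeGrace as well
// keeps lingering RPCs from holding a draining connection (and its buffers)
// open indefinitely.
final class GracefulServer {
  static Server build(int port) {
    return NettyServerBuilder.forPort(port)
        .maxConnectionAge(30, TimeUnit.MINUTES)      // begin draining connections after 30 minutes
        .maxConnectionAgeGrace(5, TimeUnit.SECONDS)  // then force-close remaining RPCs after 5 seconds
        .build();
  }
}
```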
@@ -60,7 +60,7 @@
static final long MAX_CONNECTION_IDLE_NANOS_DISABLED = Long.MAX_VALUE;
static final long MAX_CONNECTION_AGE_NANOS_DISABLED = Long.MAX_VALUE;
static final long MAX_CONNECTION_AGE_GRACE_NANOS_INFINITE = Long.MAX_VALUE;
I'd much rather you leave all this as-is. Instead, only change secondGoAwayAndClose() to convert from this value to -1.
No strong opinion there. Will do.
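A minimal sketch of the conversion being discussed; the helper name and placement are assumptions for illustration, not the actual grpc-java change:

```java
import java.util.concurrent.TimeUnit;

final class GraceTime {
  // Hypothetical helper: map the Long.MAX_VALUE "infinite" sentinel to -1 before
  // handing the value to Netty. Http2ConnectionHandler#gracefulShutdownTimeoutMillis
  // accepts -1 as "wait for all streams" without scheduling a timeout task.
  static long toGracefulShutdownTimeoutMillis(long graceTimeInNanos) {
    if (graceTimeInNanos == Long.MAX_VALUE) { // MAX_CONNECTION_AGE_GRACE_NANOS_INFINITE
      return -1L;                             // no forced-close timeout task in Netty
    }
    return TimeUnit.NANOSECONDS.toMillis(graceTimeInNanos);
  }
}
```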
@@ -96,6 +96,7 @@
 * the context of the Netty Channel thread.
 */
class NettyServerHandler extends AbstractNettyHandler {
Please avoid full-file formatters; every developer can have slightly different settings and they can constantly "fight" causing lots of useless churn. Please revert these unnecessary changes.
Will do. I'd suggest updating CONTRIBUTING.md with that info; I just followed it and used the recommended Google settings for IntelliJ.
My bad, CONTRIBUTING.md already has it. I missed it somehow:
Don't fix code style and formatting unless you are already changing that line to address an issue. PRs with irrelevant changes won't be merged. If you do want to fix formatting or style, do that in a separate PR.
@ejona86 I made the changes you suggested. PTAL.
Long story short: transitive dependencies. We have tons of other internal libraries stuck at guava 19.x, so we cannot easily upgrade.
Exactly, channels never being closed is the real root cause here. What solved my problem was the workaround of always forcing a reasonable (short) grace time for RPCs, since this seems to be the only way to force these streams to be properly closed when the underlying TCP connection is terminated. This is what I tried to convey in the last two paragraphs of my "Longer report".

If there really is a problem here, it seems dangerous to have an infinite timeout (no timeout) as the default, as it will likely cause memory leaks often. TBH I'm a bit surprised this hasn't been experienced (or noticed) by others yet.
All our RPCs are documented here: https://github.com/Netflix/titus-api-definitions/blob/0.0.1-rc57/src/main/proto/netflix/titus/titus_job_api.proto#L748

None of them are particularly long; they should take at most a few seconds to complete. There are a few server streaming RPCs (…).
Keep it consistent with other timeout configurations.
Ah, that I fully understand. It's useful to know that not all of the old Guava versions are out of circulation yet. If there's something small we can do to help, we may be open to the idea. But I also understand it just sort of takes time.
Thanks for pointing that out. I think I may know what's happening. You may be bitten by half-close cleanup issues: #4202. The client-side fix was in 1.11. It seems the server-side is still not fixed: #2162. (Only one has to be fixed to avoid leaks, but we can't guarantee that the remote side will behave as we would hope.)

Are the clients in a different language, or were all tests using grpc-java 1.10 clients? The easiest way to trigger this is for the server to respond with an error before the client has half-closed. That's easiest to trigger in a streaming RPC.
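To illustrate the trigger described above, here is a hypothetical streaming-handler shape; Req and Resp stand in for generated proto message types, and none of this code is from the linked issues. The server fails the call while the client may still be sending, i.e. before the client has half-closed:

```java
import io.grpc.Status;
import io.grpc.stub.StreamObserver;

// Sketch of the "server errors before the client half-closes" pattern.
final class EarlyErrorObserver<Req, Resp> implements StreamObserver<Req> {
  private final StreamObserver<Resp> responseObserver;

  EarlyErrorObserver(StreamObserver<Resp> responseObserver) {
    this.responseObserver = responseObserver;
  }

  @Override
  public void onNext(Req request) {
    // The server terminates the RPC before the client has finished sending.
    responseObserver.onError(
        Status.INVALID_ARGUMENT.withDescription("rejected").asRuntimeException());
  }

  @Override
  public void onError(Throwable t) {
    // Client cancelled or the transport failed; nothing to clean up in this sketch.
  }

  @Override
  public void onCompleted() {
    responseObserver.onCompleted();
  }
}
```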
The problem is the TCP connection is never being closed, because there are still believed to be RPCs on it :-).
The code looks good. We'll probably want to reword the commit message to make it more clear this is more of an optimization at this point.
Oh nice! The issues you linked seem to be the root cause indeed. Our clients are in varied languages and versions; we have no control over their code. Ideally we'd prefer to protect servers against any and all potentially bad clients.
How can the server know when clients are about to cut their side of the connection?
By TCP connection being closed I mean the actual socket is in a terminal state, because the TCP connection was broken (TCP FINs, timeouts, etc.). I'm not familiar enough with the HTTP/2 spec and Netty internals to know if streams will see an …
SGTM. Feel free to do it as you squash commits for merging. Or let me know if you prefer I amend mine.
I was saying all that from the HTTP/2 stream perspective; TCP connections didn't come into play. Client half close is just that the client finished sending the request.
It's pretty usual in our environment for clients to cut the (server) streaming RPC by just doing …
If the clients die then things should be better off. If the client's kernel sends a TCP RST and it is received by the server, everything is golden; the channel will be cleaned up. If the client machine dies or there's a network breakage or the TCP RST is dropped by the network, the server may get stuck. That is why we have NettyServerBuilder.keepAliveTime. It should be defaulting to 2 hours today, so eventually the server will notice and clean things up.
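A sketch of tightening those knobs so dead clients are noticed sooner than the roughly two-hour default mentioned above; the values are illustrative, not recommendations from this thread:

```java
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

// Sketch: server-side keepalive pings detect connections whose clients
// silently disappeared, so their resources can be cleaned up.
final class KeepaliveServer {
  static Server build(int port, BindableService service) {
    return NettyServerBuilder.forPort(port)
        .addService(service)
        .keepAliveTime(5, TimeUnit.MINUTES)      // ping clients after 5 minutes of inactivity
        .keepAliveTimeout(20, TimeUnit.SECONDS)  // close the connection if the ping is not acked
        .build();
  }
}
```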
Got it. That is definitely not happening in my case then: the memory leak builds up throughout the day, and we see ~40GB of RAM leaked over a period of ~11h, at which point we kill these processes/instances.
What if …
If the keepalive times out, then it kills all RPCs: https://github.com/grpc/grpc-java/blob/v1.19.0/netty/src/main/java/io/grpc/netty/NettyServerHandler.java#L831 . And that would catch ones leaked by the half-close issue.
Hrmm, then I can't explain why the leaked half-closed ones are not being reaped by the ping timeouts. There might be another issue somewhere else. We are using the default …
LGTM
@fabiokung Merged, thanks!
tl;dr

Using Long.MAX_VALUE as the delay for Netty's NioEventLoop#schedule can cause (long) deadlines for scheduled tasks to wrap into negative values, which is unspecified behavior. Recent versions of Netty are guarding against overflows, but not all versions of grpc-java are using a recent enough Netty.

When connections are gracefully closed, Netty's Http2ConnectionHandler sets up a timeout task forcing resources to be freed after a grace period. When deadlines for these tasks wrap into negative values, a race was observed in production systems (with grpc-java <= 1.12.1) causing Netty to never properly release buffers associated with active streams in the connection being (gracefully) closed.
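A standalone sketch of the overflow in question (not Netty code, just the arithmetic): a deadline computed as "now plus Long.MAX_VALUE nanoseconds" silently wraps negative.

```java
public final class DeadlineOverflowDemo {
  public static void main(String[] args) {
    long nowNanos = System.nanoTime();          // positive in the common case
    long delayNanos = Long.MAX_VALUE;           // "infinite" delay passed to schedule()
    long deadlineNanos = nowNanos + delayNanos; // signed overflow, wraps around
    System.out.println("deadline = " + deadlineNanos);
    System.out.println("negative? " + (deadlineNanos < 0)); // true whenever nowNanos > 0
  }
}
```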
Once I explicitly forced a reasonable value (5s) for Http2ConnectionHandler#gracefulShutdownTimeoutMillis (through NettyServerBuilder#maxConnectionAgeGrace), the memory leak went away.

Recent versions of grpc-java will potentially not be impacted by this - or be less impacted - since Netty is now protecting against overflows.

Unfortunately, we are currently stuck with 1.10.x (and potentially 1.12.x in the short term), so it would be great to have the fix backported all the way back to these branches. The fix is very simple, let me know if I can help backporting it.

Longer report
We are observing a (slow) memory leak with grpc-java 1.10.1 and 1.12.1 (which seems to be present in current master as well, under some conditions). Our gRPC servers see some amount of channels being disconnected without a full HTTP/2 handshake (properly closing both sides). We have no control over this as clients can just terminate TCP connections without going through a full ManagedChannel#shutdown(). Our logs will occasionally show: (…)

That should be harmless, but this fact may be relevant later on.
Looking at heapdumps from instances with 5-30GB of memory leaked, all of the extra memory is off-heap (DirectByteBufs). Unfortunately, I cannot share the heapdumps as they have sensitive information.

The leaked DirectByteBufs are retained by grpc's netty PooledUnsafeDirectByteBuf objects.

Looking at what's retaining these buffers ("shortest path to GC roots"), we can see that Http2ConnectionHandler$ClosingChannelFutureListener objects are part of the reference graph.

To me that was a strong indication that buffers are being leaked while connections/channels/streams are being closed. We can also see that buffers are being held in the DefaultHttp2RemoteFlowController$FlowState#pendingWriteQueue.

I then tried to understand why these buffers were not being flushed, and got as far as WeightedFairQueueByteDistributor not being able to write them (possibly due to streamableBytes already being 0 because of the onStreamClosed() callback, or because the underlying window of the connection will never be > 0, since the connection is already closed). The DefaultHttp2Connection$DefaultStream holding all these pendingWrite buffers is in a HALF_CLOSED_REMOTE state, and DefaultHttp2RemoteFlowController$FlowState instances are not writeable.

My read of the Netty code (and there is a high probability I missed something) is that active channels in this state will stay stuck forever, until they are forcedly reaped by DefaultHttp2RemoteFlowController$FlowState#cancel, which only happens when both sides of the stream are closed (HALF_CLOSED is not enough).

The next thing I started to investigate is why these streams were never being forced closed. It turns out Http2ConnectionHandler$ClosingChannelFutureListener#timeoutTask had negative deadline values.

These timeout tasks were apparently never being executed by the NioEventLoop (executor for scheduled tasks in Netty), so streams were never being forced closed.

... which brings us to the fix in this PR, and to my current workaround of always explicitly setting a reasonable value for NettyServerBuilder#maxConnectionAgeGrace. We are using 5s, since we don't have any RPCs that should take longer than that after channels are closed.

Regardless of fixing negative deadlines for scheduled timeoutTasks, from what I can tell, there may still be an issue with timeouts being too large. Memory may still be leaked if streams are never forced closed and grace times for RPC calls are too high (or indefinite, the default).

A proper fix for this seems more complicated, and may need changes in Netty to make sure all active streams are forced closed (both sides) when the underlying TCP connection is gone. At that point there is not much sense in holding onto buffers; the pendingWrites will never be completed.