
File descriptor leak caused by clients prematurely closing connections #327

Closed · milesbxf opened this issue Oct 17, 2018 · 25 comments

@milesbxf

milesbxf commented Oct 17, 2018

Hi! 👋 We've been using the JMX exporter to instrument Cassandra (via the javaagent, version 0.3.1).

We recently had an incident caused by Cassandra running out of file descriptors. We found these had been gradually leaking over time (metric here is node_filefd_allocated from node_exporters on those instances - the FD limit we set for Cassandra is 100k):
[screenshot: node_filefd_allocated graph, 2018-10-17 11:21]

We'd been seeing some issues with Prometheus timing out whilst scraping these nodes, and found that the majority of open FDs were orphaned TCP sockets in CLOSE_WAIT. Thread dumps showed that all 5 JMX exporter threads on these nodes seemed to be stuck writing to the socket:

"pool-1-thread-1" - Thread t@84
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        - locked <2e084219> (a java.lang.Object)
        at sun.net.httpserver.Request$WriteStream.write(Request.java:391)
        - locked <368dd754> (a sun.net.httpserver.Request$WriteStream)
        at sun.net.httpserver.ChunkedOutputStream.writeChunk(ChunkedOutputStream.java:125)
        at sun.net.httpserver.ChunkedOutputStream.write(ChunkedOutputStream.java:87)
        at sun.net.httpserver.PlaceholderOutputStream.write(ExchangeImpl.java:444)
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
        at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:167)
        - locked <fd8ef5> (a java.io.ByteArrayOutputStream)
        at io.prometheus.jmx.shaded.io.prometheus.client.exporter.HTTPServer$HTTPMetricHandler.handle(HTTPServer.java:74)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
        at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
   Locked ownable synchronizers:
        - locked <467e300a> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Putting these two bits of information together gives us this theory:

  1. Prometheus scrapes the node - sends an HTTP request to the JMX exporter
  2. JMX exporter collects metrics, but takes a long time to do so (this is occasionally expected in our case, as our nodes export thousands of JMX metrics)
  3. Prometheus reaches the scrape timeout, and cancels the request with a TCP FIN
  4. JMX exporter finishes collecting metrics, and attempts to write the output to the socket (https://github.com/prometheus/client_java/blob/parent-0.3.0/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L78). The other side of the TCP connection has been closed.
  5. This call blocks forever, and we never reach https://github.com/prometheus/client_java/blob/parent-0.3.0/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L80 which closes the socket

It looks like simpleclient_httpserver doesn't have good semantics around handling closed connections.

We don't have a minimal reproduction of this, but tcpdumps back this up. We're considering forking the jmx_exporter to use simpleclient_jetty instead, but we wondered if anyone else had come across this?
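
If our theory is right, a reproduction would look roughly like the sketch below (untested on our side - the port, pool size, and handler are made up for illustration): a server built on com.sun.net.httpserver with a small fixed worker pool, and a client that sends a request, half-closes the connection, and never reads the response, so the server's blocking write eventually fills the TCP buffers and hangs.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.Executors;

    public class BlockedWriteRepro {
        public static void main(String[] args) throws Exception {
            // Server side: same shape as the exporter - com.sun.net.httpserver with a small fixed worker pool.
            HttpServer server = HttpServer.create(new InetSocketAddress(18080), 0);
            server.setExecutor(Executors.newFixedThreadPool(5));
            server.createContext("/metrics", exchange -> {
                exchange.sendResponseHeaders(200, 0); // 0 => chunked response
                OutputStream body = exchange.getResponseBody();
                byte[] chunk = new byte[64 * 1024];
                while (true) {
                    body.write(chunk); // once the client stops reading, this eventually blocks in write0
                }
            });
            server.start();

            // "Prometheus" side: send a request, half-close the connection, and never read the response.
            Socket scraper = new Socket("localhost", 18080);
            scraper.getOutputStream().write(
                    "GET /metrics HTTP/1.1\r\nHost: localhost\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
            scraper.shutdownOutput(); // FIN, like a cancelled scrape
            Thread.sleep(60_000);     // leave it dangling; the worker thread stays stuck and the FD stays open
        }
    }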

@brian-brazil
Contributor

Hmm, what happens if you use a try/finally for the close?

@milesbxf
Author

Do you mean in client_java/simpleclient_httpserver?

It could look something like

        // t is the HttpExchange; response is the ByteArrayOutputStream holding the scraped metrics.
        // The finally block ensures the exchange is closed even if the write throws.
        try {
            if (shouldUseCompression(t)) {
                t.getResponseHeaders().set("Content-Encoding", "gzip");
                t.sendResponseHeaders(HttpURLConnection.HTTP_OK, 0);
                final GZIPOutputStream os = new GZIPOutputStream(t.getResponseBody());
                response.writeTo(os);
                os.finish();
            } else {
                t.sendResponseHeaders(HttpURLConnection.HTTP_OK, response.size());
                response.writeTo(t.getResponseBody());
            }
        } finally {
            t.close();
        }

but I don't think that would fix this - no exception is thrown; the response.writeTo call just seems to block forever.

I'm not familiar enough with Java socket programming to suggest a better fix, but surely Jetty should be able to better handle the client closing the connection?
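
One mitigation that might be worth trying in the meantime (untested on our side, and the values below are just illustrative): the JDK's built-in com.sun.net.httpserver implementation reads the sun.net.httpserver.maxReqTime and sun.net.httpserver.maxRspTime system properties (values in seconds, both disabled by default) and closes exchanges that exceed them, which in theory should also tear down a connection whose response write is stuck. For the javaagent that would mean adding something like this to the Cassandra JVM options:

    -Dsun.net.httpserver.maxReqTime=60
    -Dsun.net.httpserver.maxRspTime=600

We haven't verified that the internal timer actually unblocks a write that is already stuck in the kernel, though.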

@brian-brazil
Contributor

JMX exporter finishes collecting metrics, and attempts to write the output to the now-closed socket

That's not how TCP works; only one direction of the socket is closed. What exactly is netstat showing for these sockets? It also shouldn't be possible for there to be thousands of them, as there are only 5 worker threads.

@milesbxf
Author

Yes of course - I've clarified in the description, thanks.

I don't have netstat output for the affected nodes, though I'll grab it when the issue reoccurs. We did analyse the leak with lsof, and found thousands of entries like:

java    17678 cassandra  526u     IPv4          148742999        0t0        TCP hostname-a:http-alt->hostname-b:36374 (CLOSE_WAIT)

hostname-a is the hostname of the Cassandra node we ran this on, and hostname-b is one of the Prometheus hosts. All of the connections in CLOSE_WAIT showed a connection to a Prometheus host, and we're exposing the JMX exporter on the http-alt port (8080), so these are definitely connections handled by the JMX exporter.

@brian-brazil
Contributor

I was more interested in the Send-Q column from netstat.

@milesbxf
Author

@brian-brazil just grabbed some output from a node that started being affected again. Here's an example line:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp      200      0 <cassandra IP>:8080       <prometheus IP>:58288     CLOSE_WAIT  73970/java

All of the CLOSE_WAIT connections look almost identical, all with 200 bytes of data in Recv-Q and 0 in Send-Q - so, as I understand it, there are 200 bytes of data received on the socket but never read by the JMX exporter.

@brian-brazil
Contributor

Hmm, that smells like a connection that's in the backlog - that is, it's in the kernel but the application has yet to accept it.

@milesbxf
Author

Yup, it's looking like the workers get stuck writing to a socket whilst the HTTPServer background thread (?) continues to accept connections from Prometheus, but never actually reads or closes them, presumably as all worker threads are blocked. Does that sound about right?
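
In other words, something shaped like this (just a sketch of the failure mode, not client_java's actual code - the pool size of five matches the thread dump, the port and handler are made up):

    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SaturatedWorkers {
        public static void main(String[] args) throws Exception {
            ExecutorService workers = Executors.newFixedThreadPool(5); // all five can end up stuck in a blocked write
            try (ServerSocket listener = new ServerSocket(8080)) {
                while (true) {
                    Socket conn = listener.accept();    // the accept loop keeps pulling connections in...
                    workers.submit(() -> handle(conn)); // ...but the tasks just queue up behind the stuck workers
                }
            }
        }

        static void handle(Socket conn) {
            // If this blocks forever writing a response, the worker never frees up and conn is never
            // read or closed - each queued connection is another leaked FD sitting in CLOSE_WAIT
            // after Prometheus times out and closes its side.
        }
    }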

I think we'll try Jetty (I think it should be able to handle this situation better), will post here with what happens.

@brian-brazil
Contributor

Does that sound about right?

That sounds plausible. Do you know where they're blocking?

@milesbxf
Author

milesbxf commented Oct 25, 2018

Yep, all worker threads look like this:

"pool-1-thread-1" - Thread t@84
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        - locked <2e084219> (a java.lang.Object)
        at sun.net.httpserver.Request$WriteStream.write(Request.java:391)
        - locked <368dd754> (a sun.net.httpserver.Request$WriteStream)
        at sun.net.httpserver.ChunkedOutputStream.writeChunk(ChunkedOutputStream.java:125)
        at sun.net.httpserver.ChunkedOutputStream.write(ChunkedOutputStream.java:87)
        at sun.net.httpserver.PlaceholderOutputStream.write(ExchangeImpl.java:444)
        at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
        at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
        at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
        at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:167)
        - locked <fd8ef5> (a java.io.ByteArrayOutputStream)
        at io.prometheus.jmx.shaded.io.prometheus.client.exporter.HTTPServer$HTTPMetricHandler.handle(HTTPServer.java:74)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
        at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
   Locked ownable synchronizers:
        - locked <467e300a> (a java.util.concurrent.ThreadPoolExecutor$Worker)

so they're blocking on response.writeTo(t.getResponseBody()); - (https://github.com/prometheus/client_java/blob/parent-0.3.0/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L78)

@brian-brazil
Contributor

That's odd, looks like it's stuck in the kernel. Can you get an strace?

Could you also try with the master version of our httpserver library? There's one fix there which has yet to make it out and has the potential to be relevant.

@ekazakas

We ran into a similar issue using v0.4.0 where HTTPServer accepts connections but never responds.

@xoraes

xoraes commented Dec 19, 2018

Any update on this?

@brian-brazil
Contributor

Did you get a chance to try this with a newer version of client_java?

@jaseemabid

@brian-brazil We moved the main JMX exporter out of the Cassandra process to an external HTTP server because we couldn't afford the FD leaks and having to restart Cassandra every once in a while. Now we run 2 copies of the JMX exporter, with the in-process version scraping only the minimal JVM metrics and the external exporter scraping much more detailed Cassandra metrics. Since it's an external process which gets restarted automatically by systemd if it crashes, we haven't really looked into it since. It's not an optimal solution, but it saved us from the problematic DB restarts.

@amruta1989mohite

@brian-brazil any solution on this issue please.

@amruta1989mohite

amruta1989mohite commented Jan 21, 2019

@brian-brazil We moved the main JMX exporter out of the Cassandra process to an external HTTP server because we couldn't afford the FD leaks and having to restart Cassandra every once in a while. Now we run 2 copies of the JMX exporter, with the in-process version scraping only the minimal JVM metrics and the external exporter scraping much more detailed Cassandra metrics. Since it's an external process which gets restarted automatically by systemd if it crashes, we haven't really looked into it since. It's not an optimal solution, but it saved us from the problematic DB restarts.

Hello @jaseemabid, I am facing the same issue right now, and I also want to avoid restarting my DB at a fixed interval. Can you tell me in detail how you fixed it? Any help would be really appreciated.

@jaseemabid

@amruta1989mohite Run 2 copies of this exporter.

  1. Make the in-process exporter scrape only the core JVM metrics; on our infra it completes a scrape in a couple of seconds. This normally doesn't leak FDs, and we haven't had any issues so far.

  2. Run an out-of-process HTTP exporter to scrape everything else you need. This takes much longer depending on your cluster, but since it's a different process, you can safely and automatically restart it once it starts leaking FDs. A rough sketch of this setup is below.
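
Roughly, the split looks like this (illustrative only - the ports, file paths, and whitelist pattern are made up here or specific to our setup, so check the jmx_exporter README for the exact config keys your version supports):

    # In-process javaagent on the Cassandra JVM, e.g.
    #   -javaagent:/path/to/jmx_prometheus_javaagent.jar=8080:/etc/cassandra/jmx_agent.yaml
    # Keep this scrape small and fast: JVM beans only.
    whitelistObjectNames: ["java.lang:*"]

    # Out-of-process exporter, e.g.
    #   java -jar jmx_prometheus_httpserver-<version>-jar-with-dependencies.jar 9404 /etc/cassandra/jmx_http.yaml
    # Connects to Cassandra's JMX port and scrapes everything else;
    # systemd (Restart=always) brings it back if we kill it on FD growth.
    hostPort: localhost:7199
    rules:
      - pattern: ".*"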

@amruta1989mohite

@jaseemabid, thanks for the response, and sorry for bothering you again.
I am quite new to exporters, so would it be possible to get the exporter code from you, or a document describing in detail how you did it?
It would be very helpful. Thanks in advance.

@amruta1989mohite

@jaseemabid Should I wait for your help?

@tusharmakkar08

@brian-brazil I am facing a similar issue when using Python's Prometheus client.

Could you also try with the master version of our httpserver library? There's one fix there which has yet to make it out and has the potential to be relevant.

Can you please share a link to the library?

@brian-brazil
Contributor

Doesn't look like we're getting any more information out of this, and there have been fixes since.

@jaseemabid

@brian-brazil At monzo, we migrated to https://github.com/instaclustr/cassandra-exporter, found similar issues and then wrote our own https://github.com/suhailpatel/seastat.

@brian-brazil
Contributor

That smells like an issue inside Cassandra itself.

@hr-ireh

hr-ireh commented Jan 2, 2021

JMX exporter finishes collecting metrics, and attempts to write the output to the now-closed socket

That's not how TCP works, only one direction of the socket is closed. What exactly is netstat showing for these sockets? It also shouldn't be possible for there to be thousands of them, as there's only 5 worker threads.

All 5 worker threads are blocked, but connections are still accepted, because the thread that accepts connections is different from the threads handling them.

[screenshots attached: jmx_01, jmx_03]
