File descriptor leak caused by clients prematurely closing connections #327
Comments
Hmm, what happens if you use a try/finally for the close? |
Do you mean in It could look something like
but I don't think that would fix this - there are no exceptions thrown, just the I'm not familiar enough with Java socket programming to suggest a better fix, but surely Jetty should be able to better handle the client closing the connection? |
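For illustration, a minimal sketch of the kind of try/finally close being discussed, assuming a com.sun.net.httpserver-style handler; the class name and response body here are hypothetical, not the exporter's actual code:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative handler: whatever happens while writing the response, the
// exchange (and with it the underlying streams/socket) is always closed.
class MetricsHandler implements HttpHandler {
    @Override
    public void handle(HttpExchange exchange) throws IOException {
        try {
            byte[] body = "metrics go here".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            OutputStream out = exchange.getResponseBody();
            out.write(body);
            out.flush();
        } finally {
            exchange.close(); // release the descriptor even if the write failed
        }
    }
}
```

As the comment above notes, though, a finally block only helps when the write actually returns or throws; it does nothing for a write that never comes back.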
That's not how TCP works: only one direction of the socket is closed. What exactly is netstat showing for these sockets? It also shouldn't be possible for there to be thousands of them, as there are only 5 worker threads. |
Yes of course - I've clarified in the description, thanks. I don't have
|
I was interested more in the SendQ from netstat. |
@brian-brazil just grabbed some output from a node that started being affected again. Here's an example line:
All of the |
Hmm, that smells like a connection that's in the backlog - that is it's in the kernel but the application has yet to |
Yup, it's looking like the workers get stuck writing to a socket whilst the HTTPServer background thread (?) continues to accept connections from Prometheus, but never actually reads or closes them, presumably as all worker threads are blocked. Does that sound about right? I think we'll try Jetty (I think it should be able to handle this situation better), will post here with what happens. |
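That division of labour matches how the JDK's built-in com.sun.net.httpserver (which simpleclient_httpserver builds on) behaves: one dispatcher keeps accepting connections off the listen backlog while handlers run on whatever executor is configured. A hypothetical sketch, with the port, pool size, and handler chosen only for illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

// Hypothetical sketch, not the exporter's actual code: the acceptor keeps taking
// connections while a small fixed pool runs the handlers. If every pool thread
// blocks mid-response, newly accepted connections queue up unread and unclosed.
public class AcceptorVsWorkers {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(9404), 0); // 0 = system default backlog
        server.setExecutor(Executors.newFixedThreadPool(5)); // mirrors the 5 worker threads discussed here
        server.createContext("/metrics", exchange -> {
            byte[] body = "metrics".getBytes(StandardCharsets.UTF_8);
            try {
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body); // a stalled client can leave this thread stuck here
            } finally {
                exchange.close();
            }
        });
        server.start();
    }
}
```

By default nothing in this arrangement imposes a write timeout, so a handful of stalled scrapes is enough to occupy the whole pool.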
That sounds plausible. Do you know where they're blocking? |
Yep, all worker threads look like this:
so they're blocking on |
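To make that failure mode concrete, here is a hypothetical, self-contained demo (not taken from this thread) of a blocking TCP write that never returns because the peer stops reading:

```java
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo: a blocking TCP write stalls once the peer stops reading
// and the kernel's socket buffers fill up. The "client" here connects but
// never reads, so the writing side eventually hangs inside write().
public class BlockedWriteDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket listener = new ServerSocket(0);
             Socket client = new Socket("localhost", listener.getLocalPort()); // never reads
             Socket serverSide = listener.accept()) {
            OutputStream out = serverSide.getOutputStream();
            byte[] chunk = new byte[64 * 1024];
            long written = 0;
            while (true) {
                out.write(chunk); // blocks indefinitely once both buffers are full
                written += chunk.length;
                System.out.println("written " + written + " bytes so far");
            }
        }
    }
}
```

Running this, the writer typically hangs after a few hundred kilobytes, once both kernel socket buffers are full.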
That's odd, looks like it's stuck in the kernel. Can you get an strace? Could you also try with the master version of our httpserver library? There's one fix there which has yet to make it out and has the potential to be relevant. |
We ran into a similar issue using v0.4.0 where HTTPServer accepts connections but never responds. |
Any update on this? |
Did you get a chance to try this with a newer version of client_java? |
@brian-brazil We moved the main jmx exporter out of the Cassandra process to an external HTTP server because we couldn't afford fd leaks and restarting Cassandra once in a while. Now we run 2 copies of the JMX exporter, with the in-process version scraping only the minimal JVM metrics and the external exporter scraping much more detailed Cassandra metrics. Because it's an external process that gets restarted automatically by systemd if it crashes, we haven't really looked into it since. It's not an optimal solution, but it saved us from the problematic db restarts. |
@brian-brazil any solution for this issue, please? |
Hello @jaseemabid, I am facing the same issue right now and I also want to avoid restarting my DB at a specific interval. Can you tell me in detail how you fixed it? It would be really great if I could get any help with this. |
@amruta1989mohite Run 2 copies of this exporter.
|
@jaseemabid, Thanks for the response and sorry for bothering you again. |
@jaseemabid Should I wait for your help? |
@brian-brazil I am facing a similar issue when using Python's Prometheus client.
Can you please mention the link to the library? |
Doesn't look like we're getting any more information out of this, and there have been fixes since. |
@brian-brazil At monzo, we migrated to https://github.com/instaclustr/cassandra-exporter, found similar issues and then wrote our own https://github.com/suhailpatel/seastat. |
That smells like an issue inside Cassandra itself. |
All 5 worker threads are blocked, but connections are still received because they are accepted on a different thread from the ones that handle them. |
Hi! 👋 We've been using JMX exporter to instrument Cassandra (using the javaagent on version 0.3.1).

We recently had an incident caused by Cassandra running out of file descriptors. We found these had been gradually leaking over time (metric here is node_filefd_allocated from node_exporters on those instances - the FD limit we set for Cassandra is 100k):

We'd been seeing some issues with Prometheus timing out whilst scraping these nodes, and found that the majority of open FDs were orphaned TCP sockets in CLOSE_WAIT. Thread dumps showed that all 5 JMX exporter threads on these nodes seemed to be stuck writing to the socket:

Putting these two bits of information together gives us this theory:
FIN

It looks like simpleclient_httpserver doesn't have good semantics around handling closed connections. We don't have a minimal reproduction of this, but tcpdumps back this up. We're considering forking the jmx_exporter to use simpleclient_jetty instead, but we wondered if anyone else had come across this?
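For readers less familiar with the CLOSE_WAIT state mentioned above, here is a hypothetical, self-contained sketch (not the exporter's code) of how a descriptor ends up stuck there: the peer closes its end and sends a FIN, and the local application simply never calls close().

```java
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration: a TCP socket sits in CLOSE_WAIT when the peer has
// closed its end (sent FIN) but the local application never calls close(), so
// the file descriptor stays allocated indefinitely.
public class CloseWaitDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket listener = new ServerSocket(0);
        Socket client = new Socket("localhost", listener.getLocalPort());
        Socket accepted = listener.accept();

        client.close(); // the "scraper" gives up and closes: a FIN reaches the server side

        // 'accepted' is never closed here, so `netstat -tn` will now show it in
        // CLOSE_WAIT, and its fd counts against the process limit until the JVM exits.
        Thread.sleep(60_000);
        System.out.println(accepted.isClosed()); // still false: nothing ever released it
    }
}
```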