Fix TLS Race Connecting to InfluxDB #5504
Conversation
Rather than leaving stale connections around, we tried to poll for data coming in from InfluxDB and time out if it didn't respond in a timely manner. This introduced a race: the timeout triggers, a context switch occurs while data is actually available, and the TlsStream spins trying to asynchronously notify that data is available, but it never gets read. Not only does this use up 100% of a core, it also slowly starves the system of handler threads, at which point metrics stop being delivered. This basically removes the poll and timeout; any TLS socket errors should be detected by TCP keep-alives. Fixes Icinga#5460 Icinga#5469
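For illustration, here is a minimal sketch of the poll-with-timeout pattern the PR removes, with comments marking where the race opens up. The names (`ReadWithTimeout`, `fd`) are hypothetical stand-ins, not the actual Icinga 2 code:

```cpp
#include <poll.h>
#include <unistd.h>
#include <cerrno>
#include <sys/types.h>

// Hypothetical sketch of the poll-then-read pattern this PR removes.
// 'fd' is assumed to be a connected socket to InfluxDB.
ssize_t ReadWithTimeout(int fd, void *buf, size_t len, int timeoutMs)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeoutMs);

    if (rc == 0) {
        // Timeout fired. Race window: the response can land right here,
        // after the timeout but before the caller tears the stream down.
        // Layered under a TlsStream, the "readable" notification then
        // fires forever for data nobody will read, pinning a core and
        // slowly starving the handler thread pool.
        errno = ETIMEDOUT;
        return -1;
    }

    if (rc < 0)
        return -1; // poll() itself failed

    return read(fd, buf, len);
}
```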
And, dare I say, far more likely now I've got my early-morning 4th-dimension hat on: InfluxDB responds before we issue the poll, we sit there waiting for an event that has already happened, time out, and get the same result. Testing wise, the patch has been running in staging for ~18 hours now with no signs of recurrence 👍 @dnsmichi if you would assign a reviewer
@spjmurray thanks for the analysis and patch. Is there any chance that we can reliably keep the socket_timeout option, or do we need to just remove it (this is a breaking config change)? You've mentioned TCP keep-alive ... is there any chance we could tell the InfluxDB HTTP API to close the connection if it is hanging?
The polling was in case InfluxDB didn't respond in the first place, so trying to tell it to hang up is probably not going to fly. I think what I originally intended was some way for …

As regards the breaking option, we could pass it down to the socket event engine to be used in epoll_wait (see the sketch below). However, given it's used everywhere, I'd say no, because I don't want to break everything 😉
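To make the rejected idea concrete, here is a sketch of how a timeout would be plumbed into `epoll_wait`, and why it would be process-wide. The function and variable names are illustrative, not Icinga 2's actual event engine:

```cpp
#include <sys/epoll.h>

// Illustrative sketch only: a socket event engine's wait loop. epoll_wait()
// already accepts a timeout in milliseconds (-1 means block forever), so a
// socket_timeout could in principle be passed down to it.
void EventLoopIteration(int epollFd, int timeoutMs)
{
    struct epoll_event events[64];

    // Returns 0 on timeout, otherwise the number of ready descriptors.
    // Because this one loop services every socket in the process, a short
    // per-connection timeout here would throttle all of them -- hence
    // "I don't want to break everything".
    int n = epoll_wait(epollFd, events, 64, timeoutMs);

    for (int i = 0; i < n; i++) {
        // ... dispatch events[i].data to the owning stream's handler
    }
}
```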
Thanks, I have another idea - log a warning in 2.7.1 for socket_timeout. In terms of hanging InfluxDB connections, we can see that with the feature's work queue item count increasing and whatnot.
I'll add a specific commit to support/2.7 after I've cherry-picked your PR, thanks again! |
👍
Rather than leaving stale connections around, we tried to poll for data coming in
from InfluxDB and time out if it didn't respond in a timely manner. This introduced
a race: the timeout triggers, a context switch occurs while data is actually
available, and the TlsStream spins trying to asynchronously notify that data is
available, but it never gets read. Not only does this use up 100% of a core,
it also slowly starves the system of handler threads, at which point metrics
stop being delivered.

This basically removes the poll and timeout; any TLS socket errors should be
detected by TCP keep-alives.

Fixes #5460 #5469
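For reference, a minimal sketch of enabling the TCP keep-alives the commit message relies on, using standard Linux socket options. The probe timings here are assumptions for illustration, not values the project ships:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enable kernel keep-alive probes so a dead peer eventually surfaces as a
// socket error on the next read/write. Timing values are illustrative.
int EnableKeepAlive(int fd)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

    int idle = 30;   // seconds of idle time before the first probe
    int intvl = 10;  // seconds between unanswered probes
    int cnt = 3;     // probes to send before declaring the peer dead

    // Linux-specific knobs; without them the kernel defaults apply
    // (typically 2 hours before the first probe).
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));

    return 0;
}
```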