Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TLS Race Connecting to InfluxDB #5504

Merged
merged 1 commit into from
Aug 17, 2017

Conversation

spjmurray
Copy link
Contributor

Rather than leaving stale connections about we tried to poll for data coming in
from InfluxDB and timeout if it didn't repond in a timely manner. This introduced
a race where the timeout triggers, a context switch occurs where data is actually
available and the TlsStream spins trying to asynchronously notify that data is
available, but which never gets read. Not only does this use up 100% of a core,
but it also slowly starves the system of handler threads at which point metrics
stop being delivered.

This basically removes the poll and timeout, any TLS socket erros should be
detected by TCP keep-alives.

Fixes #5460 #5469

Rather than leaving stale connections about we tried to poll for data coming in
from InfluxDB and timeout if it didn't repond in a timely manner.  This introduced
a race where the timeout triggers, a context switch occurs where data is actually
available and the TlsStream spins trying to asynchronously notify that data is
available, but which never gets read.  Not only does this use up 100% of a core,
but it also slowly starves the system of handler threads at which point metrics
stop being delivered.

This basically removes the poll and timeout, any TLS socket erros should be
detected by TCP keep-alives.

Fixes Icinga#5460 Icinga#5469
@spjmurray
Copy link
Contributor Author

And dare I say far more likely now I've got my early morning 4th dimension hat on, influxdb responds before we issue the poll, sit there waiting for an event that has already happened, timeout and same result.

Patch has been running in staging now for ~18 hours with no signs of recurrence 👍 testing wise. @dnsmichi if you would assign a reviewer

@dnsmichi
Copy link
Contributor

dnsmichi commented Aug 15, 2017

@spjmurray thanks for the analysis and patch. Is there any chance that we reliably keep the socket_timout option, or do we need to just remove it (this is a breaking config change).

You've mentioned tcp keepalive ... is there any chance we could tell the InfluxDB HTTP API to close the connection if it is hanging?

@spjmurray
Copy link
Contributor Author

The polling was in case influxdb didn't respond in the first place, so trying to tell it to hang up is probably not going to fly. I think what I originally intended was some way for TlsScoket::Read() to not block forever and cause a massive deadlock.

As regards the breaking option, we could pass it down to the socket event engine to be used in epoll_wait. However given it's used everywhere I'd say no because I don't want to break everything 😉

@dnsmichi
Copy link
Contributor

Thanks, I have another idea - log a warning in 2.7.1 for socket_timeout being removed in 2.8, which is basically a no-op here. That way we can ensure that 2.7.x configs don't break, that's my concern.

In terms of hanging InfluxDB connections, we can see that with the feature's work queue item count increasing and whatnot.

@dnsmichi dnsmichi self-requested a review August 16, 2017 10:12
@dnsmichi dnsmichi added bug Something isn't working area/influxdb Metrics to InfluxDB labels Aug 16, 2017
@dnsmichi
Copy link
Contributor

I'll add a specific commit to support/2.7 after I've cherry-picked your PR, thanks again!

Copy link
Contributor

@dnsmichi dnsmichi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@dnsmichi dnsmichi added this to the 2.7.1 milestone Aug 17, 2017
@dnsmichi dnsmichi merged commit cb94b21 into Icinga:master Aug 17, 2017
dnsmichi pushed a commit that referenced this pull request Aug 17, 2017
Rather than leaving stale connections about we tried to poll for data coming in
from InfluxDB and timeout if it didn't repond in a timely manner.  This introduced
a race where the timeout triggers, a context switch occurs where data is actually
available and the TlsStream spins trying to asynchronously notify that data is
available, but which never gets read.  Not only does this use up 100% of a core,
but it also slowly starves the system of handler threads at which point metrics
stop being delivered.

This basically removes the poll and timeout, any TLS socket erros should be
detected by TCP keep-alives.

Fixes #5460 #5469

refs #5504
dnsmichi pushed a commit that referenced this pull request Aug 17, 2017
@dnsmichi dnsmichi added the backported Fix was included in a bugfix release label Aug 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/influxdb Metrics to InfluxDB backported Fix was included in a bugfix release bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants