Fix TLS Race Connecting to InfluxDB #5504

spjmurray · 2017-08-14T15:34:23Z

Rather than leaving stale connections about we tried to poll for data coming in
from InfluxDB and timeout if it didn't repond in a timely manner. This introduced
a race where the timeout triggers, a context switch occurs where data is actually
available and the TlsStream spins trying to asynchronously notify that data is
available, but which never gets read. Not only does this use up 100% of a core,
but it also slowly starves the system of handler threads at which point metrics
stop being delivered.

This basically removes the poll and timeout, any TLS socket erros should be
detected by TCP keep-alives.

Fixes #5460 #5469

Rather than leaving stale connections about we tried to poll for data coming in from InfluxDB and timeout if it didn't repond in a timely manner. This introduced a race where the timeout triggers, a context switch occurs where data is actually available and the TlsStream spins trying to asynchronously notify that data is available, but which never gets read. Not only does this use up 100% of a core, but it also slowly starves the system of handler threads at which point metrics stop being delivered. This basically removes the poll and timeout, any TLS socket erros should be detected by TCP keep-alives. Fixes Icinga#5460 Icinga#5469

spjmurray · 2017-08-15T08:11:09Z

And dare I say far more likely now I've got my early morning 4th dimension hat on, influxdb responds before we issue the poll, sit there waiting for an event that has already happened, timeout and same result.

Patch has been running in staging now for ~18 hours with no signs of recurrence 👍 testing wise. @dnsmichi if you would assign a reviewer

dnsmichi · 2017-08-15T08:22:53Z

@spjmurray thanks for the analysis and patch. Is there any chance that we reliably keep the socket_timout option, or do we need to just remove it (this is a breaking config change).

You've mentioned tcp keepalive ... is there any chance we could tell the InfluxDB HTTP API to close the connection if it is hanging?

spjmurray · 2017-08-15T08:37:02Z

The polling was in case influxdb didn't respond in the first place, so trying to tell it to hang up is probably not going to fly. I think what I originally intended was some way for TlsScoket::Read() to not block forever and cause a massive deadlock.

As regards the breaking option, we could pass it down to the socket event engine to be used in epoll_wait. However given it's used everywhere I'd say no because I don't want to break everything 😉

dnsmichi · 2017-08-16T10:11:03Z

Thanks, I have another idea - log a warning in 2.7.1 for socket_timeout being removed in 2.8, which is basically a no-op here. That way we can ensure that 2.7.x configs don't break, that's my concern.

In terms of hanging InfluxDB connections, we can see that with the feature's work queue item count increasing and whatnot.

dnsmichi · 2017-08-17T16:24:07Z

I'll add a specific commit to support/2.7 after I've cherry-picked your PR, thanks again!

dnsmichi

👍

Rather than leaving stale connections about we tried to poll for data coming in from InfluxDB and timeout if it didn't repond in a timely manner. This introduced a race where the timeout triggers, a context switch occurs where data is actually available and the TlsStream spins trying to asynchronously notify that data is available, but which never gets read. Not only does this use up 100% of a core, but it also slowly starves the system of handler threads at which point metrics stop being delivered. This basically removes the poll and timeout, any TLS socket erros should be detected by TCP keep-alives. Fixes #5460 #5469 refs #5504

… attribute for 2.7.x refs #5460 refs #5469 refs #5504

dnsmichi self-requested a review August 16, 2017 10:12

dnsmichi added bug Something isn't working area/influxdb Metrics to InfluxDB labels Aug 16, 2017

dnsmichi assigned spjmurray Aug 16, 2017

dnsmichi approved these changes Aug 17, 2017

View reviewed changes

dnsmichi added this to the 2.7.1 milestone Aug 17, 2017

dnsmichi merged commit cb94b21 into Icinga:master Aug 17, 2017

dnsmichi pushed a commit that referenced this pull request Aug 17, 2017

InfluxdbWriter: Log a warning about soon-to-be-removed socket_timeout…

6c577d6

… attribute for 2.7.x refs #5460 refs #5469 refs #5504

dnsmichi added the backported Fix was included in a bugfix release label Aug 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix TLS Race Connecting to InfluxDB #5504

Fix TLS Race Connecting to InfluxDB #5504

spjmurray commented Aug 14, 2017

spjmurray commented Aug 15, 2017

dnsmichi commented Aug 15, 2017 •

edited

Loading

spjmurray commented Aug 15, 2017

dnsmichi commented Aug 16, 2017

dnsmichi commented Aug 17, 2017

dnsmichi left a comment

Fix TLS Race Connecting to InfluxDB #5504

Fix TLS Race Connecting to InfluxDB #5504

Conversation

spjmurray commented Aug 14, 2017

spjmurray commented Aug 15, 2017

dnsmichi commented Aug 15, 2017 • edited Loading

spjmurray commented Aug 15, 2017

dnsmichi commented Aug 16, 2017

dnsmichi commented Aug 17, 2017

dnsmichi left a comment

Choose a reason for hiding this comment

dnsmichi commented Aug 15, 2017 •

edited

Loading