Failure to connect to InfluxDB increases CPU utilisation by 100% for every failure #5469
Comments
What happens if you disable SSL? Could be the same issue as #5460.
At first glance, disabling SSL does not trigger the warning (yet?), and at least I can see writes to the database. The performance data are written. I will let it run for a day and see if it works. If it does, the problem can be pinpointed to the combination of InfluxDB + SSL.
Thanks, that helps a lot. Keep going 👍 Meanwhile the discussion seems to go the same route in #5460... let's see where it heads. @spjmurray - just a quick ping: any ideas on SSL with InfluxDB?
I will covertly take a day off from OpenStack concerns and have a look at this tomorrow @dnsmichi 😉 Always happy to do a bit of C++ instead of Python
Icinga 2.7.0 + InfluxDB without SSL seems to work. It has run for two days with a peak CPU utilisation of about 1.2%. Read: normal ;)
@spjmurray cool, thanks :) @Bobobo-bo-Bo-bobo thanks for your feedback, much appreciated :)
I've recreated it in staging, no problem.
Rather than leaving stale connections about, we tried to poll for data coming in from InfluxDB and time out if it didn't respond in a timely manner. This introduced a race: the timeout triggers, a context switch occurs where data actually becomes available, and the TlsStream spins trying to asynchronously notify that data is available, but it never gets read. Not only does this use up 100% of a core, it also slowly starves the system of handler threads, at which point metrics stop being delivered. This basically removes the poll and timeout; any TLS socket errors should be detected by TCP keep-alives. Fixes #5460 #5469 refs #5504
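For illustration only, the keep-alive approach the commit message describes could look roughly like the following minimal POSIX sketch. This is an assumption for clarity, not the actual Icinga 2 patch; the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT option names are Linux-specific.

```cpp
// Minimal sketch: rather than polling with a timeout (and risking the
// busy-spin described above), enable TCP keep-alives so the kernel
// detects dead peers on our behalf.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

bool EnableTcpKeepAlive(int fd)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return false;

    // Illustrative tuning values: first probe after 30s of idle time,
    // probe every 10s, declare the peer dead after 3 unanswered probes.
    int idle = 30, intvl = 10, cnt = 3;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    return true;
}
```

With this in place a hung InfluxDB connection surfaces as an ordinary socket error on the next read or write, instead of a timeout path that can race with data arriving.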
Please test the snapshot packages or the referenced patch with your setups. Thanks.
New binaries (built from commit d075665) have now been running fine for about 12 hours. ;-)
Cool, thanks for your fast feedback 👍
Since the upgrade to 2.7.0, after a restart or shortly after a reload the connection to InfluxDB fails and the CPU load of the icinga2 process increases by 100%.

Expected Behavior
Performance data is written to InfluxDB and CPU utilisation stays at a "sane" level.
Current Behavior
Shortly after the restart, icinga2 fails to connect to the InfluxDB instance (see #5460):

[2017-08-07 12:29:15 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'fd33:64e1:dd0d:9d95:4e7:1663:e620:839f' port '8086'.
Immediately after the warning, the icinga2 process uses an additional 100% CPU for every InfluxdbWriter warning. Results in:

(process/CPU output not captured in this copy of the report)
At this point the number of items in the InfluxdbWriter queue increases and is never flushed:

(Graph not captured; X axis: time, Y axis: number of items in the InfluxdbWriter queue)

InfluxDB Writer configuration:
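The configuration block itself was not captured in this copy of the report. A representative InfluxdbWriter definition, reconstructed from the details above (InfluxDB on port 8086, SSL enabled), might look like the following sketch; the database name and flush settings are assumptions, not taken from the report:

```
/* Hypothetical reconstruction, not the reporter's actual configuration.
   Host and port are taken from the warning above. */
object InfluxdbWriter "influxdb" {
  host = "fd33:64e1:dd0d:9d95:4e7:1663:e620:839f"
  port = 8086
  database = "icinga2"     // assumed database name
  ssl_enable = true        // SSL on, matching the failing setup
  flush_threshold = 1024   // assumed defaults
  flush_interval = 10s
}
```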
Possible Solution
No known solution except downgrading Icinga 2 to 2.6.x or disabling the InfluxDB writer.
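The latter workaround can be applied with the standard feature CLI; a sketch, assuming the writer is enabled as the influxdb feature on a systemd-based distribution:

```
# Workaround sketch: disable the InfluxDB writer feature, then reload
# the daemon so the change takes effect.
icinga2 feature disable influxdb
systemctl reload icinga2
```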
Steps to Reproduce (for bugs)

1. Restart icinga2
2. Wait for the InfluxdbWriter connection to InfluxDB to fail

Your Environment
Version used (icinga2 --version): 2.7.0
Enabled features (icinga2 feature list):
Config validation (icinga2 daemon -C):