Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to connect to InfluxDB increases CPU utilisation by 100% for every failure #5469

Closed
ghost opened this issue Aug 7, 2017 · 10 comments
Assignees
Labels
area/influxdb Metrics to InfluxDB bug Something isn't working
Milestone

Comments

@ghost
Copy link

ghost commented Aug 7, 2017

Since the upgrade to 2.7.0 after a restart or shortly after a reload the connection to the InfluxDB fails and the CPU load of the icinga2 process increases by 100%

Expected Behavior

Performance data will be written to InfluxDB, CPU utilisation at a "sane" level

Current Behavior

Shortly after the restart icinga2 fails to connect to the InfluxDB instance (see #5460 ):

[2017-08-07 12:29:15 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'fd33:64e1:dd0d:9d95:4e7:1663:e620:839f' port '8086'.

Immediately after the warning the icinga2 processes uses up to 100% CPU for every InfluxdbWriter warning:

[...]
[2017-08-07 19:05:56 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'fd33:64e1:dd0d:9d95:4e7:1663:e620:839f' port '8086'.
[2017-08-07 19:06:47 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'fd33:64e1:dd0d:9d95:4e7:1663:e620:839f' port '8086'.
[2017-08-07 19:07:08 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'fd33:64e1:dd0d:9d95:4e7:1663:e620:839f' port '8086'.
[...]

Results in:

root@debian9164m ~ # top -n 1 -b -u nagios -c
top - 19:10:27 up 39 days,  6:08, 17 users,  load average: 3.12, 1.89, 1.10
Tasks: 341 total,   1 running, 340 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.8 us,  1.2 sy,  1.3 ni, 90.6 id,  1.9 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65806728 total,  2117476 free, 28213344 used, 35475908 buff/cache
KiB Swap: 33521660 total, 33521228 free,      432 used. 35839396 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28392 nagios    20   0 2382000  38312  21696 S 293.8  0.1  11:36.80 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon
28418 nagios    20   0  526240   5868   4560 S   0.0  0.0   0:00.24 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon

At this point the number of items in the InfluxdbWrite queue will increase and never flushed:

[2017-08-07 19:11:34 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 905, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 2 minutes and 9 seconds
[2017-08-07 19:11:44 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 993, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 1 minute and 52 seconds
[2017-08-07 19:11:54 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 1042, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 3 minutes and 32 seconds
[2017-08-07 19:12:04 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 1066, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 7 minutes and 24 seconds
[2017-08-07 19:12:14 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 1086, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 9 minutes and 2 seconds
[2017-08-07 19:12:24 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 1118, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 5 minutes and 49 seconds
[2017-08-07 19:12:34 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 1184, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 2 minutes and 59 seconds
[2017-08-07 19:12:44 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 1272, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 2 minutes and 24 seconds
[2017-08-07 19:12:54 +0200] information/WorkQueue: #8 (InfluxdbWriter, influxdb) items: 1312, rate: 4.41667/s (265/min 1401/5min 1401/15min); empty in 5 minutes and 27 seconds

influxdb_backlog

(X axis: time, Y axis: number of items in InfluxdbWriter queue)

InfluxDB Writer configuration:

root@debian9164m ~ # cat /etc/icinga2/features-available/influxdb.conf
/**
 * The InfluxdbWriter type writes check result metrics and
 * performance data to an InfluxDB HTTP API
 */

library "perfdata"

object InfluxdbWriter "influxdb" {
  host = "fd33:64e1:dd0d:9d95:4e7:1663:e620:839f"
  port = 8086
  database = "icinga2"
  username = "icinga2"
  password = "ItsSoFluffyImGONNADIE!SO_FLUFFY!"
  ssl_enable = true
  ssl_ca_cert = "/etc/ssl/certs/ca-ypbind.de.pem"
  host_template = {
    measurement = "$host.check_command$"
    tags = {
      hostname = "$host.name$"
    }
  }
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
    }
  }
}

Possible Solution

No known solution except downgrading Icinga2 to 2.6.x or disabling the InfluxDB writer.

Steps to Reproduce (for bugs)

  1. (Re-)Start icinga2
  2. Wait for the InfluxdbWriter to fail

Your Environment

  • Version used (icinga2 --version):
root@debian9164m ~ # icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.7.0-1)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-3-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: debian9
  • Operating System and version: Debian 9, x86_64
  • Enabled features (icinga2 feature list):
root@debian9164m ~ # icinga2 feature list
Disabled features: compatlog debuglog gelf graphite opentsdb perfdata statusdata
Enabled features: api checker command ido-pgsql influxdb livestatus mainlog notification syslog
  • Config validation (icinga2 daemon -C):
root@debian9164m ~ # icinga2 daemon -C
information/cli: Icinga application loader (version: r2.7.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
information/ApiListener: My API identity: debian9164m.ypbind.de
warning/ApplyRule: Apply rule 'smtp' (in /etc/icinga2/zones.d/global-configuration/services/local_smtp.conf: 1:0-1:19) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'remote_mdraid_mismatch' (in /etc/icinga2/zones.d/global-configuration/services/mdraid_mismatch.conf: 1:0-1:37) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'check_nrpe' (in /etc/icinga2/zones.d/global-configuration/services/nrpe.conf: 1:0-1:67) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'opentracker' (in /etc/icinga2/zones.d/global-configuration/services/opentracker.conf: 1:0-1:26) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'saltminion' (in /etc/icinga2/zones.d/global-configuration/services/salt_minion.conf: 1:0-1:25) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'xmpp' (in /etc/icinga2/zones.d/global-configuration/services/xmpp.conf: 1:0-1:19) for type 'Service' does not match anywhere!
information/ConfigItem: Instantiated 4 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 1 SyslogLogger.
information/ConfigItem: Instantiated 11 Zones.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 10 Endpoints.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 277 Notifications.
information/ConfigItem: Instantiated 228 CheckCommands.
information/ConfigItem: Instantiated 14 Hosts.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 2 HostGroups.
information/ConfigItem: Instantiated 1 UserGroup.
information/ConfigItem: Instantiated 3 TimePeriods.
information/ConfigItem: Instantiated 1 User.
information/ConfigItem: Instantiated 263 Services.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
information/ConfigItem: Instantiated 1 InfluxdbWriter.
information/ConfigItem: Instantiated 1 LivestatusListener.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
@dnsmichi
Copy link
Contributor

dnsmichi commented Aug 8, 2017

What happens if you disable ssl? Could be the same issue as #5460.

@dnsmichi dnsmichi added bug Something isn't working area/influxdb Metrics to InfluxDB needs feedback We'll only proceed once we hear from you again labels Aug 8, 2017
@ghost
Copy link
Author

ghost commented Aug 8, 2017

At a first glance, disabling SSL will not trigger the warning (yet ?) and at least I can see writes to the database:

[I] 2017-08-08T17:57:54Z /var/run/influxdb/data/icinga2/autogen/2 database index loaded in 5.058µs service=shard
[httpd] fd33:64e1:dd0d:9d95:4e7:1663:e620:839f - icinga2 [08/Aug/2017:19:57:54 +0200] "POST /write?db=icinga2&p=%5BREDACTED%5D&precision=s&u=icinga2 HTTP/1.1" 204 0 "-" "Icinga/r2.7.0-1" 1aae83b5-7c63-11e7-8007-000000000000 5163
[httpd] fd33:64e1:dd0d:9d95:4e7:1663:e620:839f - icinga2 [08/Aug/2017:19:57:56 +0200] "POST /write?db=icinga2&p=%5BREDACTED%5D&precision=s&u=icinga2 HTTP/1.1" 204 0 "-" "Icinga/r2.7.0-1" 1b9ca325-7c63-11e7-8008-000000000000 6353
[httpd] fd33:64e1:dd0d:9d95:4e7:1663:e620:839f - icinga2 [08/Aug/2017:19:58:06 +0200] "POST /write?db=icinga2&p=%5BREDACTED%5D&precision=s&u=icinga2 HTTP/1.1" 204 0 "-" "Icinga/r2.7.0-1" 21939da7-7c63-11e7-8009-000000000000 2688
[httpd] fd33:64e1:dd0d:9d95:4e7:1663:e620:839f - icinga2 [08/Aug/2017:19:58:16 +0200] "POST /write?db=icinga2&p=%5BREDACTED%5D&precision=s&u=icinga2 HTTP/1.1" 204 0 "-" "Icinga/r2.7.0-1" 278a1671-7c63-11e7-800a-000000000000 1162

The performance data are written:

> select * from procs where hostname='mqtt-bridge.internal'
name: procs
time                hostname                metric service      value
----                --------                ------ -------      -----
1502214978000000000 mqtt-bridge.internal procs  remote_procs 106
1502215038000000000 mqtt-bridge.internal procs  remote_procs 104
1502215098000000000 mqtt-bridge.internal procs  remote_procs 104
1502215158000000000 mqtt-bridge.internal procs  remote_procs 104

I will let it run for a day and see if it works. If it does the problem can be pinpointed to the combination of InfluxDB + SSL.

@dnsmichi
Copy link
Contributor

dnsmichi commented Aug 9, 2017

Thanks, that helps a lot. Keep going 👍 Meanwhile the discussion seems to go the same route in #5460 .. let's see where to head at. @spjmurray - just a quick ping, any ideas on SSL with InfluxDB?

@spjmurray
Copy link
Contributor

I will covertly take a day off from openstack concerns and have a look at this tomorrow @dnsmichi 😉 Always happy to do a bit of C++ instead of python

@ghost
Copy link
Author

ghost commented Aug 11, 2017

Icinga 2.7.0 + InfluxDB without SSL seem to work. It runs for 2 day with a maximal peak in CPU utilisation of about 1.2%. Read: Normal ;)

@dnsmichi
Copy link
Contributor

@spjmurray cool, thanks :)

@Bobobo-bo-Bo-bobo thanks for your feedback, much appreciated :)

@spjmurray
Copy link
Contributor

I've recreated no problem in staging, perf top shows some form of threading deadlock. I'm poking about with gbd to find the culprit now

@dnsmichi dnsmichi added this to the 2.7.1 milestone Aug 11, 2017
spjmurray added a commit to spjmurray/icinga2 that referenced this issue Aug 14, 2017
Rather than leaving stale connections about we tried to poll for data coming in
from InfluxDB and timeout if it didn't repond in a timely manner.  This introduced
a race where the timeout triggers, a context switch occurs where data is actually
available and the TlsStream spins trying to asynchronously notify that data is
available, but which never gets read.  Not only does this use up 100% of a core,
but it also slowly starves the system of handler threads at which point metrics
stop being delivered.

This basically removes the poll and timeout, any TLS socket erros should be
detected by TCP keep-alives.

Fixes Icinga#5460 Icinga#5469
@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Aug 16, 2017
dnsmichi pushed a commit that referenced this issue Aug 17, 2017
Rather than leaving stale connections about we tried to poll for data coming in
from InfluxDB and timeout if it didn't repond in a timely manner.  This introduced
a race where the timeout triggers, a context switch occurs where data is actually
available and the TlsStream spins trying to asynchronously notify that data is
available, but which never gets read.  Not only does this use up 100% of a core,
but it also slowly starves the system of handler threads at which point metrics
stop being delivered.

This basically removes the poll and timeout, any TLS socket erros should be
detected by TCP keep-alives.

Fixes #5460 #5469

refs #5504
dnsmichi pushed a commit that referenced this issue Aug 17, 2017
@dnsmichi
Copy link
Contributor

Please test the snapshot packages, or the referenced patch with your setups. Thanks.

@ghost
Copy link
Author

ghost commented Aug 18, 2017

New binaries (build from commit d075665) are now working fine for about 12 hours. ;-)

@dnsmichi
Copy link
Contributor

Cool, thanks for your fast feedback 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/influxdb Metrics to InfluxDB bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants