[dev.icinga.com #11806] Checks are not executed anymore on command #4224

Closed
icinga-migration opened this issue May 18, 2016 · 26 comments

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11806

Created by hostedpower on 2016-05-18 12:00:10 +00:00

Assignee: mfrosch
Status: Resolved (closed on 2016-05-19 11:25:04 +00:00)
Target Version: 2.4.9
Last Update: 2016-05-19 18:19:56 +00:00 (in Redmine)

Icinga Version: r2.4.8-1
Backport?: Not yet backported
Include in Changelog: 1

Hi,

Since the latest release, some checks don't seem to be executed anymore. I had an apt upgrade pending; I upgraded the systems, and since then the status has not been updated.

Clicking "Check now" on the service in the web GUI does not seem to do anything.

In the end I restarted the icinga2 daemon on the monitoring server, I went back to the web gui, I clicked on "Check now" and finally they were checked.

A few hours later I tried the same for another host, but it failed again. Restarting icinga2 on the monitoring server fixed it again :|

I'm not sure if it could be related to this new feature: https://dev.icinga.org/issues/8137

Kind regards
Jo

Changesets

2016-05-18 12:30:36 +00:00 by gbeutner b99b373

Fix 100% CPU usage issue and incorrect pending checks accounting in CheckerComponent::CheckThreadProc

fixes #11806

2016-05-19 11:15:00 +00:00 by gbeutner 232c299

Fix 100% CPU usage issue and incorrect pending checks accounting in CheckerComponent::CheckThreadProc

fixes #11806

Relations:

@icinga-migration

Updated by gbeutner on 2016-05-18 12:02:40 +00:00

  • Relates set to 8137

@icinga-migration

Updated by hostedpower on 2016-05-18 12:21:26 +00:00

CPU usage now also spikes on the monitoring server. It's icinga2 that is consuming all the CPU.

top - 14:20:07 up 19 days, 23:18, 2 users, load average: 2.72, 2.72, 2.79
Tasks: 88 total, 1 running, 87 sleeping, 0 stopped, 0 zombie
%Cpu(s): 27.0 us, 45.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.6 si, 26.6 st
KiB Mem: 2015544 total, 1871788 used, 143756 free, 128168 buffers
KiB Swap: 2097148 total, 63716 used, 2033432 free. 657496 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
814 nagios 20 0 1134408 73096 21852 S 158.0 3.6 37:51.34 icinga2
3 root 20 0 0 0 0 S 32.5 0.0 1438:32 ksoftirqd/0
4758 mysql 20 0 1675192 420764 4216 S 3.3 20.9 328:35.82 mysqld
26022 www-data 20 0 415112 41044 23364 S 3.0 2.0 0:49.73 apache2
17425 n2 20 0 15276 2704 1548 S 2.3 0.1 71:41.06 n2txd
97 root 20 0 0 0 0 S 0.3 0.0 15:43.13 jbd2/xvda1-8
1 root 20 0 28848 4216 2652 S 0.0 0.2 0:26.29 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H

@icinga-migration

Updated by gbeutner on 2016-05-18 12:28:50 +00:00

  • Category set to Checker
  • Status changed from New to Assigned
  • Assigned to set to gbeutner
  • Target Version set to 2.4.9

Are you using command_endpoint?

@icinga-migration

Updated by gbeutner on 2016-05-18 12:31:59 +00:00

Nvm, I think I found the problem. Can you test whether this still happens with the latest master branch?

@icinga-migration

Updated by gbeutner on 2016-05-18 12:35:05 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Applied in changeset b99b373.

@icinga-migration

Updated by hostedpower on 2016-05-18 12:39:33 +00:00

Hi,

I probably need to compile from source? It's a production server that only uses Debian packages. I'm not sure what the easiest way is to get back to a stable situation :(

Jo

@icinga-migration

Updated by hostedpower on 2016-05-18 12:43:43 +00:00

PS: from what I can see, many many checks are not working atm. Many checks should have been executed again but are not.

I think it's so severe that a bugfix should be sent out through the apt packages as well, ASAP ...

@icinga-migration

Updated by hostedpower on 2016-05-18 12:55:14 +00:00

PS PS: if this new setting is not used, I assume it works like before? Does this mean there is no limit, or how does it work? Most of the checks we use do indeed run on the remote servers themselves (using command_endpoint).

@icinga-migration

Updated by gbeutner on 2016-05-18 12:58:11 +00:00

We're going to release a fix for this tomorrow (i.e. 2.4.9). As a temporary workaround you can set the concurrent_checks option to a fairly large number (e.g. 4294967294).
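
For readers who want to see the workaround spelled out, here is a minimal sketch of how the option could be set. It assumes the checker feature is configured in features-available/checker.conf (under /etc/icinga2 on a typical package install; the path is an assumption, not something stated in this thread) and uses 10000, the value a later comment reports as accepted, rather than 4294967294.

```
/* Hypothetical edit to features-available/checker.conf.
 * Leave any other lines already present in the file (for example a
 * `library "checker"` line) untouched. */
object CheckerComponent "checker" {
  /* Temporary workaround only: raise the limit on concurrent checks. */
  concurrent_checks = 10000
}
```

The configuration would presumably need to be validated (for example with `icinga2 daemon -C`) and the daemon restarted before the new value takes effect. Later comments in this thread indicate that the workaround did not stop the CPU usage from climbing, so it should not be read as a substitute for the fixed release.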

@icinga-migration

Updated by hostedpower on 2016-05-18 13:14:03 +00:00

Hi,

Thanks a lot for the quick response and fix.

I assume 4294967294 is too high since it didn't work. Setting it to 10000 seems to work for now, but the CPU usage is still quite high. I assume the new version fixed that as well?
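
The thread never explains why 4294967294 was not accepted. One possible explanation, offered purely as an assumption and not confirmed by the maintainers, is that the option is held in a signed 32-bit integer: 4294967294 equals 2^32 - 2, which does not fit in that type and would come out as -2 under two's-complement wraparound. The short Python sketch below only illustrates that arithmetic; `as_int32` is a hypothetical helper and not part of Icinga 2.

```python
def as_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit two's-complement value.

    Hypothetical helper for illustration only; it says nothing about how
    Icinga 2 actually parses or stores concurrent_checks.
    """
    return (value + 2**31) % 2**32 - 2**31


# The value suggested as a workaround is 2**32 - 2.
suggested = 4294967294
assert suggested == 2**32 - 2

# Under the signed 32-bit assumption it wraps around to a negative number,
# while a moderate value such as 10000 is unchanged.
print(as_int32(suggested))
print(as_int32(10000))
```

If that assumption holds, any value up to 2147483647 (2^31 - 1) would avoid the wraparound, which is consistent with 10000 being accepted.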

@icinga-migration

Updated by hostedpower on 2016-05-18 13:38:01 +00:00

Just want to let you know that CPU usage keeps increasing, even with concurrent_checks = 10000:

top - 15:34:57 up 20 days, 33 min, 2 users, load average: 2.55, 1.62, 1.31
Tasks: 90 total, 2 running, 88 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32.3 us, 67.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.6 st
KiB Mem: 2015544 total, 1929224 used, 86320 free, 139976 buffers
KiB Swap: 2097148 total, 63312 used, 2033836 free. 664128 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3272 nagios 20 0 1134928 77712 21928 S 158.4 3.9 25:22.40 icinga2
3 root 20 0 0 0 0 R 39.5 0.0 1447:26 ksoftirqd/0
9 root rt 0 0 0 0 S 5.6 0.0 0:32.10 migration/0

Sorry that I cannot apply the fix for now, just want to be sure you took this into account as well with the fix coming tomorrow :)

Wouldn't it be a good idea to keep the latest and one older version available in the debmon project packages? Not sure who's responsible for this, just a thought. Atm I cannot revert to 2.4.7 afaik, since the older packages have already been purged from debmon.

@icinga-migration

Updated by gbeutner on 2016-05-18 13:38:25 +00:00

  • Duplicated set to 11808

@icinga-migration

Updated by mfrosch on 2016-05-18 18:53:30 +00:00

  • Status changed from Resolved to Feedback
  • Priority changed from Normal to Urgent

On v2.4.8-413-g4af6bde

After about an hour, all checks freeze, no apparent CPU usage

@icinga-migration

Updated by mfrosch on 2016-05-18 18:53:44 +00:00

  • Status changed from Feedback to New

@icinga-migration

Updated by gbeutner on 2016-05-19 05:05:13 +00:00

  • Status changed from New to Feedback
  • Assigned to changed from gbeutner to mfrosch

That's not the latest commit. Please test with b99b373 or later.

@icinga-migration

Updated by mfrosch on 2016-05-19 07:46:21 +00:00

Had to trigger Jenkins; watching my test system now.

@icinga-migration

Updated by hostedpower on 2016-05-19 08:44:39 +00:00

Checks halted here too, CPU usage very high as well :( I hope the fix solves this all :)

@icinga-migration

Updated by gbeutner on 2016-05-19 11:25:04 +00:00

  • Status changed from Feedback to Resolved

Applied in changeset 232c299.

@icinga-migration

Updated by gbeutner on 2016-05-19 11:31:10 +00:00

  • Relates set to 11812

@icinga-migration

Updated by a2yp on 2016-05-19 12:24:54 +00:00

We still have this problem with versions:

  • 2.4.9-1~debmon8+1
  • 2.4.9-1~debmon70+1

=> Checks are not performed after some minutes
=> High CPU Load

@icinga-migration

Updated by hostedpower on 2016-05-19 13:04:42 +00:00

Same here, the issues are not fixed at all! Checks have not been refreshed since 12:45. This is terrible :(

I would like to go back to 2.4.7 asap.

@icinga-migration

Updated by a2yp on 2016-05-19 13:12:56 +00:00

hostedpower wrote:

I would like to go back to 2.4.7 asap.

Unfortunately, this version is no longer available in debmon.

@icinga-migration

Updated by hostedpower on 2016-05-19 13:15:07 +00:00

Indeed, it would be better to keep some versions there :(

@icinga-migration

Updated by Isotop7 on 2016-05-19 14:28:36 +00:00

a2yp wrote:

We still have this problem with versions:

  • 2.4.9-1~debmon8+1
  • 2.4.9-1~debmon70+1

=> Checks are not performed after some minutes
=> High CPU Load

same here...

IMHO this isn't resolved.
I haven't been able to use checks or Grafana graphing since yesterday...

EDIT: version 2.4.10 seems to fix it for me! The daemon is running for 10 minutes straight with no problems whatsoever.

@icinga-migration

Updated by a2yp on 2016-05-19 15:04:21 +00:00

Version 2.4.10 seems to fix this issue. Icinga has been running here without problems for about 40 minutes.

THX

@icinga-migration

Updated by hostedpower on 2016-05-19 18:19:56 +00:00

It seems to be solved with 2.4.10 here as well! Did you revert all the changes, or simply fix the issues somehow? :)

@icinga-migration added the Urgent and bug labels Jan 17, 2017
@icinga-migration added this to the 2.4.9 milestone Jan 17, 2017