Icinga2 sends notification during reload (despite downtime) and does not log it #6846

ekeih · 2018-12-19T16:54:00Z

This is similar to #6057. It is even the exact same Icinga2 setup 😉

Expected Behavior

Icinga2 should not send a notification for a service that is in downtime (and active checks disabled) during a reload.
When Icinga2 sends a notification it should at least log it.

Current Behavior

We have two masters (master01 and master02) which are responsible for the notifications in our cluster.

Service has 7 configured contacts
Service is critical
All 7 contacts receive notifications
Service is in downtime (configured via API)
Active checks are disabled (configured via API)
Nothing changes for 30 days and Icinga2 reloads every 30 minutes during this time. I do not think that the number of days is relevant but it shows that it works most of the time.
Icinga2 reloads

# master01
...
[2018-12-15 20:26:05 +0100] information/ExternalCommandListener: 'command' stopped.
[2018-12-15 20:26:19 +0100] information/FileLogger: 'main-log' started.
[2018-12-15 20:26:22 +0100] information/ApiListener: 'api' started.
...

# master02
[2018-12-15 20:26:23 +0100] information/ApiListener: Restarting after configuration change.

20:26:43 master02 sends a notification to one contact
This should not happen because the service is still in downtime (and also the active checks are disabled). I guess it only happened for one contact because the other notification objects seem to be "scheduled" on master01.
Icinga2 does not log this notification, but our notification handler does. The notification is also not visible in Icingaweb2.
20:31:10 we acknowledge the alert
All 7 contacts receive a notification. This seems to be correct because all contacts received the critical alert 30 days ago.

Possible Solution

I guess that it is again some issue with the startup order or a race condition. But I have no specific idea.

Steps to Reproduce (for bugs)

It seems to happen at random times. I do not even know how often. Currently I only have this one example and no idea how to reproduce it.

Context

We have 2 masters and 6 satellites (4 zones, 2x2 and 2x1 satellites) and deploy our configuration via Puppet on master01 at xx:22. (In this case at 20:22.) Usually Icinga2 selects master02 as the IDO-DB master.

Your Environment

Version used (icinga2 --version): r2.10.2-1
Operating System and version: Ubuntu 16.04
Enabled features (icinga2 feature list): api checker command ido-mysql mainlog notification
Icinga Web 2 version and modules (System - About): 2.6.1
Config validation (icinga2 daemon -C):

[2018-12-19 17:40:51 +0100] information/cli: Icinga application loader (version: r2.10.2-1)
[2018-12-19 17:40:51 +0100] information/cli: Loading configuration file(s).
[2018-12-19 17:40:53 +0100] information/ConfigItem: Committing config item(s).
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Please read the upgrading documentation for v2.8: https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2018-12-19 17:40:53 +0100] information/ApiListener: My API identity: icinga2.live.master01
[2018-12-19 17:41:03 +0100] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 30.5333/s (1832/min 1832/5min 1832/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 636 ScheduledDowntimes.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 63109 Services.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 2523 Hosts.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 52704 Dependencies.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 49410 Notifications.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 339 HostGroups.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8603 Downtimes.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 522 Comments.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 6 Zones.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8 Endpoints.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8 ApiUsers.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 331 Users.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 14 CheckCommands.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 201 UserGroups.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ServiceGroup.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-12-19 17:41:10 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-12-19 17:41:10 +0100] information/cli: Finished validating the configuration file(s).

Let me know if you think that a detailed configuration of the zones end endpoints would be helpful or if I can do something to help.

The text was updated successfully, but these errors were encountered:

widhalmt · 2019-01-23T08:16:57Z

ref/NC/597231

ekeih · 2019-01-23T10:52:13Z

@widhalmt Any chance that the internal ticket contains any useful information to reproduce this issue or something else? Let me know if I can do anything to help with this problem.

widhalmt · 2019-01-23T12:12:50Z

@ekeih Sorry. All we got so far was that the setup in question hit the same problem. I'm still evaluating but I already escaleted the ticket internally so it might well be that someone else will come back to you for details.

ekeih · 2019-01-23T12:41:36Z

Thanks for the info! 🙂

Al2Klimov · 2019-01-25T12:16:54Z

Whiteboxtested v2.10.2 with the following result:

NotificationComponent#NotificationTimerHandler() considers Checkable#IsInDowntime() which looks for downtimes added via Checkable#RegisterDowntime() in Downtime#Start() on downtime activation. The present downtimes always register themselves successfully as the checkables are loaded before the downtimes and the downtimes are activated after loading. In addition NotificationComponents are activated after downtimes.

(Thanks to @Crunsher for consultation.)

TL;DR: The code seems clean. CC @lippserd

@ekeih Please could you provide debug logs around the time the notification was send? And: Are you sure that Icinga 2 calls the notification script?

SimonHoenscheid · 2019-01-28T12:14:04Z

We also see this behavior in our setup too.

lippserd · 2019-01-28T12:30:11Z

@SimonHoenscheid Do you have some debug logs and information about your setup?

ekeih · 2019-01-28T12:36:43Z

Please could you provide debug logs around the time the notification was send?

Currently not. Last time we used the debug log it grew extremely fast. So I am not able to enable it for several days in production.
I will try to gather more information in our testing environment but it's quite hard without a way to actively reproduce it.

And: Are you sure that Icinga 2 calls the notification script?

Yes, I am 100% sure that it is called by Icinga2.

jschanz · 2019-01-29T13:28:59Z

Found that issue, because we had the same problem in the past.
During a reload the notification script is sometimes triggered even though a downtime is set for a service.
This happens not for every service and it's not reproducible.

[2019-01-25 18:40:41 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 3, rate: 2495.82/s (149749/min 780477/5min 2344675/15min);
[2019-01-25 18:40:51 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 1, rate: 2505.15/s (150309/min 780335/5min 2344155/15min);
[2019-01-25 18:41:01 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 4, rate: 2514.7/s (150882/min 780287/5min 2347299/15min);
[2019-01-25 18:41:02 +0100] information/Application: Reload requested, letting new process take over.
[2019-01-25 18:41:02 +0100] information/ApiListener: 'api' stopped.
[2019-01-25 18:41:02 +0100] information/CheckerComponent: 'checker' stopped.
[2019-01-25 18:41:03 +0100] information/ExternalCommandListener: 'command' stopped.
[2019-01-25 18:42:01 +0100] information/FileLogger: 'main-log' started.
[2019-01-25 18:42:10 +0100] information/WorkQueue: #10 (DaemonCommand::Run) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2019-01-25 18:42:13 +0100] information/ApiListener: 'api' started.
[2019-01-25 18:42:13 +0100] information/ApiListener: Copying 3 zone configuration files for zone 'xx.yy.zz.de' to '/var/lib/icinga2/api/zones/xx.yy.zz.de'.
[2019-01-25 18:42:13 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/xx.yy.zz.de' (0 Bytes). Received timestamp '2019-01-25 18:42:13 +0100' (1548438133.171498), Current timestamp '2019-01-25 11:50:50 +0100' (1548413450.769627).
[2019-01-25 18:42:13 +0100] information/ApiListener: Copying 12 zone configuration files for zone 'director-global' to '/var/lib/icinga2/api/zones/director-global'.

Historie in Icingweb2 is empty for that timestamp ...

... SMS-Gateway shows Icinga has triggered the notification.

Systemd logs something like this

Jan 25 18:41:02 localhost.localdomain systemd[1]: icinga2.service: Supervising process 32212 which is not our child. We'll most likely not notice when it exits.

We're running a single master with several zones. So I think it's not a cluster problem ...

jschanz · 2019-01-29T13:53:34Z

Another idea ... Is it possible, that the newly spawn Icinga2 process will send this notification while starting up? Because ACKs sent via API are shown in Icingaweb2:

The notification script automatically sends an ACK to the API for the triggered service:

if ( $response->is_success ) {
  print $response->decoded_content . "\n";
  my $retval = $response->decoded_content;
  $post_data = "\"type\": \"Host\", \"filter\": \"host.name==\\\"$icinga2_host_name\\\"\", \"author\": \"Notification-Manager\", \"comment\": \"Es wurde erfolgreich eine SMS an $smsgw_pager ($icinga2_user_display_name) versendet. Status der SMS: $retval.\", \"notify\": false, \"sticky\": true";
  my $command = "curl -u '$icinga2_api_user:$icinga2_api_password' -H 'Accept: application/json' -X POST -k '$endpoint_url' -d '{ $post_data}'";
  exec($command);
} else {
  my $retval = $response->decoded_content;
  my $retmsg = $response->status_line;
  $post_data = "\"type\": \"Host\", \"filter\": \"host.name==\\\"$icinga2_host_name\\\"\", \"author\": \"Notification-Manager\", \"comment\": \"Es trat ein Fehler beim versenden der SMS an $smsgw_pager ($icinga2_user_display_name) auf. Status der SMS: $retval ($retmsg).\", \"notify\": false, \"sticky\": true";
  my $command = "curl -u '$icinga2_api_user:$icinga2_api_password' -H 'Accept: application/json' -X POST -k '$endpoint_url' -d '{ $post_data}'";
  exec($command);

  print $response->decoded_content . "\n";
  die $response->status_line . "\n";
}

So the API accepts and process connections and events, but logs nothing to /var/log/icinga2/icinga2.log.

The reload on 25th at ~ 18:40 was triggered by Director:

As I understand the state of Icinga2 is written to
/var/lib/icinga2/icinga2.state during a restart.

So ... Which Icinga2-Core receives the API-Call during reload in that case? The old or the new one? Why is an API-Call persisten, but nothing is logged or written to ido?
Any ideas?

Update:

After digging through the notificationcomponent code as @Al2Klimov described, I found, that the following isn't never logged to the /var/log/icinga2/icinga2.log in our environments:

void NotificationComponent::Stop(bool runtimeRemoved)
{
	Log(LogInformation, "NotificationComponent")
		<< "'" << GetName() << "' stopped.";

	ObjectImpl<NotificationComponent>::Stop(runtimeRemoved);
}

With a

systemctl restart icinga2

only the startup of the component is logged:

[2019-01-29 15:11:45 +0100] information/NotificationComponent: 'notification' started.

A reload with systemd or director also brings up the NotificationComponent message in the logfile.
At January, 25th, the NotificationComponent was also started about a minute after the "notification" was triggered.

[2019-01-25 18:42:14 +0100] information/NotificationComponent: 'notification' started.

An "NotificationComponent ... stopped" event, which should be thrown during shutdown as I understand, wasn't also logged in that case.

I assume something goes wrong during shutdown, which leads to a trigger that sends notifications out.

Al2Klimov · 2019-01-29T14:12:41Z

During reload the daemon starts a copy of itself and then shuts down. There's a small period of time during which any of them could receive the request. At the moment log messages about API requests may be lost, but #6827 would fix that.

SimonHoenscheid · 2019-01-30T12:51:49Z

@SimonHoenscheid Do you have some debug logs and information about your setup?

I can sit down next week and put everything together. regarding debug logs, I need to have a look of the impacts this will have on our system.

marcofl · 2019-02-11T08:46:43Z

During a reload the notification script is sometimes triggered even though a downtime is set for a service.

We see the same issue:

At 03:08 there was a Icinga2 reload, but nothing was logged about a notification:

[2019-02-11 03:09:50 +0100] information/FileLogger: 'main-log' started.
[2019-02-11 03:09:56 +0100] information/ApiListener: 'api' started.
[2019-02-11 03:09:57 +0100] information/ExternalCommandListener: 'command' started.
[2019-02-11 03:09:57 +0100] information/GelfWriter: 'gelf' started.
[2019-02-11 03:09:57 +0100] information/InfluxdbWriter: 'influxdb' started.
[2019-02-11 03:09:57 +0100] information/CheckerComponent: 'checker' started.
[2019-02-11 03:09:57 +0100] information/NotificationComponent: 'notification' started.
[2019-02-11 03:09:57 +0100] information/DbConnection: 'ido-mysql' started.

what is even more strange:

graylog logged a notification at 02:04

This does not make any sense, right? when the service was critical, graylog logged the notification but it was not actually triggered. during the reload at 03:08 the / a notification was triggered but not logged to graylog?

Al2Klimov · 2019-03-07T17:31:42Z

@ekeih You could auto-trucate the debug log by a cron script not to run out of disk space.

MaBauMeBad · 2019-06-18T10:22:32Z

Same here. If any addtional Logs or Information are required, please feel free to ask.

Thanks

ekeih · 2019-06-18T10:33:54Z

@MaBauMeBad I think the main reason that there is no progress is that nobody provided debug logs. I currently do not have the resources to do so.
So if you can provide debug logs of this issue, this would probably help ;)

MarcusCaepio · 2019-08-26T15:03:12Z

Hi all,
just wanted to confirm this. Had it right now. Scenario:
Scheduled Downtimes of several Hosts, which are off. Edited the Icinga2 Config of other hosts, reloaded the config and got notifications of the downtimed hosts.

Cheers,
Marcus

dnsmichi · 2019-08-27T07:51:46Z

Please test this with 2.11 RC - https://icinga.com/2019/07/25/icinga-2-11-release-candidate/

dnsmichi · 2019-09-17T15:29:49Z

I consider this fixed with 2.11 from the many patches we've implemented in this region.

ekeih changed the title ~~Icinga2 sends notification during reload and does not log it~~ Icinga2 sends notification during reload (despite downtime) and does not log it Dec 19, 2018

dnsmichi added the area/notifications Notification events label Dec 20, 2018

Al2Klimov self-assigned this Jan 25, 2019

Al2Klimov assigned ekeih and unassigned Al2Klimov Jan 25, 2019

Al2Klimov added the needs feedback We'll only proceed once we hear from you again label Jan 25, 2019

Al2Klimov unassigned ekeih Jan 29, 2019

Al2Klimov removed the needs feedback We'll only proceed once we hear from you again label Jan 29, 2019

dnsmichi added the needs feedback We'll only proceed once we hear from you again label Feb 11, 2019

Al2Klimov assigned ekeih Mar 7, 2019

dnsmichi added the ref/NC label Apr 25, 2019

dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels Jun 19, 2019

dnsmichi unassigned ekeih Jun 19, 2019

dnsmichi added the needs feedback We'll only proceed once we hear from you again label Aug 27, 2019

dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Sep 17, 2019

dnsmichi closed this as completed Sep 17, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Icinga2 sends notification during reload (despite downtime) and does not log it #6846

Icinga2 sends notification during reload (despite downtime) and does not log it #6846

ekeih commented Dec 19, 2018

widhalmt commented Jan 23, 2019

ekeih commented Jan 23, 2019

widhalmt commented Jan 23, 2019

ekeih commented Jan 23, 2019

Al2Klimov commented Jan 25, 2019

SimonHoenscheid commented Jan 28, 2019

lippserd commented Jan 28, 2019

ekeih commented Jan 28, 2019 •

edited

Loading

jschanz commented Jan 29, 2019 •

edited

Loading

jschanz commented Jan 29, 2019 •

edited

Loading

Al2Klimov commented Jan 29, 2019

SimonHoenscheid commented Jan 30, 2019

marcofl commented Feb 11, 2019 •

edited

Loading

Al2Klimov commented Mar 7, 2019

MaBauMeBad commented Jun 18, 2019

ekeih commented Jun 18, 2019 •

edited

Loading

MarcusCaepio commented Aug 26, 2019

dnsmichi commented Aug 27, 2019

dnsmichi commented Sep 17, 2019

Icinga2 sends notification during reload (despite downtime) and does not log it #6846

Icinga2 sends notification during reload (despite downtime) and does not log it #6846

Comments

ekeih commented Dec 19, 2018

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

widhalmt commented Jan 23, 2019

ekeih commented Jan 23, 2019

widhalmt commented Jan 23, 2019

ekeih commented Jan 23, 2019

Al2Klimov commented Jan 25, 2019

SimonHoenscheid commented Jan 28, 2019

lippserd commented Jan 28, 2019

ekeih commented Jan 28, 2019 • edited Loading

jschanz commented Jan 29, 2019 • edited Loading

jschanz commented Jan 29, 2019 • edited Loading

Al2Klimov commented Jan 29, 2019

SimonHoenscheid commented Jan 30, 2019

marcofl commented Feb 11, 2019 • edited Loading

Al2Klimov commented Mar 7, 2019

MaBauMeBad commented Jun 18, 2019

ekeih commented Jun 18, 2019 • edited Loading

MarcusCaepio commented Aug 26, 2019

dnsmichi commented Aug 27, 2019

dnsmichi commented Sep 17, 2019

ekeih commented Jan 28, 2019 •

edited

Loading

jschanz commented Jan 29, 2019 •

edited

Loading

jschanz commented Jan 29, 2019 •

edited

Loading

marcofl commented Feb 11, 2019 •

edited

Loading

ekeih commented Jun 18, 2019 •

edited

Loading