Random crashes/loops of satellites in HA setup when new config is deployed #8060

Closed
Mikesch-mp opened this issue Jun 16, 2020 · 3 comments
Labels
area/configuration DSL, parser, compiler, error handling area/distributed Distributed monitoring (master, satellites, clients) core/crash Shouldn't happen, requires attention

@Mikesch-mp
Contributor

Describe the bug

When deploying a new configuration to all satellites via systemctl reload icinga2, one or both satellites in a zone go into a reload/config stage check loop because Boost throws an exception in the event handler (see the trace below). I suspect this is caused by a time difference of a few milliseconds between the masters and the satellites.

[2020-06-16 09:59:56 +0200] critical/ThreadPool: Exception thrown in event handler:
Error: boost::filesystem::remove: Directory not empty: "/var/lib/icinga2/api/zones/global-templates/_etc/services"


        (0) libboost_filesystem.so.1.69.0: <unknown function> (+0x8ebb) [0x7f317236bebb]
        (1) libboost_filesystem.so.1.69.0: <unknown function> (+0xb90c) [0x7f317236e90c]
        (2) libboost_filesystem.so.1.69.0: <unknown function> (+0xbaee) [0x7f317236eaee]
        (3) libboost_filesystem.so.1.69.0: <unknown function> (+0xbaee) [0x7f317236eaee]
        (4) libboost_filesystem.so.1.69.0: <unknown function> (+0xbaee) [0x7f317236eaee]
        (5) libboost_filesystem.so.1.69.0: boost::filesystem::detail::remove_all(boost::filesystem::path const&, boost::system::error_code*) (+0xaf) [0x7f317236efef]
        (6) icinga2: icinga::Utility::RemoveDirRecursive(icinga::String const&) (+0x7e) [0x764f7e]
        (7) icinga2: icinga::ApiListener::TryActivateZonesStageCallback(icinga::ProcessResult const&, std::vector<icinga::String, std::allocator<icinga::String> > const&) (+0x4ca) [0x9486aa]
        (8) /usr/lib64/icinga2/sbin/icinga2() [0x8fefd3]
        (9) icinga2: boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::function<void ()> >(std::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, std::allocator<void>*, boost::system::error_code const&, unsigned long) (+0x106) [0xbf7776]
        (10) /usr/lib64/icinga2/sbin/icinga2() [0x630e91]
        (11) /usr/lib64/icinga2/sbin/icinga2() [0x6311e2]
        (12) icinga2: boost_asio_detail_posix_thread_function (+0xf) [0x8211bf]
        (13) libpthread.so.0: <unknown function> (+0x7e65) [0x7f317024ae65]
        (14) libc.so.6: clone (+0x6d) [0x7f316ff7388d]
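
For readers trying to understand the error itself: a recursive delete can fail with "Directory not empty" when something writes into the directory after its contents were listed but before the directory itself is removed. Whether that is what happens inside Icinga 2 here is an assumption and is not confirmed in this report. The Python sketch below only reproduces that general failure mode. `remove_dir_racy` and its `on_after_listing` hook are hypothetical names invented for the illustration, and the file names are placeholders.

```python
import errno
import os
import shutil
import tempfile


def remove_dir_racy(path, on_after_listing=None):
    """Delete every file seen in one listing of `path`, then remove `path`.

    `on_after_listing` is a hypothetical hook that stands in for a
    concurrent writer running right after the directory was listed.
    """
    entries = os.listdir(path)          # snapshot of the directory contents
    if on_after_listing is not None:
        on_after_listing()              # simulated concurrent write
    for name in entries:
        os.remove(os.path.join(path, name))
    os.rmdir(path)                      # raises OSError if the snapshot was stale


def demo():
    """Return the errno raised when a file appears mid-removal, else None."""
    root = tempfile.mkdtemp()
    services = os.path.join(root, "services")
    os.mkdir(services)
    with open(os.path.join(services, "icmp.conf"), "w") as f:
        f.write("# placeholder\n")

    def late_writer():
        # Assumed stand-in for a second config sync touching the same directory.
        with open(os.path.join(services, "smtp.conf"), "w") as f:
            f.write("# placeholder\n")

    try:
        remove_dir_racy(services, late_writer)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        shutil.rmtree(root, ignore_errors=True)  # clean up the temp tree
```

Calling `demo()` returns the error number of the failed `os.rmdir`, which on Linux corresponds to "Directory not empty". Without the `late_writer` hook the same function removes the directory cleanly.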

To Reproduce

I can't reproduce it reliably; it only happens some of the time.

Expected behavior

Icinga should not crash or get stuck in a loop while reloading configuration on satellites. Also, config changes should not be detected by comparing timestamps; comparing a hash of the configuration would be better.
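
To illustrate the hash-based comparison suggested above, here is a minimal Python sketch. It is not how Icinga 2 decides whether a synced configuration changed. `config_digest` and `config_changed` are hypothetical helpers, and the choice of SHA-256 over relative paths plus file contents is an assumption made for the example.

```python
import hashlib
import os


def config_digest(root):
    """Return a SHA-256 hex digest over relative paths and file contents.

    File modification times are deliberately ignored, so two trees with
    identical content but different timestamps produce the same digest.
    """
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                content_hash = hashlib.sha256(f.read()).digest()
            # Hash the path and the fixed-size content hash as separate
            # fields so that file boundaries cannot be confused.
            digest.update(hashlib.sha256(rel.encode("utf-8")).digest())
            digest.update(content_hash)
    return digest.hexdigest()


def config_changed(staged_root, active_root):
    """True if the two configuration trees differ in paths or content."""
    return config_digest(staged_root) != config_digest(active_root)
```

With a check like this, a clock offset between endpoints would not by itself make an unchanged configuration look new, because only paths and content feed into the digest.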

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version): 2.11.3-1
  • Operating System and version: CentOS 7.7.1908
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api checker mainlog
  • Config validation (icinga2 daemon -C):
[2020-06-16 10:35:31 +0200] information/cli: Icinga application loader (version: 2.11.3-1)
[2020-06-16 10:35:31 +0200] information/cli: Loading configuration file(s).
[2020-06-16 10:35:31 +0200] information/ConfigItem: Committing config item(s).
[2020-06-16 10:35:31 +0200] information/ApiListener: My API identity: vm32283.psmanaged.com
[2020-06-16 10:35:31 +0200] warning/ApplyRule: Apply rule '' (in /var/lib/icinga2/api/zones/global-templates/_etc/services/icmp.conf: 1:0-1:58) for type 'Service' does not match anywhere!
[2020-06-16 10:35:31 +0200] warning/ApplyRule: Apply rule 'SMTP Status' (in /var/lib/icinga2/api/zones/global-templates/_etc/services/smtp.conf: 1:0-1:26) for type 'Service' does not match anywhere!
[2020-06-16 10:35:31 +0200] warning/ApplyRule: Apply rule 'Status Icinga Cluster' (in /var/lib/icinga2/api/zones/global-templates/_etc/services/status_icinga_cluster.conf: 1:0-1:36) for type 'Service' does not match anywhere!
[2020-06-16 10:35:31 +0200] warning/ApplyRule: Apply rule 'Status IDO' (in /var/lib/icinga2/api/zones/global-templates/_etc/services/status_ido.conf: 1:0-1:25) for type 'Service' does not match anywhere!
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 3 HostGroups.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 5 Hosts.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 53 Dependencies.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 6 Zones.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 6 Endpoints.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 1 ApiUser.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 210 CheckCommands.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 5 TimePeriods.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 1 User.
[2020-06-16 10:35:31 +0200] information/ConfigItem: Instantiated 79 Services.
[2020-06-16 10:35:31 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2020-06-16 10:35:31 +0200] information/cli: Finished validating the configuration file(s).
@Mikesch-mp
Contributor Author

Also happens with

icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.4-1)
[2020-06-29 10:49:02 +0200] critical/ThreadPool: Exception thrown in event handler:
Error: boost::filesystem::remove: Directory not empty: "/var/lib/icinga2/api/zones/global-templates/_etc/services"


        (0) libboost_filesystem.so.1.69.0: <unknown function> (+0x8ebb) [0x7fa9a4a6aebb]
        (1) libboost_filesystem.so.1.69.0: <unknown function> (+0xb90c) [0x7fa9a4a6d90c]
        (2) libboost_filesystem.so.1.69.0: <unknown function> (+0xbaee) [0x7fa9a4a6daee]
        (3) libboost_filesystem.so.1.69.0: <unknown function> (+0xbaee) [0x7fa9a4a6daee]
        (4) libboost_filesystem.so.1.69.0: <unknown function> (+0xbaee) [0x7fa9a4a6daee]
        (5) libboost_filesystem.so.1.69.0: boost::filesystem::detail::remove_all(boost::filesystem::path const&, boost::system::error_code*) (+0xaf) [0x7fa9a4a6dfef]
        (6) icinga2: icinga::Utility::RemoveDirRecursive(icinga::String const&) (+0x7e) [0x764e3e]
        (7) icinga2: icinga::ApiListener::TryActivateZonesStageCallback(icinga::ProcessResult const&, std::vector<icinga::String, std::allocator<icinga::String> > const&) (+0x4ca) [0x9156ba]
        (8) /usr/lib64/icinga2/sbin/icinga2() [0x8ff113]
        (9) icinga2: boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::function<void ()> >(std::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, std::allocator<void>*, boost::system::error_code const&, unsigned long) (+0x106) [0xbf7bf6]
        (10) /usr/lib64/icinga2/sbin/icinga2() [0x630df1]
        (11) /usr/lib64/icinga2/sbin/icinga2() [0x631142]
        (12) icinga2: boost_asio_detail_posix_thread_function (+0xf) [0x82104f]
        (13) libpthread.so.0: <unknown function> (+0x7ea5) [0x7fa9a2949ea5]
        (14) libc.so.6: clone (+0x6d) [0x7fa9a26728dd]


@Decstasy

I'm facing the very same problem. I discovered that a slight time difference between satellites can cause this kind of exception.

The problem went undetected, although I'm monitoring my zones. Not a single check indicated a problem, and the last and next scheduled checks were hours in the past. After triggering a reload they started to work again, and checking the time offset showed a difference of a single second.

# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.11.3-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Ubuntu
  Platform version: 18.04.4 LTS (Bionic Beaver)
  Kernel: Linux
  Kernel version: 4.15.0-106-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 8.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0
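
The symptom described in this comment, checks whose next scheduled run lies hours in the past, is something an external watchdog could look for. The following Python sketch is only an illustration of that idea. The `overdue_checks` helper, the dictionary layout with `name` and `next_check` (epoch seconds), and the 300-second grace period are all assumptions, not an Icinga 2 interface.

```python
import time


def overdue_checks(checkables, now=None, grace_seconds=300.0):
    """Return the names of objects whose next check is overdue.

    `checkables` is an assumed list of dicts with a `name` and a
    `next_check` timestamp in epoch seconds. An object counts as overdue
    when its `next_check` lies more than `grace_seconds` in the past.
    """
    if now is None:
        now = time.time()
    return [
        item["name"]
        for item in checkables
        if now - item["next_check"] > grace_seconds
    ]
```

Run from a host outside the affected zone, a check like this could raise an alert even when the satellites themselves have stopped scheduling checks.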

@N-o-X
Contributor

N-o-X commented Jul 29, 2020

This bug should be fixed with #8093. Please reopen if this still occurs after the 2.11.5 release.

@N-o-X N-o-X closed this as completed Jul 29, 2020
@N-o-X N-o-X added this to the 2.11.5 milestone Jul 29, 2020
@N-o-X N-o-X added area/configuration DSL, parser, compiler, error handling area/distributed Distributed monitoring (master, satellites, clients) core/crash Shouldn't happen, requires attention labels Jul 29, 2020