
CLOUDSTACK-9886 : After restarting cloudstack-management , It takes time to connect hosts #2054

Merged
merged 1 commit into apache:master on Sep 14, 2017

Conversation

@mrunalinikankariya
Contributor

@mrunalinikankariya commented Apr 20, 2017

Problem Statement


Hosts take time to reconnect after a restart of the management server. The delay grows in proportion to the value of ping.interval * ping.timeout.

Root Cause


During the processing of the management server's Down event, each host's last ping time gets reset to a hardcoded 600 seconds behind the current time. Whenever the admin configures ping.interval and ping.timeout such that ping.interval * ping.timeout exceeds 600, the hosts miss the first few ping cycles, creating this problem. For example, with ping.interval = 60 and ping.timeout = 20 the timeout window is 1200 seconds, so a host reset to only 600 seconds behind the current time still waits on the order of another 600 seconds before it is picked up.

Solution


The hardcoded value of 600 is replaced with the configured ping.interval * ping.timeout value.

@SudharmaJain
Contributor

@mrunalinikankariya Your commit is missing the CloudStack bug ID. Please update the commit.

@mrunalinikankariya changed the title from "After restarting cloudstack-management , It takes time to connect hosts" to "CLOUDSTACK-9886 : After restarting cloudstack-management , It takes time to connect hosts" on Apr 20, 2017
@koushik-das
Contributor

@mrunalinikankariya The problem you are describing shouldn't be related to ping.interval and ping.timeout. There may be something else that is causing the problem.

@cloudmonger

ACS CI BVT Run

Summary:
Build Number 815
Hypervisor xenserver
NetworkType Advanced
Passed=108
Failed=5
Skipped=12

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0

Failed tests:

  • test_volumes.py

  • test_06_download_detached_volume Failed

  • test_routers_network_ops.py

  • test_01_isolate_network_FW_PF_default_routes_egress_true Failing since 10 runs

  • test_02_isolate_network_FW_PF_default_routes_egress_false Failing since 10 runs

  • test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failing since 10 runs

  • test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 10 runs

Skipped tests:
test_vm_nic_adapter_vmxnet3
test_01_verify_libvirt
test_02_verify_libvirt_after_restart
test_03_verify_libvirt_attach_disk
test_04_verify_guest_lspci
test_05_change_vm_ostype_restart
test_06_verify_guest_lspci_again
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suites:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_vm_snapshots.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py

@SudharmaJain
Contributor

SudharmaJain commented Jun 20, 2017

@mrunalinikankariya Possibly the following change in the updateState method of HostDaoImpl may fix this issue.

diff --git a/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java b/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java
index a74b908..9039355 100644
--- a/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java
+++ b/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java
@@ -979,7 +979,9 @@ public class HostDaoImpl extends GenericDaoBase<HostVO, Long> implements HostDao
             }
         }
         if (event.equals(Event.ManagementServerDown)) {
-            ub.set(host, _pingTimeAttr, ((System.currentTimeMillis() >> 10) - (10 * 60)));
+            float pingTimeout = NumbersUtil.parseFloat(_configDao.getValue("ping.timeout"), 2.5f);
+            int pingInterval = NumbersUtil.parseInt(_configDao.getValue("ping.interval"), 60);
+            ub.set(host, _pingTimeAttr, ((System.currentTimeMillis() >> 10) - (long)(pingTimeout * pingInterval)));
         }
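
For context, a minimal standalone sketch of the before/after computation (assumptions: CloudStack's defaults of ping.timeout = 2.5 and ping.interval = 60 as used in the diff; the >> 10 shift in the existing code approximates a milliseconds-to-seconds conversion by dividing by 1024):

// Standalone illustration, not CloudStack code: compares the old hardcoded
// cutoff with the proposed configurable one. Defaults mirror the diff above.
public class PingCutoffSketch {
    public static void main(String[] args) {
        float pingTimeout = 2.5f; // ping.timeout (multiplier, default 2.5)
        int pingInterval = 60;    // ping.interval (seconds, default 60)

        // ">> 10" divides by 1024, a cheap approximation of ms -> seconds
        long nowSecs = System.currentTimeMillis() >> 10;

        long oldPingTime = nowSecs - (10 * 60);                           // always 600 s behind
        long newPingTime = nowSecs - (long) (pingTimeout * pingInterval); // 150 s behind with defaults

        System.out.println("old reset ping time: " + oldPingTime);
        System.out.println("new reset ping time: " + newPingTime);
    }
}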

@SudharmaJain
Contributor

@mrunalinikankariya Also do look at the build failures.

sshClient = SshClient(
    self.mgtSvrDetails["mgtSvrIp"],
    22,
    "user",  # self.mgtSvrDetails["user"],
Contributor

@mrunalinikankariya username shouldn't be hardcoded.

@yvsubhash

LGTM for code

@vedulasantosh
Contributor

Test LGTM

Before Fix:

It took about 12 minutes to connect the ESXi host after the MS was restarted.

2017-08-29 15:55:57,441 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Running system integrity checker org.apache.cloudstack.utils.identity.ManagementServerNode@56c9aa12
2017-08-29 15:55:57,441 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Configuring CloudStack Components
2017-08-29 16:08:30,772 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-e4b9546b) (logid:7831f281) Simulating start for resource 10.112.3.36 id 1
2017-08-29 16:08:30,772 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-e4b9546b) (logid:7831f281) Creating agent for host 1
2017-08-29 16:08:30,951 DEBUG [c.c.s.StatsCollector] (StatsCollector-4:ctx-200c10b0) (logid:88d478b5) HostOutOfBandManagementStatsCollector is running...

After Fix:

It took about 38 seconds to connect the ESXi host after the MS was restarted.

2017-08-29 17:42:14,432 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Running system integrity checker org.apache.cloudstack.utils.identity.ManagementServerNode@4d1847de
2017-08-29 17:42:14,433 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Configuring CloudStack Components
2017-08-29 17:42:52,302 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2410f7cb) (logid:b06510de) Simulating start for resource 10.112.3.36 id 1
2017-08-29 17:42:52,302 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2410f7cb) (logid:b06510de) Creating agent for host 1
2017-08-29 17:42:53,748 DEBUG [c.c.s.StatsCollector] (StatsCollector-1:ctx-5c38226e) (logid:203d8311) HostStatsCollector is running...

@cloudmonger

ACS CI BVT Run

Summary:
Build Number 1169
Hypervisor xenserver
NetworkType Advanced
Passed=115
Failed=5
Skipped=40

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0

Failed tests:

  • test_volumes.py

  • test_06_download_detached_volume Failing since 2 runs

  • test_routers_network_ops.py

  • test_01_isolate_network_FW_PF_default_routes_egress_true Failing since 33 runs

  • test_02_isolate_network_FW_PF_default_routes_egress_false Failing since 160 runs

  • test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failing since 155 runs

  • test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 155 runs

Skipped tests:
test_vm_nic_adapter_vmxnet3
test_01_verify_libvirt
test_02_verify_libvirt_after_restart
test_03_verify_libvirt_attach_disk
test_04_verify_guest_lspci
test_05_change_vm_ostype_restart
test_06_verify_guest_lspci_again
test_disable_oobm_ha_state_ineligible
test_ha_kvm_host_degraded
test_ha_kvm_host_fencing
test_ha_kvm_host_recovering
test_hostha_configure_default_driver
test_hostha_enable_ha_when_host_disabled
test_hostha_enable_ha_when_host_disconected
test_hostha_enable_ha_when_host_in_maintenance
test_remove_ha_provider_not_possible
test_configure_ha_provider_invalid
test_configure_ha_provider_valid
test_ha_configure_enabledisable_across_clusterzones
test_ha_disable_feature_invalid
test_ha_enable_feature_invalid
test_ha_list_providers
test_ha_multiple_mgmt_server_ownership
test_ha_verify_fsm_available
test_ha_verify_fsm_degraded
test_ha_verify_fsm_fenced
test_ha_verify_fsm_recovering
test_hostha_configure_default_driver
test_hostha_configure_invalid_provider
test_hostha_disable_feature_valid
test_hostha_enable_feature_valid
test_hostha_enable_feature_without_setting_provider
test_list_ha_for_host
test_list_ha_for_host_invalid
test_list_ha_for_host_valid
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suites:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_vm_snapshots.py
test_over_provisioning.py
test_global_settings.py
test_router_dnsservice.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_outofbandmanagement_nestedplugin.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py

@mrunalinikankariya
Contributor Author

tag:This is Ready to Merge

@harikrishna-patnala merged commit b655f9b into apache:master on Sep 14, 2017
@rohityadavcloud
Member

@harikrishna-patnala why was this PR merged without tests? This affects clustering of management servers, and the code/commit was pushed after the last BVT test results/reports, in which case a fresh test run would have been ideal. In future, you or anyone may ping @borisstoyanov or me and we can help with reviewing and running tests. Thanks.

@@ -144,6 +146,8 @@
     protected HostTransferMapDao _hostTransferDao;
     @Inject
     protected ClusterDao _clusterDao;
+    @Inject
+    private ConfigurationDao _configDao;
Member

@mrunalinikankariya please re-submit a PR that uses a ConfigKey-based approach rather than direct reads/manipulation via configDao.
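
A hedged sketch of what a ConfigKey-based read could look like (the names, category, and description strings here are illustrative assumptions, not the actual follow-up change; ConfigKey resolves values through the config framework instead of raw ConfigurationDao lookups):

// Illustrative only: the two settings exposed as ConfigKeys instead of
// _configDao.getValue(...) reads. All names/descriptions are assumptions.
import org.apache.cloudstack.framework.config.ConfigKey;

public class HostDaoImplSketch {
    static final ConfigKey<Float> PingTimeout = new ConfigKey<>(Float.class, "ping.timeout", "Advanced",
            "2.5", "Multiplier applied to ping.interval before a host times out", true);
    static final ConfigKey<Integer> PingInterval = new ConfigKey<>(Integer.class, "ping.interval", "Advanced",
            "60", "Interval in seconds between host pings", true);

    long resetPingTime() {
        // value() returns the current (possibly dynamically updated) setting
        return (System.currentTimeMillis() >> 10) - (long) (PingTimeout.value() * PingInterval.value());
    }
}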

Contributor Author

Ok, will create another PR with the ConfigKey-based approach

Contributor Author

@rhtyd,

Created new PR 2292 for the ConfigKey-based approach.

@@ -991,7 +995,9 @@ public boolean updateState(Status oldStatus, Event event, Status newStatus, Host
             }
         }
         if (event.equals(Event.ManagementServerDown)) {
-            ub.set(host, _pingTimeAttr, ((System.currentTimeMillis() >> 10) - (10 * 60)));
+            Float pingTimeout = NumbersUtil.parseFloat(_configDao.getValue("ping.timeout"), 2.5f);
Member

@mrunalinikankariya please move towards using ConfigKeys; refactorings are requested.

_multiprocess_shared_ = False


class TestHostHA(cloudstackTestCase):
Member

@mrunalinikankariya @harikrishna-patnala advise if this test can run rather quickly -- if so, let's include it in the smoke tests folder instead? Also, the PR has no test results/reports from this new test.

Contributor Author

This test waits for the management server to come up and hence takes a long time to run, so it cannot be moved to the smoke tests.

Member

@mrunalinikankariya the ping timeout/interval values are configurable to speed up the wait time; also, in the code you've put wait_until(10,10 -- if this already passes, the test should run in likely 100 seconds or less, which is less than 10 minutes and qualifies the test to be put in the smoke tests. I also see an outstanding comment from Koushik which was not answered.

Member

@mrunalinikankariya please send a new PR with requested changes

Contributor Author

wait_until(10,10 here we are waiting for the host to be UP, and

time.sleep(self.services["sleep"]) here we wait for 60 seconds after the server restart.

As it involves a lot of waiting, I avoided adding it to the smoke tests.

@rohityadavcloud
Member

I reviewed the code; as @koushik-das mentioned, further investigation may be needed into why hosts are not accepted immediately. The solution might work; however, we may need further analysis. @mrunalinikankariya @SudharmaJain @yvsubhash @vedulasantosh the JIRA ticket does not mention it, so can you comment on which hypervisors/hosts you saw this issue with? It seems only direct agents such as XenServer and VMware resources may be affected, rather than indirect agents such as KVM (agents).

@yvsubhash

@rhtyd the concern expressed by @koushik-das is already addressed in the change. Direct agents are affected by this. During the processing of the management server's Down event, the last ping time gets reset to a hardcoded 600 seconds behind the current time. Whenever the admin configures ping.interval and ping.timeout such that ping.interval * ping.timeout exceeds 600, the hosts miss the first few ping cycles, creating this problem. So the hardcoded value is changed to make use of the ping.interval * ping.timeout value. Hope this addresses your concerns.
