
CLOUDSTACK-9886 : After restarting cloudstack-management , It takes time to connect hosts #2054

Merged
merged 1 commit into apache:master on Sep 14, 2017

Conversation

@mrunalinikankariya
Contributor

@mrunalinikankariya commented Apr 20, 2017

Problem Statement


Hosts take time to reconnect after a restart of the management server. The delay grows in proportion to the value of ping.interval * ping.timeout.

Root Cause


During the processing of the management server's Down event, each host's last ping time gets reset to a hardcoded 600 seconds behind the current time. Whenever the admin configures ping.interval and ping.timeout such that ping.interval * ping.timeout exceeds 600, the hosts miss the first few ping cycles, creating this problem. For example, with ping.interval = 60 and ping.timeout = 20 the timeout window is 1200 seconds, so a host reset to only 600 seconds behind the current time still waits on the order of another 600 seconds before it is picked up.

Solution


The hardcoded value of 600 is replaced with the configured ping.interval * ping.timeout value.

@SudharmaJain
Contributor

@mrunalinikankariya Your commit is missing the CloudStack bug ID. Please update the commit.

@mrunalinikankariya changed the title from "After restarting cloudstack-management , It takes time to connect hosts" to "CLOUDSTACK-9886 : After restarting cloudstack-management , It takes time to connect hosts" on Apr 20, 2017
@koushik-das
Contributor

@mrunalinikankariya The problem you are describing shouldn't be related to ping.interval and ping.timeout. There may be something else that is causing the problem.

@cloudmonger

ACS CI BVT Run

Summary:
Build Number 815
Hypervisor xenserver
NetworkType Advanced
Passed=108
Failed=5
Skipped=12

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0

Failed tests:

  • test_volumes.py

  • test_06_download_detached_volume Failed

  • test_routers_network_ops.py

  • test_01_isolate_network_FW_PF_default_routes_egress_true Failing since 10 runs

  • test_02_isolate_network_FW_PF_default_routes_egress_false Failing since 10 runs

  • test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failing since 10 runs

  • test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 10 runs

Skipped tests:
test_vm_nic_adapter_vmxnet3
test_01_verify_libvirt
test_02_verify_libvirt_after_restart
test_03_verify_libvirt_attach_disk
test_04_verify_guest_lspci
test_05_change_vm_ostype_restart
test_06_verify_guest_lspci_again
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suites:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_vm_snapshots.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py

@SudharmaJain
Contributor

SudharmaJain commented Jun 20, 2017

@mrunalinikankariya Possibly the following change in the updateState method of HostDaoImpl may fix this issue.

diff --git a/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java b/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java
index a74b908..9039355 100644
--- a/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java
+++ b/engine/schema/src/com/cloud/host/dao/HostDaoImpl.java
@@ -979,7 +979,9 @@ public class HostDaoImpl extends GenericDaoBase<HostVO, Long> implements HostDao
             }
         }
         if (event.equals(Event.ManagementServerDown)) {
-            ub.set(host, _pingTimeAttr, ((System.currentTimeMillis() >> 10) - (10 * 60)));
+            float pingTimeout = NumbersUtil.parseFloat(_configDao.getValue("ping.timeout"), 2.5f);
+            int pingInterval = NumbersUtil.parseInt(_configDao.getValue("ping.interval"), 60);
+            ub.set(host, _pingTimeAttr, ((System.currentTimeMillis() >> 10) - (long)(pingTimeout * pingInterval)));
         }
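
For context, a minimal standalone sketch of the before/after computation (assumptions: CloudStack's defaults of ping.timeout = 2.5 and ping.interval = 60 as used in the diff; the >> 10 shift in the existing code approximates a milliseconds-to-seconds conversion by dividing by 1024):

// Standalone illustration, not CloudStack code: compares the old hardcoded
// cutoff with the proposed configurable one. Defaults mirror the diff above.
public class PingCutoffSketch {
    public static void main(String[] args) {
        float pingTimeout = 2.5f; // ping.timeout (multiplier, default 2.5)
        int pingInterval = 60;    // ping.interval (seconds, default 60)

        // ">> 10" divides by 1024, a cheap approximation of ms -> seconds
        long nowSecs = System.currentTimeMillis() >> 10;

        long oldPingTime = nowSecs - (10 * 60);                           // always 600 s behind
        long newPingTime = nowSecs - (long) (pingTimeout * pingInterval); // 150 s behind with defaults

        System.out.println("old reset ping time: " + oldPingTime);
        System.out.println("new reset ping time: " + newPingTime);
    }
}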

@SudharmaJain
Contributor

@mrunalinikankariya Also do look at the build failures.

sshClient = SshClient(
    self.mgtSvrDetails["mgtSvrIp"],
    22,
    "user",  # self.mgtSvrDetails["user"],
Contributor

@mrunalinikankariya username shouldn't be hardcoded.

@yvsubhash

LGTM for code

@vedulasantosh
Contributor

Test LGTM

Before Fix:

It took about 12 minutes to connect the ESXi host after the MS was restarted.

2017-08-29 15:55:57,441 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Running system integrity checker org.apache.cloudstack.utils.identity.ManagementServerNode@56c9aa12
2017-08-29 15:55:57,441 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Configuring CloudStack Components
2017-08-29 16:08:30,772 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-e4b9546b) (logid:7831f281) Simulating start for resource 10.112.3.36 id 1
2017-08-29 16:08:30,772 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-e4b9546b) (logid:7831f281) Creating agent for host 1
2017-08-29 16:08:30,951 DEBUG [c.c.s.StatsCollector] (StatsCollector-4:ctx-200c10b0) (logid:88d478b5) HostOutOfBandManagementStatsCollector is running...

After Fix:

It took about 38 seconds to connect the ESXi host after the MS was restarted.

2017-08-29 17:42:14,432 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Running system integrity checker org.apache.cloudstack.utils.identity.ManagementServerNode@4d1847de
2017-08-29 17:42:14,433 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Configuring CloudStack Components
2017-08-29 17:42:52,302 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2410f7cb) (logid:b06510de) Simulating start for resource 10.112.3.36 id 1
2017-08-29 17:42:52,302 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2410f7cb) (logid:b06510de) Creating agent for host 1
2017-08-29 17:42:53,748 DEBUG [c.c.s.StatsCollector] (StatsCollector-1:ctx-5c38226e) (logid:203d8311) HostStatsCollector is running...

@cloudmonger

ACS CI BVT Run

Summary:
Build Number 1169
Hypervisor xenserver
NetworkType Advanced
Passed=115
Failed=5
Skipped=40

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0

Failed tests:

  • test_volumes.py

  • test_06_download_detached_volume Failing since 2 runs

  • test_routers_network_ops.py

  • test_01_isolate_network_FW_PF_default_routes_egress_true Failing since 33 runs

  • test_02_isolate_network_FW_PF_default_routes_egress_false Failing since 160 runs

  • test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failing since 155 runs

  • test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 155 runs

Skipped tests:
test_vm_nic_adapter_vmxnet3
test_01_verify_libvirt
test_02_verify_libvirt_after_restart
test_03_verify_libvirt_attach_disk
test_04_verify_guest_lspci
test_05_change_vm_ostype_restart
test_06_verify_guest_lspci_again
test_disable_oobm_ha_state_ineligible
test_ha_kvm_host_degraded
test_ha_kvm_host_fencing
test_ha_kvm_host_recovering
test_hostha_configure_default_driver
test_hostha_enable_ha_when_host_disabled
test_hostha_enable_ha_when_host_disconected
test_hostha_enable_ha_when_host_in_maintenance
test_remove_ha_provider_not_possible
test_configure_ha_provider_invalid
test_configure_ha_provider_valid
test_ha_configure_enabledisable_across_clusterzones
test_ha_disable_feature_invalid
test_ha_enable_feature_invalid
test_ha_list_providers
test_ha_multiple_mgmt_server_ownership
test_ha_verify_fsm_available
test_ha_verify_fsm_degraded
test_ha_verify_fsm_fenced
test_ha_verify_fsm_recovering
test_hostha_configure_default_driver
test_hostha_configure_invalid_provider
test_hostha_disable_feature_valid
test_hostha_enable_feature_valid
test_hostha_enable_feature_without_setting_provider
test_list_ha_for_host
test_list_ha_for_host_invalid
test_list_ha_for_host_valid
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suites:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_vm_snapshots.py
test_over_provisioning.py
test_global_settings.py
test_router_dnsservice.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_outofbandmanagement_nestedplugin.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_disk_offerings.py

@mrunalinikankariya
Contributor Author

tag:This is Ready to Merge

@harikrishna-patnala merged commit b655f9b into apache:master on Sep 14, 2017
@rohityadavcloud
Member

@harikrishna-patnala why was this PR merged without tests? This affects clustering of management servers, and the code/commit was pushed after the last BVT test results/reports, in which case a fresh test run would have been ideal. In future, you or anyone may ping @borisstoyanov or me and we can help with reviewing and running tests. Thanks.

@@ -144,6 +146,8 @@
     protected HostTransferMapDao _hostTransferDao;
     @Inject
     protected ClusterDao _clusterDao;
+    @Inject
+    private ConfigurationDao _configDao;
Member

@mrunalinikankariya please re-submit a PR that uses a ConfigKey-based approach rather than direct reads/manipulation via configDao.
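
A hedged sketch of what a ConfigKey-based read could look like (the names, category, and description strings here are illustrative assumptions, not the actual follow-up change; ConfigKey resolves values through the config framework instead of raw ConfigurationDao lookups):

// Illustrative only: the two settings exposed as ConfigKeys instead of
// _configDao.getValue(...) reads. All names/descriptions are assumptions.
import org.apache.cloudstack.framework.config.ConfigKey;

public class HostDaoImplSketch {
    static final ConfigKey<Float> PingTimeout = new ConfigKey<>(Float.class, "ping.timeout", "Advanced",
            "2.5", "Multiplier applied to ping.interval before a host times out", true);
    static final ConfigKey<Integer> PingInterval = new ConfigKey<>(Integer.class, "ping.interval", "Advanced",
            "60", "Interval in seconds between host pings", true);

    long resetPingTime() {
        // value() returns the current (possibly dynamically updated) setting
        return (System.currentTimeMillis() >> 10) - (long) (PingTimeout.value() * PingInterval.value());
    }
}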

Contributor Author

Ok, will create another PR with the ConfigKey-based approach

Contributor Author

@rhtyd,

Created new PR 2292 for the ConfigKey-based approach.

@@ -991,7 +995,9 @@ public boolean updateState(Status oldStatus, Event event, Status newStatus, Host
             }
         }
         if (event.equals(Event.ManagementServerDown)) {
-            ub.set(host, _pingTimeAttr, ((System.currentTimeMillis() >> 10) - (10 * 60)));
+            Float pingTimeout = NumbersUtil.parseFloat(_configDao.getValue("ping.timeout"), 2.5f);
Member

@mrunalinikankariya please move towards using ConfigKeys; refactorings are requested.

_multiprocess_shared_ = False


class TestHostHA(cloudstackTestCase):
Member

@mrunalinikankariya @harikrishna-patnala advise if this test can run rather quickly -- if so, let's include it in the smoke tests folder instead? Also, the PR has no test results/reports from this new test.

Contributor Author

This test waits for the management server to come up and hence takes a long time to run, so it cannot be moved to the smoke tests.

Member

@mrunalinikankariya the ping timeout/interval values are configurable to speed up the wait time; also, in the code you've put wait_until(10,10 -- if this already passes, the test should run in likely 100 seconds or less, which is less than 10 minutes and qualifies the test to be put in the smoke tests. I also see an outstanding comment from Koushik which was not answered.

Member

@mrunalinikankariya please send a new PR with requested changes

Contributor Author

wait_until(10,10 here we are waiting for the host to be UP, and

time.sleep(self.services["sleep"]) here we wait for 60 seconds after the server restart.

As it involves a lot of waiting, I avoided adding it to the smoke tests.

@rohityadavcloud
Member

I reviewed the code; as @koushik-das mentioned, further investigation may be needed into why hosts are not accepted immediately. The solution might work; however, we may need further analysis. @mrunalinikankariya @SudharmaJain @yvsubhash @vedulasantosh the JIRA ticket does not mention it, so can you comment on which hypervisors/hosts you saw this issue with? It seems only direct agents such as XenServer and VMware resources may be affected, rather than indirect agents such as KVM (agents).

@yvsubhash

@rhtyd the concern expressed by @koushik-das is already addressed in the change. Direct agents are affected by this. During the processing of the management server's Down event, the last ping time gets reset to a hardcoded 600 seconds behind the current time. Whenever the admin configures ping.interval and ping.timeout such that ping.interval * ping.timeout exceeds 600, the hosts miss the first few ping cycles, creating this problem. So the hardcoded value is changed to make use of the ping.interval * ping.timeout value. Hope this addresses your concerns.
