-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLOUDSTACK-9886 : After restarting cloudstack-management , It takes time to connect hosts #2054
Conversation
@mrunalinikankariya Your commit is missing the cloudstack bug id. Please update the commit. |
@mrunalinikankariya The problem you are describing shouldn't be related to ping.interval and ping.timeout. There may be something else that is causing the problem. |
ACS CI BVT RunSumarry: Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0 Failed tests:
Skipped tests: Passed test suits: |
@mrunalinikankariya Possibly following change in updatestate method of HostDaoImpl may fix this issue.
|
@mrunalinikankariya Also do look at the build failures. |
sshClient = SshClient( | ||
self.mgtSvrDetails["mgtSvrIp"], | ||
22, | ||
"user", #self.mgtSvrDetails["user"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrunalinikankariya username shouldn't be hardcoded.
LGTM for code |
Test LGTM Before Fix: It took about 12 mins to connect ESXi host after MS was restarted. 2017-08-29 15:55:57,441 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Running system integrity checker org.apache.cloudstack.utils.identity.ManagementServerNode@56c9aa12 After Fix: It took about 38 sec to connect ESXi host after MS was restarted. 2017-08-29 17:42:14,432 INFO [o.a.c.s.l.CloudStackExtendedLifeCycle] (main:null) (logid:) Running system integrity checker org.apache.cloudstack.utils.identity.ManagementServerNode@4d1847de |
ACS CI BVT RunSumarry: Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0 Failed tests:
Skipped tests: Passed test suits: |
tag:This is Ready to Merge |
…ime to connect hosts
@harikrishna-patnala why was this PR merged without tests? This affects clustering of management servers and the code/commit was pushed after last BVT test results/reports, in which case a fresh test run would be been perfect. In future you or anyone may ping @borisstoyanov or I and we can help with reviewing and running tests. Thanks. |
@@ -144,6 +146,8 @@ | |||
protected HostTransferMapDao _hostTransferDao; | |||
@Inject | |||
protected ClusterDao _clusterDao; | |||
@Inject | |||
private ConfigurationDao _configDao; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrunalinikankariya please re-submit a PR that uses a ConfigKey based approach than use direct read/manipulation based on configDao
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will create another PR to include ConfigKey based approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created new PR2292 for config key based approach.
@@ -991,7 +995,9 @@ public boolean updateState(Status oldStatus, Event event, Status newStatus, Host | |||
} | |||
} | |||
if (event.equals(Event.ManagementServerDown)) { | |||
ub.set(host, _pingTimeAttr, ((System.currentTimeMillis() >> 10) - (10 * 60))); | |||
Float pingTimeout = NumbersUtil.parseFloat(_configDao.getValue("ping.timeout"), 2.5f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrunalinikankariya please move towards using ConfigKeys, refactorings are requested.
_multiprocess_shared_ = False | ||
|
||
|
||
class TestHostHA(cloudstackTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrunalinikankariya @harikrishna-patnala advise if this test can run rather quickly -- if so let's include them in smoke tests folder instead? Also, the PR has no test results/reports from this new test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test wait for management-server to be up and hence it takes long time to run. So this cannot be move to smoke tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrunalinikankariya the ping timeout/intervals are configurable to speed up wait time, also in the code you've put wait_until(10,10
if this already passes then the test should run in likely 100seconds or less, that less than 10 minutes and qualifies the test to be put in smoke tests. I also see an outstanding comment from Koushik which was not answered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrunalinikankariya please send a new PR with requested changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @mrunalinikankariya ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait_until(10,10 here we are waiting for Host to be UP and
time.sleep(self.services["sleep"]) here we wait for 60 sec after server restart.
As it involve lot of wait so i avoid it add it to smoke
I reviewed the code, as @koushik-das mentioned further investigation may be needed to look into why hosts would not be accepted immediately. The solution might work, however, we may need further analysis. @mrunalinikankariya @SudharmaJain @yvsubhash @vedulasantosh JIRA ticket does not mention, can you comment which hypervisors/hosts you had seen this issue? It seems only direct agents such as xen, vmware resources may be affected, then indirect agents such as KVM (agents). |
@rhtyd the concern expressed by @koushik-das is already addressed in the change. Direct agents are affected with this. During the processing of Down event of management server, last ping time out gets reset. It is getting reset to a hardcoded 600 sec behind current time. when ever the admin configures ping.interval, ping.timeout values such that ping.intervalping.timeout value crosses 600, the hosts misses the first few ping cycles creating this problem. So the hardcoded value is changed to make use of ping.intervalping.timeout value. Hope this addresses your concerns |
Problem Statement
Hosts take time to reconnect after the restart of the management server. Time goes up proportional to the value of ping.interval*ping.timeout
Root Cause
During the processing of Down event of management server, last ping time out gets reset. It is getting reset to a hardcoded 600 sec behind current time. when ever the admin configures ping.interval, ping.timeout values such that ping.interval*ping.timeout value crosses 600, the hosts misses the first few ping cycles creating this problem.
Solution
The hardcoded value of 600 is changed to make use of ping.interval*ping.timeout value