From previous testing we found that when cerberus_automaticqueuecancellationjob_timeout elapses while a lot of tests are still on the queue, Cerberus will try to execute everything on the queue at once.
We may have pinned the problem down to the method updateToCancelledOldRecord in TestCaseExecutionQueueDAO.java.
It's calling the following SQL:
UPDATE testcaseexecutionqueue
SET `State` = 'CANCELLED', `RequestDate` = now(), `DateModif` = now(), `comment` = 'Cancelled by automatic job.'
WHERE TO_SECONDS(now()) - TO_SECONDS(DateCreated) > 300 -- cerberus_automaticqueuecancellationjob_timeout value
AND `State` IN ('WAITING','STARTING','EXECUTING')
;
If we have 72 tests running (the ROBOTHOST invariant limit) and their entries are cancelled in testcaseexecutionqueue, those tests are not cancelled in testcaseexecution and continue executing. Because the job just cancelled the queue entries with status EXECUTING, Cerberus tries to launch another 72 tests (on top of the 72 that are still running), so we now have 144 tests running. This quickly gets out of control: we end up with hundreds of tests executing at the same time, which breaks the Selenium hub. We believe this happens because Cerberus uses the value of DateCreated to calculate the timeout, which means it cancels queue entries even when the corresponding executions are still running.
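One possible direction, sketched here purely as an illustration (this is not an actual Cerberus patch, and it assumes DateModif is refreshed whenever a queue entry changes state): base the timeout on the last modification date rather than on DateCreated, so an entry that recently moved to EXECUTING is not swept up just because it was queued a long time ago:

```sql
-- Hypothetical variant of the cancellation query: measure age from
-- DateModif instead of DateCreated, so a test that started running
-- recently is not cancelled because of time spent waiting in the queue.
UPDATE testcaseexecutionqueue
SET `State` = 'CANCELLED', `RequestDate` = now(), `DateModif` = now(), `comment` = 'Cancelled by automatic job.'
WHERE TO_SECONDS(now()) - TO_SECONDS(DateModif) > 300 -- cerberus_automaticqueuecancellationjob_timeout value
  AND `State` IN ('WAITING','STARTING','EXECUTING')
;
```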
To replicate this, we lowered the cerberus_automaticqueuecancellationjob_timeout parameter to 300 seconds (5 minutes) and launched a campaign with roughly 1000 tests.
ittaLaRedoute changed the title from "Queue cancellation job causes executions to ignore limits" to "[LR] Queue cancellation job causes executions to ignore limits" on Aug 2, 2023.
Hello @ittaLaRedoute
The cerberus_automaticqueuecancellationjob_timeout parameter is used to clean up old queue entries that could be stuck in EXECUTING status when in fact the execution either finished or crashed without having had time to move the queue entry to the DONE status.
That parameter should always be higher than the duration of the longest test that you ever run.
If cerberus_automaticqueuecancellationjob_timeout was at 300 when you hit the issue, that means all 72 of your running test cases last more than 300 seconds.
In that case the best option is to increase the parameter to the duration of the longest test, plus a few minutes as a safety margin.
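If you prefer to adjust it directly in the database rather than through the administration screen, here is a sketch that assumes the standard Cerberus parameter table (the table name `parameter` and the columns `param` and `value` are assumptions here; verify them against your schema first):

```sql
-- Assumed schema: Cerberus global settings live in a `parameter` table.
-- 7200 is only an example: pick your longest test duration in seconds,
-- plus a safety margin of a few minutes.
UPDATE parameter
SET `value` = '7200'
WHERE `param` = 'cerberus_automaticqueuecancellationjob_timeout';
```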
BTW, the default value for that parameter is 3600 s (1 hour). I don't know why you lowered it to 300?
That is for sure very short.
A queue entry that stays in EXECUTING for more than 1 hour is definitely not normal; in that case the test is certainly no longer running on the hub side. The job moves the entry to CANCELLED in order to free the slot and allow more executions to be submitted.
Everything that you describe is the expected behaviour. The only thing that is not normal is the value of that parameter, which is far too low in your context.
Behaviour replicated on Cerberus version:
4.17-SNAPSHOT-1747 Build 20230129-153723
4.17-SNAPSHOT-1749 Build 20230731-210709