From previous testing we found that when cerberus_automaticqueuecancellationjob_timeout elapses while a lot of tests are still on the queue, Cerberus will try to execute everything on the queue at once.
We may have pinned the problem down to the method updateToCancelledOldRecord in TestCaseExecutionQueueDAO.java.
It's calling the following SQL:
UPDATE testcaseexecutionqueue
SET `State` = 'CANCELLED', `RequestDate` = now(), `DateModif` = now(), `comment` = 'Cancelled by automatic job.'
WHERE TO_SECONDS(now()) - TO_SECONDS(DateCreated) > 300 -- cerberus_automaticqueuecancellationjob_timeout value
AND `State` IN ('WAITING','STARTING','EXECUTING')
;
If we have 72 tests running (the ROBOTHOST invariant limit) and their entries are cancelled in testcaseexecutionqueue, those tests are not cancelled in testcaseexecution and continue executing. Because the job just cancelled the queue entries with status EXECUTING, Cerberus tries to launch another 72 tests (on top of the 72 that are still running), so we now have 144 tests running. This quickly gets out of control: we end up with hundreds of tests executing at the same time, which breaks the Selenium hub. We believe this happens because Cerberus uses the value of DateCreated to calculate the timeout, which means it cancels queue entries even when the corresponding executions are still running.
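One possible direction, sketched here purely as an illustration (this is not an actual Cerberus patch, and it assumes DateModif is refreshed whenever a queue entry changes state): base the timeout on the last modification date rather than on DateCreated, so an entry that recently moved to EXECUTING is not swept up just because it was queued a long time ago:

```sql
-- Hypothetical variant of the cancellation query: measure age from
-- DateModif instead of DateCreated, so a test that started running
-- recently is not cancelled because of time spent waiting in the queue.
UPDATE testcaseexecutionqueue
SET `State` = 'CANCELLED', `RequestDate` = now(), `DateModif` = now(), `comment` = 'Cancelled by automatic job.'
WHERE TO_SECONDS(now()) - TO_SECONDS(DateModif) > 300 -- cerberus_automaticqueuecancellationjob_timeout value
  AND `State` IN ('WAITING','STARTING','EXECUTING')
;
```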
To replicate this, we lowered the cerberus_automaticqueuecancellationjob_timeout parameter to 300 seconds (5 minutes) and launched a campaign with roughly 1000 tests.
ittaLaRedoute changed the title from "Queue cancellation job causes executions to ignore limits" to "[LR] Queue cancellation job causes executions to ignore limits" on Aug 2, 2023.
Hello @ittaLaRedoute
The cerberus_automaticqueuecancellationjob_timeout parameter is used to clean up old queue entries that could be stuck in EXECUTING status when in fact the execution either finished or crashed without having had time to move the queue entry to the DONE status.
That parameter should always be higher than the duration of the longest test that you ever run.
If cerberus_automaticqueuecancellationjob_timeout was at 300 when you hit the issue, that means all 72 of your running test cases last more than 300 seconds.
In that case the best option is to increase the parameter to the duration of the longest test, plus a few minutes as a safety margin.
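If you prefer to adjust it directly in the database rather than through the administration screen, here is a sketch that assumes the standard Cerberus parameter table (the table name `parameter` and the columns `param` and `value` are assumptions here; verify them against your schema first):

```sql
-- Assumed schema: Cerberus global settings live in a `parameter` table.
-- 7200 is only an example: pick your longest test duration in seconds,
-- plus a safety margin of a few minutes.
UPDATE parameter
SET `value` = '7200'
WHERE `param` = 'cerberus_automaticqueuecancellationjob_timeout';
```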
BTW, the default value for that parameter is 3600 s (1 hour). I don't know why you lowered it to 300?
That is for sure very short.
A queue entry that stays in EXECUTING for more than 1 hour is definitely not normal; in that case the test is certainly no longer running on the hub side. The job moves the entry to CANCELLED in order to free the slot and allow more executions to be submitted.
Everything that you describe is the expected behaviour. The only thing that is not normal is the value of that parameter, which is far too low in your context.
Behaviour replicated on Cerberus version:
4.17-SNAPSHOT-1747 Build 20230129-153723
4.17-SNAPSHOT-1749 Build 20230731-210709