[LR] Queue cancellation job causes executions to ignore limits #2472

Closed
ittaLaRedoute opened this issue Aug 2, 2023 · 2 comments

Comments

@ittaLaRedoute

In earlier testing we found that when the cerberus_automaticqueuecancellationjob_timeout delay expires while many tests are in the queue, Cerberus tries to execute everything in the queue.

We may have pinned the problem down to the updateToCancelledOldRecord method in TestCaseExecutionQueueDAO.java.

It's calling the following SQL:

UPDATE testcaseexecutionqueue
SET `State` = 'CANCELLED', `RequestDate` = now(), `DateModif` = now(), `comment` = 'Cancelled by automatic job.'
WHERE TO_SECONDS(now()) - TO_SECONDS(DateCreated) > 300 -- cerberus_automaticqueuecancellationjob_timeout value
AND `State` IN ('WAITING','STARTING','EXECUTING')
;

If 72 tests are running (the ROBOTHOST invariant limit) and their entries are cancelled in testcaseexecutionqueue, the tests are not cancelled in testcaseexecution and keep executing. Because the EXECUTING queue entries were just cancelled, Cerberus tries to run another 72 tests on top of the 72 still running, so 144 tests are now running. This quickly gets out of control: hundreds of tests execute at the same time, which breaks the Selenium hub. We think the cause is that Cerberus computes the timeout from DateCreated, so it cancels queue entries even though their executions are now running.
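The slot accounting described above can be illustrated with a small simulation. This is only a sketch and not Cerberus code: the function and variable names are hypothetical, and it assumes that no test finishes during the simulation and that the scheduler refills every slot it believes to be free.

```python
ROBOT_LIMIT = 72  # concurrent-execution limit from the issue (ROBOTHOST invariant)


def simulate(cancel_rounds, limit=ROBOT_LIMIT):
    """Return the number of truly running tests after each scheduling round.

    Assumptions of this sketch: no test finishes, and the scheduler only
    looks at queue entries in state EXECUTING to decide how many slots are free.
    """
    really_running = 0   # executions still alive on the Selenium hub
    queue_executing = 0  # queue entries in state EXECUTING (what the scheduler counts)
    history = []
    for _ in range(cancel_rounds + 1):
        # The scheduler fills the slots it sees as free in the queue table.
        started = limit - queue_executing
        really_running += started
        queue_executing += started
        history.append(really_running)
        # The cancellation job flips the queue entries to CANCELLED,
        # but the executions themselves keep running.
        queue_executing = 0
    return history


print(simulate(2))  # [72, 144, 216]
```

Each cancellation round frees 72 slots in the queue table without stopping anything on the hub, so the real load grows by 72 per round.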

To replicate this we lowered the value of the cerberus_automaticqueuecancellationjob_timeout parameter to 300 seconds (5 minutes) and launched a campaign with roughly 1000 tests.

Behaviour replicated on Cerberus versions:
4.17-SNAPSHOT-1747 Build 20230129-153723
4.17-SNAPSHOT-1749 Build 20230731-210709

@ittaLaRedoute ittaLaRedoute changed the title Queue cancellation job causes executions to ignore limits [LR] Queue cancellation job causes executions to ignore limits Aug 2, 2023
@vertigo17
Member

Hello @ittaLaRedoute
The cerberus_automaticqueuecancellationjob_timeout parameter is used to clean up old queue entries that are stuck in EXECUTING status when the execution has in fact either finished or crashed without having time to move the queue entry to DONE.
That parameter should always be higher than the longest test you ever run.
If the parameter was set to 300 when you hit the issue, that means all 72 of your test cases last more than 300 seconds.
In that case the best option is to increase the parameter to the duration of the longest test plus a few minutes as a safety margin.
BTW, the default value for that parameter is 3600 s (1 hour). I don't know why you lowered it to 300?
That is very short.
A queue entry that stays in EXECUTING for more than 1 hour is not normal; in that case the test is certainly no longer running on the hub side. The job moves the entry to CANCELLED in order to free the slot and allow more executions to be submitted.
Everything you describe is the expected behaviour. The only abnormal thing is the value of that parameter, which is way too low in your context.

@ittaLaRedoute
Author

After discussion with Benoit, we confirm that this is a bug. A test that stays too long in the queue risks being cancelled as soon as its execution starts.

vertigo17 added a commit that referenced this issue Aug 4, 2023
…not not consider the timeout from the time it was inserted to the queue but the time when the execution was triggered. #2472
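
The difference between the two ways of measuring the timeout can be sketched as follows. This is an illustration only, not the code from the referenced commit: the field name "triggered" and both function names are hypothetical, and the 300 second value is the one used in the reproduction.

```python
from datetime import datetime, timedelta

TIMEOUT_S = 300  # value used in the reproduction


def stale_by_creation(entry, now, timeout_s=TIMEOUT_S):
    """Age measured from DateCreated, as in the SQL quoted in the report."""
    return (now - entry["created"]).total_seconds() > timeout_s


def stale_by_trigger(entry, now, timeout_s=TIMEOUT_S):
    """Age measured from when the execution was triggered.

    "triggered" is a hypothetical field name used only for this sketch.
    """
    return (now - entry["triggered"]).total_seconds() > timeout_s


now = datetime(2023, 8, 2, 12, 0, 0)
# An entry that waited 20 minutes in the queue and started executing 30 seconds ago.
entry = {
    "created": now - timedelta(minutes=20),
    "triggered": now - timedelta(seconds=30),
}

print(stale_by_creation(entry, now))  # True: cancelled although it has just started
print(stale_by_trigger(entry, now))   # False: left alone until it has run for 300 s
```

With the creation-based check, any entry that waited longer than the timeout is cancelled right after it starts; with the trigger-based check, waiting time in the queue no longer counts against the execution.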