queues: fix interactions with the scheduler paused and task held states #4620
Tested manually @ 124b2b2 on Mac OS, passed the following:
Haven't had a proper look at the code but I cannot reproduce the bug on this branch 👍
FYI: I'm taking a look at preventing tasks being released from queues whilst the Scheduler is paused, as I think with this proposed solution the held tasks are occupying queue slots (because they were held after being de-queued).
For what it's worth / comparison purposes, here is what I came up with: master...MetRonnie:pause-hold-resume-bug
Only problem is you get this warning for some reason
Edit: Ah I didn't see you'd already come up with this here #4278 (comment)
Unfortunately that works because of faulty queueing logic (#4628), because
Unfortunately this issue has revealed further problems:
I think they can be fixed here fairly simply by:
Ok, implemented the above with one additional modification to make
Re-wrote the description to match the new approach.
Looks good, apart from the new integration test, which seems to be flaky on GH Actions.
* Closes:
  * cylc#4278
  * cylc#4627
  * cylc#4628
* Makes the following changes:
  * Held tasks are no longer released from queues.
  * `pre_prep_tasks` (previously `pre_submit_tasks`) are now included with active tasks for the computation of queue limits.
  * Queues are no longer processed whilst the workflow is paused.
* Users should now be able to safely hold/release tasks:
  * When they are not in the pool (future tasks).
  * When they are in the pool but not yet queued.
  * When they are queued.
  * When they are in `pre_prep_tasks` (previously `pre_submit_tasks`), which is an intermediary state tasks pass through *after* they have been released from the queue but *before* they are passed into the job preparation pipeline (and acquired the preparing job status).
* Tasks can also be held whilst they are preparing, submitted & running; however, this will continue to have no effect (except on automatic retries, note `cylc kill`).
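For illustration only, the three queue rules in the commit message above can be sketched as a toy model. This is not the cylc implementation: `Task`, `LimitedQueue` and `release` are hypothetical names invented for this example, and the real scheduler tracks far more state.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a task proxy."""
    name: str
    is_held: bool = False


@dataclass
class LimitedQueue:
    """Hypothetical queue with a limit on concurrently active tasks."""
    limit: int
    queued: deque = field(default_factory=deque)

    def release(self, n_active, n_pre_prep, paused):
        """Return the tasks that may leave the queue right now."""
        # Rule 3: queues are not processed whilst the workflow is paused.
        if paused:
            return []
        # Rule 2: tasks released but not yet preparing still occupy slots.
        free = self.limit - (n_active + n_pre_prep)
        released = []
        kept = deque()
        while self.queued:
            task = self.queued.popleft()
            # Rule 1: held tasks stay in the queue (order is preserved).
            if free > 0 and not task.is_held:
                released.append(task)
                free -= 1
            else:
                kept.append(task)
        self.queued = kept
        return released
```

In this sketch a held task at the front of the queue does not block the tasks behind it, and it does not consume a slot either, since it never leaves the queue while held.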
@MetRonnie Haven't managed to replicate the flakiness locally, but have pushed up a commit which I think will help. It adds a new integration fixture that allows us to start workflows without running the main loop, which should remove an unnecessary moving part. Can't say if that was causing the issue though.
* Preserves the existing `run` fixture.
* Adds a new `start` fixture which does everything `run` does, except running the main loop, which should be unnecessary for most integration test purposes.
* This reduces the number of moving parts and avoids the main loop interacting with tests in unintended ways.
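As a rough sketch of the distinction described above, the two fixtures can be modelled as async context managers around a stand-in scheduler. Everything here is an assumption made for illustration: `FakeScheduler` and its methods are hypothetical, and the real fixtures are pytest fixtures wrapping the actual Scheduler rather than plain context managers.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeScheduler:
    """Hypothetical stand-in for a scheduler object."""

    def __init__(self):
        self.initialised = False
        self.loop_iterations = 0
        self.stopped = False

    async def initialise(self):
        self.initialised = True

    async def main_loop(self):
        # Keeps mutating state in the background until shut down.
        while not self.stopped:
            self.loop_iterations += 1
            await asyncio.sleep(0)

    async def shutdown(self):
        self.stopped = True


@asynccontextmanager
async def start(schd):
    """Initialise the scheduler but never run its main loop."""
    await schd.initialise()
    try:
        yield schd
    finally:
        await schd.shutdown()


@asynccontextmanager
async def run(schd):
    """Initialise the scheduler and run its main loop concurrently."""
    await schd.initialise()
    loop_task = asyncio.create_task(schd.main_loop())
    try:
        yield schd
    finally:
        await schd.shutdown()
        await loop_task
```

With `start`, nothing changes the scheduler's state behind the test's back, so assertions made inside the `async with` block only see what the test itself did.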
Looks like it worked, re-running to be safe...
Test failure in 4/4 totally unrelated - #4633
    # put things back the way we found them
    for itask in schd.pool.get_all_tasks():
        itask.state.reset(TASK_STATUS_WAITING)
        schd.data_store_mgr.delta_task_state(itask)
    await schd.update_data_structure()
A less hacky way might be to add a non-module-scoped version of the `harness` fixture and use that instead? We shouldn't really be mutating module-scoped fixture data.
This test needs a larger re-think - #4175 - so I've just patched it for now.
The tests actually rely on previous tests mutating the data.
Someone needs to go through and straighten them out but I'm not 100% on the interactions it is trying to cover.
Partial review posted to show that I'm looking
- Checking against three tickets marked as closed by change.
- Check the code changes.
- Check the test changes.
- Check the source against the bullet points in the PR description
- Had a really good go at breaking this logic
Edit: meant to post this as a comment, not an approval - have re-requested my review.
The fixed logic looks good. Had a good play with it, no problems found 👍
`play --pause` can make held tasks run #4278
`pre_prep_tasks` (previously `pre_submit_tasks`) are now included with active tasks for the computation of queue limits.
`pre_prep_tasks` (previously `pre_submit_tasks`) which is an intermediary state tasks pass through after they have been released from the queue but before they are passed into the job preparation pipeline (and acquired the preparing job status).
however, this will continue to have no effect (except on automatic retries, note `cylc kill`).
Requirements check-list
`CONTRIBUTING.md` and added my name as a Code Contributor.
`setup.cfg` and `conda-environment.yml`.