
queues: fix interactions with the scheduler paused and task held states #4620

Merged
4 commits merged into cylc:master on Feb 11, 2022

Conversation

@oliver-sanders (Member) commented Jan 27, 2022

[edit] pushed up a new commit with a new approach after discovering the problem went deeper than initially thought.

There are now three issues; each has its own steps to reproduce and an integration test.
The new integration tests run on master, but you will need to overwrite `tests/integration/conftest.py` and run `sed -i 's/pre_submit_tasks/prep_submit_tasks/' cylc/flow/scheduler.py`.

  • Closes: cylc#4278, cylc#4627, cylc#4628
  • Makes the following changes (a toy sketch of the new behaviour follows this list):
    • Held tasks are no longer released from queues.
    • pre_prep_tasks (previously pre_submit_tasks) are now included
      with active tasks for the computation of queue limits.
    • Queues are no longer processed whilst the workflow is paused.
  • Users should now be able to safely hold/release tasks:
    • When they are not in the pool (future tasks).
    • When they are in the pool but not yet queued.
    • When they are queued.
    • When they are in pre_prep_tasks (previously pre_submit_tasks), which
      is an intermediary state tasks pass through after they have been
      released from the queue but before they are passed into the job
      preparation pipeline (and acquire the preparing job status).
  • Tasks can also be held whilst they are preparing, submitted & running;
    however, this will continue to have no effect (except on automatic
    retries; note cylc kill).
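For illustration only, here is a minimal, self-contained sketch (toy names and data structures, not the real cylc-flow code) of the scheduler-side behaviour described above: nothing is released while the workflow is paused, and pre_prep_tasks count towards the queue limit alongside active tasks. The held-task check itself belongs inside the queue implementations (see the discussion further down).

```python
def release_queued_tasks(is_paused, queued, limit, active, pre_prep_tasks):
    """Return tasks released from a single limited queue this iteration."""
    if is_paused:
        return []  # queues are not processed while the workflow is paused
    # pre_prep_tasks occupy queue slots just like submitted/running tasks
    n_free = max(0, limit - len(active) - len(pre_prep_tasks))
    released = queued[:n_free]
    del queued[:n_free]
    return released


# e.g. limit 3, one active task, one task already in pre_prep_tasks:
# only one more task can be released this time around
print(release_queued_tasks(False, ['a', 'b', 'c'], 3, ['x'], ['y']))  # ['a']
print(release_queued_tasks(True, ['a', 'b', 'c'], 3, [], []))         # []
```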

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation update required.

@oliver-sanders added this to the cylc-8.0rc1 milestone on Jan 27, 2022
@oliver-sanders self-assigned this on Jan 27, 2022
@oliver-sanders added the bug label ("Something is wrong :(") on Jan 27, 2022
@oliver-sanders changed the title from "hold/release: prevnet held tasks running when workflow resumed" to "hold/release: prevent held tasks running when workflow resumed" on Jan 27, 2022
@oliver-sanders (Member, Author):

Tested manually @ 124b2b2 on macOS; passed the following:

  • Unit
  • Integration
  • _local_background_indep_tcp
  • _remote_background_indep_tcp
  • _remote_background_indep_poll

@MetRonnie (Member):

Haven't had a proper look at the code but I cannot reproduce the bug on this branch 👍

@oliver-sanders (Member, Author):

FYI: I'm taking a look at preventing tasks from being released from queues whilst the scheduler is paused, as I think that with this proposed solution the held tasks end up occupying queue slots (because they were held after being dequeued).

@MetRonnie (Member) commented Jan 28, 2022

For what it's worth / comparison purposes, here is what I came up with: master...MetRonnie:pause-hold-resume-bug

The only problem is that you get this warning for some reason:

WARNING - Unhandled jobs-submit output: 2022-01-28T12:01:13Z|1/foo/01|0|58817
WARNING - ('1', 'foo', '01')

Edit: Ah, I didn't see you'd already come up with this here: #4278 (comment)

@oliver-sanders (Member, Author):

For what it's worth / comparison purposes, here is what I came up with

Unfortunately, that works only because of faulty queueing logic (#4628): since pre_submit_tasks have been dequeued, they should be occupying queue slots and preventing other tasks from running. So if you were to hold all tasks and then release a few from the bottom of the queue, you would expect the workflow to hang with this logic: the first few held tasks would be dequeued and placed in pre_submit_tasks, where they would get stuck (because they would never get into tasks_to_submit), but no new tasks could be dequeued until those had run, which they won't because they are held.
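To make the scenario concrete, here is a toy walk-through (illustrative names and structures only, not the real implementation) of the hang you would expect once pre_submit_tasks are correctly treated as occupying queue slots:

```python
# hold everything, then release 'c' and 'd' from the bottom of a limit-2 queue
queue = [
    {'name': 'a', 'is_held': True},
    {'name': 'b', 'is_held': True},
    {'name': 'c', 'is_held': False},  # released by the user
    {'name': 'd', 'is_held': False},  # released by the user
]
limit = 2
pre_submit_tasks = []

# dequeue up to the limit without checking the held flag
while queue and len(pre_submit_tasks) < limit:
    pre_submit_tasks.append(queue.pop(0))

# held tasks are only filtered out *after* being dequeued, so they never run,
# yet they keep occupying both queue slots
tasks_to_submit = [t for t in pre_submit_tasks if not t['is_held']]

print([t['name'] for t in pre_submit_tasks])  # ['a', 'b'] -> stuck
print([t['name'] for t in tasks_to_submit])   # []         -> nothing submits; hang
```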

@oliver-sanders (Member, Author):

Unfortunately this issue has revealed further problems (#4627, #4628):

I think they can be fixed here fairly simply by:

  • Preventing held tasks from being dequeued (roughly as sketched below).
    • This has to be implemented at the queue level, so it will sadly have to be done for every queue implementation.
  • Preventing tasks from being dequeued whilst the scheduler is paused.
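Roughly, the queue-level check might look like the toy class below (illustrative only, not a real cylc task_queues implementation): held tasks are simply never dequeued and keep their place in the queue.

```python
from collections import deque


class ToyLimitedQueue:
    """Toy queue in which held tasks are never released."""

    def __init__(self, limit):
        self.limit = limit
        self.deque = deque()

    def release(self, active_count):
        """Release ready tasks up to the limit, skipping held tasks."""
        released = []
        held_back = []
        while self.deque and active_count + len(released) < self.limit:
            task = self.deque.popleft()
            if task['is_held']:
                held_back.append(task)  # stays queued, keeps its position
                continue
            released.append(task)
        # put the held tasks back at the front, in their original order
        self.deque.extendleft(reversed(held_back))
        return released


q = ToyLimitedQueue(2)
q.deque.extend([{'name': 'a', 'is_held': True}, {'name': 'b', 'is_held': False}])
print([t['name'] for t in q.release(active_count=0)])  # ['b']; 'a' stays queued
```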

@oliver-sanders marked this pull request as draft on January 28, 2022 12:48
@oliver-sanders changed the title from "hold/release: prevent held tasks running when workflow resumed" to "queues: fix interactions with the scheduler paused and task held states" on Jan 28, 2022
@oliver-sanders (Member, Author):

OK, implemented the above, with one additional modification to make pre_prep_tasks count towards the queue limits.

Re-wrote the description to match the new approach.

@oliver-sanders marked this pull request as ready for review on January 28, 2022 16:42

@MetRonnie (Member) left a comment:

Looks good, apart from the new integration test, which seems to be flaky on GH Actions.

* Closes:
  * cylc#4278
  * cylc#4627
  * cylc#4628
* Makes the following changes:
  * Held tasks are no longer released from queues.
  * `pre_prep_tasks` (previously `pre_submit_tasks`) are now included
    with active tasks for the computation of queue limits.
  * Queues are no longer processed whilst the workflow is paused.
* Users should now be able to safely hold/release tasks:
  * When they are not in the pool (future tasks).
  * When they are in the pool but not yet queued.
  * When they are queued.
  * When they are in `pre_prep_tasks` (previously `pre_submit_tasks`), which
    is an intermediary state tasks pass through *after* they have been
    released from the queue but *before* they are passed into the job
    preparation pipeline (and acquire the preparing job status).
* Tasks can also be held whilst they are preparing, submitted & running;
  however, this will continue to have no effect (except on automatic
  retries; note `cylc kill`).
@oliver-sanders (Member, Author):

@MetRonnie Haven't managed to replicate the flakiness locally; have pushed up a commit which I think will help.

It adds a new integration fixture that allows us to start workflows without running the main loop, which should remove an unnecessary moving part. Can't say if that was causing the issue though.

* Preserves the existing `run` fixture.
* Adds a new `start` fixture which does everything `run` does, except
  running the main loop, which should be unnecessary for most integration
  test purposes.
* This reduces the number of moving parts and avoids the main loop
  interacting with tests in unintended ways.
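As a rough illustration of the intent (the fixture names come from the description above; the exact call signatures are my assumption, not verified against conftest.py), a test using the new fixture might look something like:

```python
async def test_without_main_loop(flow, scheduler, start):
    """Start a scheduler without its main loop running behind the test."""
    id_ = flow({'scheduling': {'graph': {'R1': 'foo'}}})
    schd = scheduler(id_, paused_start=True)
    async with start(schd):
        # the scheduler is fully started (task pool, data store, ...) but the
        # main loop is not running, so state only changes when the test
        # drives it explicitly
        assert schd.pool.get_all_tasks()
```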
@oliver-sanders (Member, Author):

Looks like it worked, re-running to be safe...

@oliver-sanders (Member, Author):

Test failure in 4/4 totally unrelated - #4633

Comment on lines +179 to +183
# put things back the way we found them
for itask in schd.pool.get_all_tasks():
    itask.state.reset(TASK_STATUS_WAITING)
    schd.data_store_mgr.delta_task_state(itask)
await schd.update_data_structure()
A reviewer (Member) commented:

A less hacky way might be to add a non-module-scoped version of the harness fixture and use that instead? We shouldn't really be mutating module-scoped fixture data.

@oliver-sanders (Member, Author) commented Jan 31, 2022

This test needs a larger re-think - #4175 - so I've just patched it for now.

The tests actually rely on previous tests mutating the data.

Someone needs to go through and straighten them out, but I'm not 100% sure of the interactions it is trying to cover.
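For reference, the general pytest pattern being suggested looks something like this (generic example, not the actual cylc conftest): keep the expensive module-scoped fixture, and hand each test a function-scoped copy so mutations cannot leak between tests.

```python
import copy

import pytest


@pytest.fixture(scope='module')
def shared_state():
    # built once per test module; mutating this would leak into later tests
    return {'statuses': ['waiting', 'waiting']}


@pytest.fixture
def state(shared_state):
    # built once per test; safe to mutate
    return copy.deepcopy(shared_state)


def test_can_mutate_freely(state):
    state['statuses'][0] = 'succeeded'
    assert state['statuses'] == ['succeeded', 'waiting']


def test_sees_pristine_data(state):
    # unaffected by the previous test's mutation
    assert state['statuses'] == ['waiting', 'waiting']
```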

@wxtim (Member) left a comment:

Partial review posted to show that I'm looking

  • Checking against three tickets marked as closed by change.
  • Check the code changes.
  • Check the test changes.
  • Check the source against the bullet points in the PR description.
  • Had a really good go at breaking this logic.

Edit: meant to post this as a comment, not an approval - have re-requested my review.

@wxtim self-requested a review on February 9, 2022 13:13
@hjoliver (Member) left a comment:

The fixed logic looks good. Had a good play with it, no problems found 👍

@hjoliver merged commit 9a10c85 into cylc:master on Feb 11, 2022
@oliver-sanders deleted the 4278 branch on February 11, 2022 08:57