
queues: fix interactions with the scheduler paused and task held states #4620

Merged
4 commits merged into cylc:master on Feb 11, 2022

Conversation

@oliver-sanders (Member) commented Jan 27, 2022

[edit] pushed up a new commit with a new approach after discovering the problem went deeper than initially thought.

There are now three issues; each has its own steps to reproduce and an integration test.
The new integration tests run on master, but you will need to overwrite `tests/integration/conftest.py` and run `sed -i 's/pre_submit_tasks/prep_submit_tasks/' cylc/flow/scheduler.py`.

  • Closes: cylc#4278, cylc#4627, cylc#4628
  • Makes the following changes (a toy sketch of the new behaviour follows this list):
    • Held tasks are no longer released from queues.
    • pre_prep_tasks (previously pre_submit_tasks) are now included
      with active tasks for the computation of queue limits.
    • Queues are no longer processed whilst the workflow is paused.
  • Users should now be able to safely hold/release tasks:
    • When they are not in the pool (future tasks).
    • When they are in the pool but not yet queued.
    • When they are queued.
    • When they are in pre_prep_tasks (previously pre_submit_tasks), which
      is an intermediary state tasks pass through after they have been
      released from the queue but before they are passed into the job
      preparation pipeline (and acquire the preparing job status).
  • Tasks can also be held whilst they are preparing, submitted & running;
    however, this will continue to have no effect (except on automatic
    retries; note cylc kill).
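For illustration only, here is a minimal, self-contained sketch (toy names and data structures, not the real cylc-flow code) of the scheduler-side behaviour described above: nothing is released while the workflow is paused, and pre_prep_tasks count towards the queue limit alongside active tasks. The held-task check itself belongs inside the queue implementations (see the discussion further down).

```python
def release_queued_tasks(is_paused, queued, limit, active, pre_prep_tasks):
    """Return tasks released from a single limited queue this iteration."""
    if is_paused:
        return []  # queues are not processed while the workflow is paused
    # pre_prep_tasks occupy queue slots just like submitted/running tasks
    n_free = max(0, limit - len(active) - len(pre_prep_tasks))
    released = queued[:n_free]
    del queued[:n_free]
    return released


# e.g. limit 3, one active task, one task already in pre_prep_tasks:
# only one more task can be released this time around
print(release_queued_tasks(False, ['a', 'b', 'c'], 3, ['x'], ['y']))  # ['a']
print(release_queued_tasks(True, ['a', 'b', 'c'], 3, [], []))         # []
```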

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation update required.

@oliver-sanders added this to the cylc-8.0rc1 milestone on Jan 27, 2022
@oliver-sanders self-assigned this on Jan 27, 2022
@oliver-sanders added the bug label ("Something is wrong :(") on Jan 27, 2022
@oliver-sanders changed the title from "hold/release: prevnet held tasks running when workflow resumed" to "hold/release: prevent held tasks running when workflow resumed" on Jan 27, 2022
@oliver-sanders (Member, Author):

Tested manually @ 124b2b2 on macOS; passed the following:

  • Unit
  • Integration
  • _local_background_indep_tcp
  • _remote_background_indep_tcp
  • _remote_background_indep_poll

@MetRonnie (Member):

Haven't had a proper look at the code but I cannot reproduce the bug on this branch 👍

@oliver-sanders (Member, Author):

FYI: I'm taking a look at preventing tasks from being released from queues whilst the scheduler is paused, as I think that with this proposed solution the held tasks end up occupying queue slots (because they were held after being dequeued).

@MetRonnie (Member) commented Jan 28, 2022

For what it's worth / comparison purposes, here is what I came up with: master...MetRonnie:pause-hold-resume-bug

The only problem is that you get this warning for some reason:

WARNING - Unhandled jobs-submit output: 2022-01-28T12:01:13Z|1/foo/01|0|58817
WARNING - ('1', 'foo', '01')

Edit: Ah, I didn't see you'd already come up with this here: #4278 (comment)

@oliver-sanders (Member, Author):

For what it's worth / comparison purposes, here is what I came up with

Unfortunately, that works only because of faulty queueing logic (#4628): since pre_submit_tasks have been dequeued, they should be occupying queue slots and preventing other tasks from running. So if you were to hold all tasks and then release a few from the bottom of the queue, you would expect the workflow to hang with this logic: the first few held tasks would be dequeued and placed in pre_submit_tasks, where they would get stuck (because they would never get into tasks_to_submit), but no new tasks could be dequeued until those had run, which they won't because they are held.
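To make the scenario concrete, here is a toy walk-through (illustrative names and structures only, not the real implementation) of the hang you would expect once pre_submit_tasks are correctly treated as occupying queue slots:

```python
# hold everything, then release 'c' and 'd' from the bottom of a limit-2 queue
queue = [
    {'name': 'a', 'is_held': True},
    {'name': 'b', 'is_held': True},
    {'name': 'c', 'is_held': False},  # released by the user
    {'name': 'd', 'is_held': False},  # released by the user
]
limit = 2
pre_submit_tasks = []

# dequeue up to the limit without checking the held flag
while queue and len(pre_submit_tasks) < limit:
    pre_submit_tasks.append(queue.pop(0))

# held tasks are only filtered out *after* being dequeued, so they never run,
# yet they keep occupying both queue slots
tasks_to_submit = [t for t in pre_submit_tasks if not t['is_held']]

print([t['name'] for t in pre_submit_tasks])  # ['a', 'b'] -> stuck
print([t['name'] for t in tasks_to_submit])   # []         -> nothing submits; hang
```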

@oliver-sanders (Member, Author):

Unfortunately this issue has revealed further problems (#4627, #4628):

I think they can be fixed here fairly simply by:

  • Preventing held tasks from being dequeued (roughly as sketched below).
    • This has to be implemented at the queue level, so it will sadly have to be done for every queue implementation.
  • Preventing tasks from being dequeued whilst the scheduler is paused.
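Roughly, the queue-level check might look like the toy class below (illustrative only, not a real cylc task_queues implementation): held tasks are simply never dequeued and keep their place in the queue.

```python
from collections import deque


class ToyLimitedQueue:
    """Toy queue in which held tasks are never released."""

    def __init__(self, limit):
        self.limit = limit
        self.deque = deque()

    def release(self, active_count):
        """Release ready tasks up to the limit, skipping held tasks."""
        released = []
        held_back = []
        while self.deque and active_count + len(released) < self.limit:
            task = self.deque.popleft()
            if task['is_held']:
                held_back.append(task)  # stays queued, keeps its position
                continue
            released.append(task)
        # put the held tasks back at the front, in their original order
        self.deque.extendleft(reversed(held_back))
        return released


q = ToyLimitedQueue(2)
q.deque.extend([{'name': 'a', 'is_held': True}, {'name': 'b', 'is_held': False}])
print([t['name'] for t in q.release(active_count=0)])  # ['b']; 'a' stays queued
```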

@oliver-sanders marked this pull request as draft on January 28, 2022 12:48
@oliver-sanders changed the title from "hold/release: prevent held tasks running when workflow resumed" to "queues: fix interactions with the scheduler paused and task held states" on Jan 28, 2022
@oliver-sanders (Member, Author):

OK, implemented the above, with one additional modification to make pre_prep_tasks count towards the queue limits.

Re-wrote the description to match the new approach.

@oliver-sanders marked this pull request as ready for review on January 28, 2022 16:42

@MetRonnie (Member) left a comment:

Looks good, apart from the new integration test, which seems to be flaky on GH Actions.

* Closes:
  * cylc#4278
  * cylc#4627
  * cylc#4628
* Makes the following changes:
  * Held tasks are no longer released from queues.
  * `pre_prep_tasks` (previously `pre_submit_tasks`) are now included
    with active tasks for the computation of queue limits.
  * Queues are no longer processed whilst the workflow is paused.
* Users should now be able to safely hold/release tasks:
  * When they are not in the pool (future tasks).
  * When they are in the pool but not yet queued.
  * When they are queued.
  * When they are in `pre_prep_tasks` (previously `pre_submit_tasks`), which
    is an intermediary state tasks pass through *after* they have been
    released from the queue but *before* they are passed into the job
    preparation pipeline (and acquire the preparing job status).
* Tasks can also be held whilst they are preparing, submitted & running;
  however, this will continue to have no effect (except on automatic
  retries; note `cylc kill`).
@oliver-sanders (Member, Author):

@MetRonnie Haven't managed to replicate the flakiness locally; have pushed up a commit which I think will help.

It adds a new integration fixture that allows us to start workflows without running the main loop, which should remove an unnecessary moving part. Can't say if that was causing the issue though.

* Preserves the existing `run` fixture.
* Adds a new `start` fixture which does everything `run` does, except
  running the main loop, which should be unnecessary for most integration
  test purposes.
* This reduces the number of moving parts and avoids the main loop
  interacting with tests in unintended ways.
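As a rough illustration of the intent (the fixture names come from the description above; the exact call signatures are my assumption, not verified against conftest.py), a test using the new fixture might look something like:

```python
async def test_without_main_loop(flow, scheduler, start):
    """Start a scheduler without its main loop running behind the test."""
    id_ = flow({'scheduling': {'graph': {'R1': 'foo'}}})
    schd = scheduler(id_, paused_start=True)
    async with start(schd):
        # the scheduler is fully started (task pool, data store, ...) but the
        # main loop is not running, so state only changes when the test
        # drives it explicitly
        assert schd.pool.get_all_tasks()
```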
@oliver-sanders (Member, Author):

Looks like it worked, re-running to be safe...

@oliver-sanders (Member, Author):

Test failure in 4/4 totally unrelated - #4633

Comment on lines +179 to +183
# put things back the way we found them
for itask in schd.pool.get_all_tasks():
    itask.state.reset(TASK_STATUS_WAITING)
    schd.data_store_mgr.delta_task_state(itask)
await schd.update_data_structure()
A reviewer (Member) commented:

A less hacky way might be to add a non-module-scoped version of the harness fixture and use that instead? We shouldn't really be mutating module-scoped fixture data.

@oliver-sanders (Member, Author) commented Jan 31, 2022

This test needs a larger re-think - #4175 - so I've just patched it for now.

The tests actually rely on previous tests mutating the data.

Someone needs to go through and straighten them out, but I'm not 100% sure of the interactions it is trying to cover.
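For reference, the general pytest pattern being suggested looks something like this (generic example, not the actual cylc conftest): keep the expensive module-scoped fixture, and hand each test a function-scoped copy so mutations cannot leak between tests.

```python
import copy

import pytest


@pytest.fixture(scope='module')
def shared_state():
    # built once per test module; mutating this would leak into later tests
    return {'statuses': ['waiting', 'waiting']}


@pytest.fixture
def state(shared_state):
    # built once per test; safe to mutate
    return copy.deepcopy(shared_state)


def test_can_mutate_freely(state):
    state['statuses'][0] = 'succeeded'
    assert state['statuses'] == ['succeeded', 'waiting']


def test_sees_pristine_data(state):
    # unaffected by the previous test's mutation
    assert state['statuses'] == ['waiting', 'waiting']
```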

@wxtim (Member) left a comment:

Partial review posted to show that I'm looking

  • Checking against three tickets marked as closed by change.
  • Check the code changes.
  • Check the test changes.
  • Check the source against the bullet points in the PR description.
  • Had a really good go at breaking this logic.

Edit: meant to post this as a comment, not an approval - have re-requested my review.

@wxtim self-requested a review on February 9, 2022 13:13
@hjoliver (Member) left a comment:

The fixed logic looks good. Had a good play with it, no problems found 👍

@hjoliver merged commit 9a10c85 into cylc:master on Feb 11, 2022
@oliver-sanders deleted the 4278 branch on February 11, 2022 08:57