Get poll to return task failure if job/log has been removed. #6577

wxtim · 2025-01-28T16:17:00Z

Note

Note to reviewers, you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems.

Finally have a reproducible example (thank you @oliver-sanders):

[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task

[runtime]
    [[task]]
        script = """
            rm ${CYLC_WORKFLOW_RUN_DIR}/.service/contact
            rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
        """
        platform = _remote_pbs

Check List

I have read CONTRIBUTING.md and added my name as a Code Contributor.
Contains logically grouped changes (else tidy your branch by rebase).
Does not contain off-topic changes (use other PRs for other changes).
Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
Tests are included (or explain why tests are not needed).
Changelog entry included if this is a change that can affect users
Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

wxtim · 2025-01-28T17:07:47Z

tests/unit/test_job_runner_mgr.py

Only test__job_poll_status_files_deleted_logdir is directly related to the PR. Other tests should increase coverage. 😄

added unit tests for JobRunnerMgr._jobs_poll_status_files test the task_job_mgr end

oliver-sanders

LGTM.

oliver-sanders · 2025-02-03T11:35:20Z

Note to reviewers, you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems.

MetRonnie · 2025-02-13T12:39:20Z

From the original issue:

> If you delete the job log directory for an active task, Cylc will preserve its last known status indefinitely. I.e., Cylc will consider the job to be submitted/running forever.

This does not seem to be true. It is only true if both the job log dir and the contact file are removed.

> This situation should be handled similarly to the job no longer appearing in the queue, i.e., the job is dead, long live the job. Stick it into the failed/submit-failed state as appropriate.

The job can succeed even if its job log dir is removed from under its feet.

There seems to be a problem here where the job log retrieval process keeps retrying, preventing shutdown without the --now --now option.

[runtime]
    [[task]]
        script = """
            rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
        """
        platform = <remote PBS>
        execution time limit = PT1M
        [[[directives]]]
            -q = shared
            -l ncpus = 1
            -l mem = 100mb

oliver-sanders · 2025-02-13T14:05:54Z

> This does not seem to be true. It is only true if the job log dir and the contact file is removed.

Hmmm, we've seen this with PBS several times.

If you delete the job dir for a submitted task, it cannot run (no job script), so that'll definitely do it.

For a running job, not so sure.

> The job can succeed even if its job log dir is removed from under its feet.

Possibly. But it will likely fail due to IO error and we cannot guarantee that Cylc will be informed of the job's outcome.

... From testing, it looks like echo foo doesn't cause the job to fail despite the missing file. That makes sense for PBS, as it "spools" the job output files in a temp dir, so you have to try harder to delete them.

oliver-sanders

Code makes sense.
Tested against the example in the OP, works well, error message clearly logged.

We should talk this one over in tomorrow's meeting.

MetRonnie · 2025-02-13T17:51:08Z

Another example:

rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
sleep 10
exit 1

This task "fails successfully" (this PR as it currently stands makes no difference) but then the workflow hangs on shutdown after cylc stop.

Edit: the length of the hang depends on the global.cylc[platforms][<name>]retrieve job logs retry delays setting (it does not hang after the final retry). However, it is unclear to the user what is going on.
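
To make the dependence on the retry delays concrete, here is a rough sketch of an upper bound on the wait. It assumes the retries happen one after another, ignores the time each retrieval attempt itself takes, and uses plain seconds rather than the duration syntax of the real setting. The function name and the delay values are hypothetical.

```python
from datetime import timedelta


def max_retrieval_wait(retry_delays_s):
    """Upper bound on the shutdown wait: the sum of all retry delays.

    Hypothetical helper for illustration; not part of Cylc.
    """
    return timedelta(seconds=sum(retry_delays_s))


# Hypothetical delays of 10 s, 30 s and 60 s between retrieval attempts.
delays = [10, 30, 60]
print(max_retrieval_wait(delays))  # 0:01:40
```

With a long list of delays the workflow can therefore appear stuck for a while after cylc stop, even though it will shut down once the final retry has been made.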

MetRonnie · 2025-02-14T16:43:37Z

Discussed today:

Although the job may yet succeed even if the job log dir is deleted, we have decided that it is best to put it in the failed state, as this is the best we can do if we can't poll anymore.
We will leave the job log retrieval as it is. It is user error to delete the job log dir prematurely, so they will have to suffer the consequences.
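
As a rough illustration of the decision above (not the code merged in this PR), the poll outcome could be sketched like this. The function name and the status strings are hypothetical stand-ins, and a real poll would go on to read the job status file when the directory is present.

```python
from pathlib import Path

# Hypothetical status names for illustration; not Cylc's actual constants.
SUBMITTED = "submitted"
RUNNING = "running"
SUBMIT_FAILED = "submit-failed"
FAILED = "failed"


def status_after_poll(job_log_dir: Path, last_known: str) -> str:
    """Return the status to record after a poll attempt.

    If the job log directory is gone, the job can no longer be polled,
    so it is treated as dead: a submitted job becomes submit-failed
    and any other active job becomes failed.
    """
    if job_log_dir.is_dir():
        # Directory still present: keep the last known status in this sketch.
        return last_known
    if last_known == SUBMITTED:
        return SUBMIT_FAILED
    return FAILED
```

The point of the sketch is only the branch on the missing directory: once polling is impossible, the last known status is no longer preserved indefinitely.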

MetRonnie

I have some suggestions at wxtim#72

Simplify poll handling of prematurely deleted job log dir

wxtim · 2025-02-17T11:14:47Z

> I have some suggestions at wxtim#72

I'm happy with your suggestions. You wield your scalpel with a confident hand. I like it.

wxtim self-assigned this Jan 28, 2025

wxtim marked this pull request as draft January 28, 2025 16:17

wxtim added this to the 8.4.1 milestone Jan 28, 2025

wxtim force-pushed the fix.handle_deletion_of_job_logs branch 2 times, most recently from 2661a36 to 01c7cb8 Compare January 28, 2025 17:30

wxtim requested review from MetRonnie and oliver-sanders and removed request for MetRonnie January 28, 2025 17:30

wxtim marked this pull request as ready for review January 28, 2025 17:30

wxtim marked this pull request as draft January 29, 2025 09:24

Get poll to return task failure if job/log has been removed.

31fd08f

added unit tests for JobRunnerMgr._jobs_poll_status_files test the task_job_mgr end

wxtim force-pushed the fix.handle_deletion_of_job_logs branch from 01c7cb8 to 31fd08f Compare January 29, 2025 09:27

wxtim requested a review from MetRonnie January 29, 2025 11:04

wxtim marked this pull request as ready for review January 29, 2025 11:04

wxtim linked an issue Jan 29, 2025 that may be closed by this pull request

handle job log directory deleted for active task #6425

Closed

oliver-sanders reviewed Feb 3, 2025

cylc/flow/task_job_mgr.py (outdated)

cylc/flow/job_runner_mgr.py (outdated)

Update cylc/flow/task_job_mgr.py

0c110b3

Co-authored-by: Oliver Sanders <oliver.sanders@metoffice.gov.uk>

wxtim force-pushed the fix.handle_deletion_of_job_logs branch from 54afde3 to 0c110b3 Compare February 3, 2025 11:52

set JOB_FILES_REMOVED_MESSAGE

3a9cd6a

wxtim requested a review from oliver-sanders February 3, 2025 15:01

MetRonnie reviewed Feb 11, 2025

changes.d/6577.fix.md (outdated)

cylc/flow/job_runner_mgr.py (outdated)

wxtim requested a review from MetRonnie February 11, 2025 13:25

fix tests

b14c460

wxtim force-pushed the fix.handle_deletion_of_job_logs branch from adeb3c0 to b14c460 Compare February 11, 2025 13:26

MetRonnie reviewed Feb 11, 2025

cylc/flow/job_runner_mgr.py (outdated)

wxtim and others added 2 commits February 11, 2025 16:22

Update cylc/flow/job_runner_mgr.py

3ec745c

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>

Update changes.d/6577.fix.md

7385b58

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>

wxtim requested a review from MetRonnie February 11, 2025 16:22

oliver-sanders reviewed Feb 13, 2025

Simplify poll handling of prematurely deleted job log dir

17156aa

MetRonnie reviewed Feb 14, 2025

MetRonnie mentioned this pull request Feb 14, 2025

handle job log directory deleted for active task #6425

Closed

MetRonnie and others added 3 commits February 14, 2025 16:54

Mypy

aa4312f

Simplify too-flaky test

63fbf9c

Merge pull request #72 from MetRonnie/job-log-dir

749ff2f

Simplify poll handling of prematurely deleted job log dir

Merge branch '8.4.x' into fix.handle_deletion_of_job_logs

0a73217

MetRonnie self-assigned this Feb 17, 2025

MetRonnie approved these changes Feb 17, 2025

MetRonnie requested a review from oliver-sanders February 17, 2025 12:56

oliver-sanders approved these changes Feb 18, 2025

oliver-sanders merged commit 21d18ba into cylc:8.4.x Feb 18, 2025
27 checks passed

wxtim deleted the fix.handle_deletion_of_job_logs branch February 18, 2025 11:01
