Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get poll to return task failure if job/log has been removed. #6577

Merged
merged 11 commits into from
Feb 18, 2025

Conversation

wxtim
Copy link
Member

@wxtim wxtim commented Jan 28, 2025

Closes #6425

Note

Note to reviewers, you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems.

Finally have a replicable example (Thank you @oliver-sanders)

[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task

[runtime]
    [[task]]
        script = """
            rm ${CYLC_WORKFLOW_RUN_DIR}/.service/contact
            rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
        """
        platform = _remote_pbs

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@wxtim wxtim self-assigned this Jan 28, 2025
@wxtim wxtim marked this pull request as draft January 28, 2025 16:17
@wxtim wxtim added this to the 8.4.1 milestone Jan 28, 2025
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only test__job_poll_status_files_deleted_logdir is directly related to the PR. Other tests should increase coverage. 😄

@wxtim wxtim force-pushed the fix.handle_deletion_of_job_logs branch 2 times, most recently from 2661a36 to 01c7cb8 Compare January 28, 2025 17:30
@wxtim wxtim requested review from MetRonnie and oliver-sanders and removed request for MetRonnie January 28, 2025 17:30
@wxtim wxtim marked this pull request as ready for review January 28, 2025 17:30
@wxtim wxtim marked this pull request as draft January 29, 2025 09:24
added unit tests for JobRunnerMgr._jobs_poll_status_files

test the task_job_mgr end
@wxtim wxtim force-pushed the fix.handle_deletion_of_job_logs branch from 01c7cb8 to 31fd08f Compare January 29, 2025 09:27
@wxtim wxtim requested a review from MetRonnie January 29, 2025 11:04
@wxtim wxtim marked this pull request as ready for review January 29, 2025 11:04
@wxtim wxtim linked an issue Jan 29, 2025 that may be closed by this pull request
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@oliver-sanders
Copy link
Member

Note to reviewers, you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems.

Co-authored-by: Oliver Sanders <oliver.sanders@metoffice.gov.uk>
@wxtim wxtim force-pushed the fix.handle_deletion_of_job_logs branch from 54afde3 to 0c110b3 Compare February 3, 2025 11:52
@wxtim wxtim requested a review from oliver-sanders February 3, 2025 15:01
@wxtim wxtim requested a review from MetRonnie February 11, 2025 13:25
@wxtim wxtim force-pushed the fix.handle_deletion_of_job_logs branch from adeb3c0 to b14c460 Compare February 11, 2025 13:26
wxtim and others added 2 commits February 11, 2025 16:22
Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
@wxtim wxtim requested a review from MetRonnie February 11, 2025 16:22
@MetRonnie
Copy link
Member

MetRonnie commented Feb 13, 2025

From the original issue:

If you delete the job log directory for an active task, Cylc will preserve its last known status indefinitely. I.e, Cylc will consider the job to be submitted/running forever.

This does not seem to be true. It is only true if the job log dir and the contact file is removed.

This situation should be handled similarly to the job no longer appearing in the queue, i.e, the job is dead, long live the job. Stick it into the failed/submit-failed state as appropriate.

The job can succeed even if its job log dir is removed from under its feet.

There seems to be a problem here where the job log retrieval process keeps retrying, preventing shutdown without the --now --now option.

[runtime]
    [[task]]
        script = """
            rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
        """
        platform = <remote PBS>
        execution time limit = PT1M
        [[[directives]]]
            -q = shared
            -l ncpus = 1
            -l mem = 100mb

@oliver-sanders
Copy link
Member

oliver-sanders commented Feb 13, 2025

This does not seem to be true. It is only true if the job log dir and the contact file is removed.

Hmmm, we've seen this with PBS several times.

If you delete to job dir for a submitted task, it cannot run (no job script), so that'll definitely do it.

For a running job, not so sure.

The job can succeed even if its job log dir is removed from under its feet.

Possibly. But it will likely fail due to IO error and we cannot guarantee that Cylc will be informed of the job's outcome.

... From testing, it looks like echo foo doesn't cause the job to fail due to the missing file which makes sense for PBS as it "spools" the job output files in a temp dir so you have to try harder to delete them.

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Code makes sense.
  • Tested against the example in the OP, works well, error message clearly logged.

We should talk this one over in tomorrow's meeting.

@MetRonnie
Copy link
Member

MetRonnie commented Feb 13, 2025

Another example:

rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
sleep 10
exit 1

This task "fails successfully" (this PR as it currently stands makes no difference) but then the workflow hangs on shutdown after cylc stop.

Edit: the length of the hang depends on the global.cylc[platforms][<name>]retrieve job logs retry delays (it does not hang after the final retry). However, it is unclear to the user what is going on.

@MetRonnie
Copy link
Member

MetRonnie commented Feb 14, 2025

Discussed today:

  • Although the job may yet succeed even if the job log dir is deleted, we have decided that it is best to put in the failed state as this is the best we can do if we can't poll anymore.
  • We will leave the job log retrieval as it is, as it's user error to delete the job log dir prematurely they will have to suffer the consequences

Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some suggestions at wxtim#72

MetRonnie and others added 3 commits February 14, 2025 16:54
Simplify poll handling of prematurely deleted job log dir
@wxtim
Copy link
Member Author

wxtim commented Feb 17, 2025

I have some suggestions at wxtim#72

I'm happy with your suggestions. You wield your scapel with a confident hand. I like it.

@MetRonnie MetRonnie self-assigned this Feb 17, 2025
@oliver-sanders oliver-sanders merged commit 21d18ba into cylc:8.4.x Feb 18, 2025
27 checks passed
@wxtim wxtim deleted the fix.handle_deletion_of_job_logs branch February 18, 2025 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

handle job log directory deleted for active task
3 participants