Get poll to return task failure if job/log has been removed. #6577
Conversation
tests/unit/test_job_runner_mgr.py
Only test__job_poll_status_files_deleted_logdir is directly related to the PR. The other tests should increase coverage. 😄
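For illustration, a minimal sketch of the kind of test being discussed, assuming a pytest style and that `JobRunnerManager._jobs_poll_status_files` takes the job log root plus the job's relative log dir (the signature, fixture use, and returned context here are assumptions, not the PR's actual test):

```python
from pathlib import Path

from cylc.flow.job_runner_mgr import JobRunnerManager


def test__job_poll_status_files_deleted_logdir(tmp_path: Path):
    """Polling a job whose log dir has been deleted reports run failure.

    Sketch only: the exact signature and return value are assumptions.
    """
    # Point the poll at a job log dir that was never created, which is
    # equivalent to one deleted from under a live job.
    ctx = JobRunnerManager()._jobs_poll_status_files(
        str(tmp_path), '1/task/01')
    # Expected behaviour from this PR: a definite failure, not a retry.
    assert ctx.run_status == 1
```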
Force-pushed from 2661a36 to 01c7cb8.
added unit tests for JobRunnerMgr._jobs_poll_status_files; test the task_job_mgr end
Force-pushed from 01c7cb8 to 31fd08f.
LGTM.
Note to reviewers: you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems.
Co-authored-by: Oliver Sanders <oliver.sanders@metoffice.gov.uk>
Force-pushed from 54afde3 to 0c110b3.
Force-pushed from adeb3c0 to b14c460.
Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
From the original issue:
This does not seem to be true. It is only true if the job log dir and the contact file are removed.
The job can succeed even if its job log dir is removed from under its feet. There seems to be a problem here where the job log retrieval process keeps retrying, preventing shutdown:

```ini
[runtime]
    [[task]]
        script = """
            rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
        """
        platform = <remote PBS>
        execution time limit = PT1M
        [[[directives]]]
            -q = shared
            -l ncpus = 1
            -l mem = 100mb
```
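For context, polling reads the `job.status` file that Cylc keeps inside the job log dir, so deleting that dir removes the poll's source of truth. A minimal sketch of the idea (the helper below is illustrative, not Cylc's real code; the `CYLC_JOB_EXIT` line is the kind of marker the status file records):

```python
from pathlib import Path


def poll_job(job_log_dir: Path) -> str:
    """Illustrative poll: derive a coarse job state from job.status.

    Hypothetical helper, not the cylc-flow implementation.
    """
    status_file = job_log_dir / 'job.status'
    if not status_file.exists():
        # The log dir (or status file) is gone: before this PR the
        # outcome stays unknown and retrieval keeps retrying; with it,
        # the poll reports a task failure instead.
        return 'failed'
    content = status_file.read_text()
    if 'CYLC_JOB_EXIT=SUCCEEDED' in content:
        return 'succeeded'
    if 'CYLC_JOB_EXIT' in content:
        return 'failed'
    return 'running'
```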
Hmmm, we've seen this with PBS several times. If you delete the job dir for a submitted task, it cannot run (no job script), so that'll definitely do it. For a running job, not so sure.
Possibly. But it will likely fail due to IO error and we cannot guarantee that Cylc will be informed of the job's outcome. ... From testing, it looks like
- Code makes sense.
- Tested against the example in the OP, works well, error message clearly logged.
We should talk this one over in tomorrow's meeting.
Another example:
This task "fails successfully" (this PR as it currently stands makes no difference) but then the workflow hangs on shutdown after
Edit: the length of the hang depends on the
Discussed today:
I have some suggestions at wxtim#72 (Simplify poll handling of prematurely deleted job log dir).
I'm happy with your suggestions. You wield your scalpel with a confident hand. I like it.
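For a sense of the shape of that simplification (a hypothetical sketch only; the constant name, return type, and early-return structure are assumptions, not the code from wxtim#72):

```python
from pathlib import Path

# Hypothetical marker used to explain the failure in the workflow log.
JOB_FILES_REMOVED = 'job log directory removed'


def _jobs_poll_status_files(job_log_root: str, job_log_dir: str) -> dict:
    """Illustrative poll handler: fail fast if the job log dir is gone."""
    ctx = {'job_id': job_log_dir, 'run_status': None, 'run_signal': None}
    if not Path(job_log_root, job_log_dir).is_dir():
        # Deleted log dir: report a definite failure so the scheduler
        # stops re-polling and can shut down cleanly.
        ctx['run_status'] = 1
        ctx['run_signal'] = JOB_FILES_REMOVED
        return ctx
    # ...otherwise read job.status and the job runner's own records
    # as normal (omitted in this sketch).
    return ctx
```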
Closes #6425
Note to reviewers: you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems.
Check List
- I have read CONTRIBUTING.md and added my name as a Code Contributor.
- Dependency changes (if any) are applied to setup.cfg (and conda-environment.yml if present).
- If this is a bug fix, it targets the relevant ?.?.x branch.