Fix NN for remote jobs #1089
Add tests to ensure NN behaves correctly up to a submit number of 100.
(Not yet ready for merge, as I cannot test this until I am at work tomorrow morning.)
@hjoliver please review. N.B. the following are failing on master and on this branch, and I believe they are independent of this fix:
@@ -40,6 +42,7 @@ class background( JobSubmit ):
"; " +
# Retry "mkdir" once to avoid race to create log/job/CYCLE/
" (mkdir -p %(jobfile_dir)s || mkdir -p %(jobfile_dir)s)" +
" && ln -fs $(basename %(jobfile_dir)s) $(dirname %(jobfile_dir)s)/NN"
According to my tests, this does not have the intended effect after the first submit, e.g. for a task that tried 4 times:
$ ls -l junk/log/job/1/foo/
total 320
drwxr-xr-x 2 oliverh niwa-users 32768 Aug 17 21:47 01
drwxr-xr-x 2 oliverh niwa-users 32768 Aug 17 21:22 02
drwxr-xr-x 2 oliverh niwa-users 32768 Aug 17 21:22 03
drwxr-xr-x 2 oliverh niwa-users 32768 Aug 17 21:43 04
lrwxrwxrwx 1 oliverh niwa-users 2 Aug 17 21:45 NN -> 01
and
$ ls -l junk/log/job/1/foo/NN/
total 320
lrwxrwxrwx 1 oliverh niwa-users 2 Aug 17 21:46 02 -> 02
lrwxrwxrwx 1 oliverh niwa-users 2 Aug 17 21:47 03 -> 03
lrwxrwxrwx 1 oliverh niwa-users 2 Aug 17 21:47 04 -> 04
-rwxr-xr-x 1 oliverh niwa-users 3864 Aug 17 21:30 job
-rw-r--r-- 1 oliverh niwa-users 217 Aug 17 21:30 job.err
-rw-r--r-- 1 oliverh niwa-users 324 Aug 17 21:30 job.out
-rw-r--r-- 1 oliverh niwa-users 115 Aug 17 21:30 job.status
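The listing above is consistent with how GNU `ln` treats an existing destination that is a symlink to a directory: it follows `NN` into `01` and creates the new link inside it, which would explain `NN/02 -> 02`. A minimal Python 3 sketch of one way to avoid that follows. The helper name `update_nn_link` is hypothetical and this is not code from this branch; it builds the link under a temporary name and renames it over `NN`, so the old link is replaced rather than followed.

```python
import os
import tempfile


def update_nn_link(task_log_dir, submit_dir_name, link_name="NN"):
    """Point task_log_dir/NN at submit_dir_name, replacing any old link.

    Hypothetical helper for illustration only.
    """
    link_path = os.path.join(task_log_dir, link_name)
    tmp_path = "%s.tmp.%d" % (link_path, os.getpid())
    # Clear a stale temporary link left by an interrupted earlier run.
    if os.path.lexists(tmp_path):
        os.unlink(tmp_path)
    # Relative target, like $(basename ...) in the shell command.
    os.symlink(submit_dir_name, tmp_path)
    # rename(2) acts on the link itself, so an existing NN that points
    # at a directory is replaced instead of being descended into.
    os.replace(tmp_path, link_path)
    return link_path


# Reproduce the four-submit case from the listing above.
root = tempfile.mkdtemp()
for submit_num in range(1, 5):
    name = "%02d" % submit_num
    os.mkdir(os.path.join(root, name))
    update_nn_link(root, name)

assert os.readlink(os.path.join(root, "NN")) == "04"
# No stray links were created inside the first submit directory.
assert os.listdir(os.path.join(root, "01")) == []
```

On the shell side, GNU `ln` has `-n` (`--no-dereference`) for the same purpose; whether that option is available on every job host would need checking.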
Ouch. Is this on a shared file system between the suite host and the job host?
(no it wasn't, but I should check that too...)
Problem should now be fixed. New test added.
As commented above, my original problem did not involve a shared filesystem ... maybe a different or buggy version of
Also add tests to ensure NN behaves correctly up to a submit number of 100. (For some reason, log/job/${CYCLE}/${TASK}/NN was not being created for remote jobs when I submitted #1069. I am certain I tested this sort of thing at various points while I maintained the branch, but that might have been lost after so many merge and re-base operations.)
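As a rough illustration of why submit number 100 is a boundary worth testing: assuming the submit directories are named with a two-digit zero-padded format (as `01` to `04` in the listing suggest, though the exact format string is an assumption here), the name grows to three characters at 100, and any logic that compares names as strings will rank `99` above `100`. The helper names below are hypothetical.

```python
def submit_dir_name(submit_num):
    """Zero-pad to at least two digits (assumed format): 1 -> "01"."""
    return "%02d" % submit_num


def latest_submit_dir(names):
    """Return the entry with the highest submit number, compared as ints.

    Non-numeric entries such as the NN link itself are ignored.
    """
    numbered = [name for name in names if name.isdigit()]
    return max(numbered, key=int) if numbered else None


names = [submit_dir_name(i) for i in range(1, 101)] + ["NN"]

# Plain string comparison picks the wrong directory once 100 exists.
assert max(name for name in names if name.isdigit()) == "99"
# Numeric comparison picks the expected one.
assert latest_submit_dir(names) == "100"
```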