Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: execute user job scripts in subshell #2885

Closed
wants to merge 1 commit into from

Conversation

TomekTrzeciak
Copy link
Contributor

Add execution of user job scripts in subshell to isolate Cylc job script from environment changes in user scripts.

@hjoliver
Copy link
Member

The trouble with this is, we occasionally want to deliberately modify the cylc environment. The test battery is failing, for instance (well, maybe this is the only use case) because job scripts in some tests unset the EXIT trap in order to exit early without triggering a task failed message.

@matthewrmshin
Copy link
Contributor

Correctly doing this should help address #2043. But we need to considerate a few things:

  • In the general case, we want to run the main scripts in a sub-shell with only set -eu. It will exit with a non-zero code on error. The root shell can pick that up and handle messaging, etc.
  • We need mechanism to handle specific user aborts, etc.
  • We need mechanism to handle batch system signals - do we get signal for the process group or do we get signal for only the root shell?

@hjoliver
Copy link
Member

hjoliver commented Nov 27, 2018

(
From #2043 -

If TERM is not trapped then the script exits immediately (and executes any EXIT trap).
If TERM is trapped then the trap is not executed until any child process completes.

This (2nd point) had me scratching my head for a bit while testing the new exit-script PR. It looked like TERM was not being trapped at all and I couldn't see why, but it was just that my test job was a single long sleep ... which had to complete before the trap was executed.
)

@TomekTrzeciak
Copy link
Contributor Author

@hjoliver, @matthewrmshin, feel free to close this PR if you think it's not appropriate. I just thought the current setup, where the user code can trample over Cylc environment with no restrictions is not very robust.

@hjoliver
Copy link
Member

@TomekTrzeciak - the subshell is better in that respect, we just need to make sure we can handle all use cases...

@hjoliver
Copy link
Member

where the user code can trample over Cylc environment with no restrictions is not very robust.

I agree in principle, but we do deliberately prefix all Cylc environment variables with CYLC_... so I'm not sure what a user would have to be doing to trample all over those. Perhaps other things, like messing with signal traps, could cause trouble though. Do you have examples of problems caused by this?

@TomekTrzeciak
Copy link
Contributor Author

An example from the past was using module load ... that swapped Python interpreter and interfered with Cylc. This was solved on our site by changing shebang line to #!/usr/bin/python2 for Cylc utils, which currently points at a pretty old Python 2.6. But what if you want to execute a sub-suite with a newer interpreter (e.g., because your Jinja2 filter uses packages that require it).

As you mentioned, you may want to set your own traps for signals. There are also system environment variables that can muck around with your execution environment in unpredictable ways (like LD_PRELOAD). Allowing user code to run without separation from Cylc code just makes things fragile and limits what you can do in your job script.

@hjoliver
Copy link
Member

@TomekTrzeciak - fair enough (I've seen the module load problem too actually). So we definitely need subshell execution of user scripting. We just need to work through the complications described above, first.

@hjoliver
Copy link
Member

@matthewrmshin - you probably have the best understanding of the complications you refer to above #2885 (comment) do you think they can be addressed easily enough now, or should we close this PR for now and reference it in a new Issue for later development?

@matthewrmshin
Copy link
Contributor

matthewrmshin commented Jan 10, 2019

Ref #2885 (comment) Bullet point 1 and 3 should be easy, but require some careful testing. Bullet point 2 is tricky to do correctly (need to trap user abort only once).

@TomekTrzeciak Do you want to continue to pursue this PR? (I can talk you through the issues.) Or do you want to bury this for now?

@TomekTrzeciak
Copy link
Contributor Author

@matthewrmshin I'm still interested in this, but depends a bit on how much work it would require. As for bullet point 2 from your list (user aborts), how about combining it with point 3 (batch system signals) in the following way:

  1. Allow cylc kill --signal [SIG] REG TASKID... to send a signal SIG to the batch job for batch systems that support it (many do), otherwise the command should fail. With SIG omitted, send a default signal for that batch system or TERM
  2. Inside the job, allow cylc kill [--signal] [SIG] (no arguments) to pick up REG and TASKID info from the environment, so that it can be used to abort the job. All batch systems support killing jobs, but not all support sending arbitrary signals. In those cases I think the simplest option is to have an environment variable with PID of the controlling script and let the user address these situations (e.g. work out PPID or SID to send the signal to).

@hjoliver hjoliver changed the title execute user job scripts in subshell WIP: execute user job scripts in subshell Feb 21, 2019
@matthewrmshin
Copy link
Contributor

@TomekTrzeciak I'll close this PR down for now. Please raise a new PR if you want to continue with this work.

@matthewrmshin matthewrmshin removed this from the soon milestone Mar 11, 2019
@hjoliver
Copy link
Member

@matthewrmshin @TomekTrzeciak - I think we generally agree that the intent of this PR is desirable, so probably should open a new issue as a placeholder for post-cylc-8 work (unless @TomekTrzeciak can do it now, considering the complications we raised above)?

@matthewrmshin
Copy link
Contributor

Can probably add to #2043.

@hjoliver
Copy link
Member

Done. (Added to #2043)

@TomekTrzeciak
Copy link
Contributor Author

@hjoliver, @matthewrmshin, I'm unlikely to find the time for it this month. I could implement something that I think is right behaviour quite quickly, but getting all the testing and documentation sorted out at the same time to get this PR over the line would take me too long.

@hjoliver
Copy link
Member

@TomekTrzeciak - no worries, just flagging that we're up for it if you do happen to have the time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants