batch sys -> job runner #180

Merged (5 commits) on Dec 17, 2020

1 change: 1 addition & 0 deletions .gitignore
@@ -24,3 +24,4 @@ src/appendices/command-ref.rst
# auto-generated documentation
src/user-guide/plugins/main-loop/built-in
src/user-guide/batch-sys-handlers
src/user-guide/job-runner-handlers
1 change: 1 addition & 0 deletions Makefile
@@ -18,6 +18,7 @@ clean:
# remove auto-generated content
rm -rf src/user-guide/plugins/main-loop/built-in
rm -rf src/user-guide/batch-sys-handlers
rm -rf src/user-guide/job-runner-handlers

cleanall:
(cd doc; echo [0-9]*.*)
65 changes: 56 additions & 9 deletions src/glossary.rst
@@ -523,16 +523,29 @@ Glossary
* :term:`job submission number`

job host
The job host is the compute platform that a :term:`job` runs on. For
example ``some-host`` would be the job host for the task ``some-task`` in
the following suite:
The job host is the compute resource that a :term:`job` runs on. For
example ``node_1`` would be one of two possible job hosts on the
:term:`platform` ``my_hpc`` for the task ``some-task`` in the
following workflow:

.. code-block:: cylc
:caption: global.cylc

[platforms]
[[my_hpc]]
hosts = node_1, node_2
job runner = slurm

.. code-block:: cylc
:caption: flow.cylc

[runtime]
[[some-task]]
[[[remote]]]
host = some-host
platform = my_hpc

See also:

* :term:`platform`

job submission number
Cylc may run multiple :term:`jobs <job>` per :term:`task` (e.g. if the
@@ -545,9 +558,13 @@ Glossary
* :term:`job`
* :term:`job script`

job runner
batch system
A batch system or job scheduler is a system for submitting
:term:`jobs <job>` onto a compute platform.
A job runner (also known as batch system or job scheduler) is a system
for submitting :term:`jobs <job>` to a :term:`job platform <platform>`.

Job runners are set on a per-platform basis in
:cylc:conf:`global.cylc[platforms][<platform name>]job runner`.
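
For example, a minimal sketch reusing the ``my_hpc`` platform and ``slurm``
job runner that appear in the examples elsewhere in this glossary:

.. code-block:: cylc
   :caption: global.cylc

   [platforms]
       [[my_hpc]]
           job runner = slurm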

See also:

@@ -556,15 +573,45 @@ Glossary
* :term:`directive`

directive
Directives are used by :term:`batch systems <batch system>` to determine
Directives are used by :term:`job runners <job runner>` to determine
what a :term:`job's <job>` requirements are, e.g. how much memory
it requires.

Directives are set in :cylc:conf:`[runtime][<namespace>][directives]`.
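
For example, a brief sketch of setting directives for a task (the Slurm-style
memory and CPU requests shown here are hypothetical):

.. code-block:: cylc
   :caption: flow.cylc

   [runtime]
       [[some-task]]
           platform = my_hpc
           [[[directives]]]
               # hypothetical Slurm-style resource requests
               --mem = 2G
               --ntasks = 4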

See also:

* :term:`batch system`
* :term:`job runner`

platform
job platform
A configured setup for running :term:`jobs <job>` on (usually remotely).
Platforms are primarily defined by the combination of a
:term:`job runner` and a group of :term:`hosts <job host>`
(which share a file system).

For example ``my_hpc`` could be the platform for the task ``some-task``
in the following workflow:

.. code-block:: cylc
:caption: global.cylc

[platforms]
[[my_hpc]]
hosts = node_1, node_2
job runner = slurm

.. code-block:: cylc
:caption: flow.cylc

[runtime]
[[some-task]]
platform = my_hpc

See also:

* :term:`job host`
* :term:`job runner`

scheduler
When we say that a :term:`suite` is "running" we mean that the cylc
12 changes: 6 additions & 6 deletions src/suite-design-guide/general-principles.rst
@@ -273,7 +273,7 @@ submission until the expected data arrival time:
Clock-triggered tasks typically have to handle late data arrival. Task
execution *retry delays* can be used to simply retrigger the task at
intervals until the data is found, but frequently retrying small tasks probably
should not go to a batch scheduler, and multiple task failures will be logged
should not go to a :term:`job runner`, and multiple task failures will be logged
for what is essentially a normal condition (at least it is normal until the
data is really late).

@@ -300,9 +300,9 @@ so be sure to configure a reasonable interval between polls.
Task Execution Time Limits
--------------------------

Instead of setting job wall clock limits directly in batch scheduler
Instead of setting job wall clock limits directly in :term:`job runner`
directives, use the ``execution time limit`` suite config item.
Cylc automatically derives the correct batch scheduler directives from this,
Cylc automatically derives the correct job runner directives from this,
and it is also used to run ``background`` and ``at`` jobs via
the ``timeout`` command, and to poll tasks that haven't reported as
finished by the configured time limit.
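
For example, a minimal sketch (the task name ``model`` is hypothetical, and the
exact location of the ``execution time limit`` item depends on the Cylc
version):

.. code-block:: cylc

   [runtime]
       [[model]]
           # Cylc derives the appropriate wall clock directive for this
           # task's job runner from the value below.
           execution time limit = PT3H
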
@@ -439,8 +439,8 @@ by the vast majority of tasks. Over-sharing via root, particularly of
environment variables, is a maintenance risk because it can be very
difficult to be sure which tasks are using which global variables.

Any :cylc:conf:`[runtime]` settings can be shared - scripting, host
and batch scheduler configuration, environment variables, and so on - from
Any :cylc:conf:`[runtime]` settings can be shared - scripting, platform
configuration, environment variables, and so on - from
single items up to complete task or app configurations. At the latter extreme,
it is quite common to have several tasks that inherit the same complete
job configuration followed by minor task-specific additions:
@@ -618,7 +618,7 @@ graph:
RUN_LEN = PT12H

The few differences between ``short_fc`` and ``long_fc``,
including batch scheduler resource requests, can be configured after common
including :term:`job runner` resource requests, can be configured after common
settings are inherited.
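
For example, a hedged sketch (the family name ``FC``, the script name, and the
Slurm-style ``--time`` requests are illustrative, not taken from the suite):

.. code-block:: cylc

   [runtime]
       [[FC]]  # settings shared by both forecast tasks
           script = run-forecast.sh
       [[short_fc]]
           inherit = FC
           [[[directives]]]
               --time = 00:30:00
       [[long_fc]]
           inherit = FC
           [[[directives]]]
               --time = 03:00:00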

At Start-Up
2 changes: 2 additions & 0 deletions src/suite-design-guide/portable-suites.rst
@@ -3,6 +3,8 @@
Portable Suites
===============

.. TODO - platformise all the examples in here

A *portable* or *interoperable* suite can run "out of the box" at
different sites, or in different environments such as research and operations
within a site. For convenience we just use the term *site portability*.
1 change: 1 addition & 0 deletions src/suites/inherit/single/one/flow.cylc
@@ -15,6 +15,7 @@
OBS:succeed-all => bar
"""

# TODO: platformise
[runtime]
[[root]] # base namespace for all tasks (defines suite-wide defaults)
[[[job]]]
4 changes: 2 additions & 2 deletions src/tutorial/runtime/introduction.rst
@@ -136,9 +136,9 @@ Tasks And Jobs
Submitted
When a :term:`task's <task>` dependencies have been met it is ready for
submission. During this phase the :term:`job script` is created.
The :term:`job` is then submitted to the specified batch system.
The :term:`job` is then submitted to the specified :term:`job runner`.
There is more about this in the :ref:`next section
<tutorial-batch-system>`.
<tutorial-job-runner>`.
Running
A :term:`task` is in the "Running" state as soon as the :term:`job` is
executed.
19 changes: 10 additions & 9 deletions src/tutorial/runtime/runtime-configuration.rst
@@ -3,6 +3,8 @@
Runtime Configuration
=====================

.. TODO - platformise all the examples in here

In the last section we associated tasks with scripts and ran a simple suite. In
this section we will look at how we can configure these tasks.

@@ -48,7 +50,7 @@ Environment Variables
* ``CYLC_TASK_CYCLE_POINT``


.. _tutorial-batch-system:
.. _tutorial-job-runner:

Job Submission
--------------
@@ -77,8 +79,8 @@ Job Submission

Cylc also executes jobs as `background processes`_ by default.
When we are running jobs on other compute hosts we will often want to
use a :term:`batch system` (`job scheduler`_) to submit our job.
Cylc supports the following :term:`batch systems <batch system>`:
use a :term:`job runner` to submit our job.
Cylc supports the following :term:`job runners <job runner>`:

* at
* loadleveler
@@ -92,9 +94,9 @@

.. ifnotslides::

:term:`Batch systems <batch system>` typically require
:term:`Job runners <job runner>` typically require
:term:`directives <directive>` in some form. :term:`Directives <directive>`
inform the :term:`batch system` of the requirements of a :term:`job`, for
inform the job runner of the requirements of a :term:`job`, for
example how much memory a given job requires or how many CPUs the job will
run on. For example:

@@ -108,7 +110,7 @@ Job Submission
[[[remote]]]
host = big-computer

# Submit the job using the "slurm" batch system.
# Submit the job using the "slurm" job runner.
[[[job]]]
batch system = slurm

@@ -196,7 +198,7 @@ Start, Stop, Restart
``cylc stop --kill``
When the ``--kill`` option is used Cylc will kill all running jobs
before stopping. *Cylc can kill jobs on remote hosts and uses the
appropriate command when a* :term:`batch system` *is used.*
appropriate command when a* :term:`job runner` *is used.*
``cylc stop --now --now``
When the ``--now`` option is used twice Cylc stops straight away, leaving
any jobs running.
@@ -286,7 +288,7 @@ Start, Stop, Restart

Run `cylc validate` to check for any errors::

cylc validate .
cylc validate .

#. **Add Runtime Configuration For The** ``get_observations`` **Tasks.**

@@ -492,4 +494,3 @@ Start, Stop, Restart
i.e. the final cycle point.
* ``task-name`` - set this to "forecast".
* ``submission-number`` - set this to "01".

26 changes: 13 additions & 13 deletions src/user-guide/remote-job-management.rst
@@ -11,9 +11,9 @@ SSH-free Job Management?

Some sites may want to restrict access to job hosts by whitelisting SSH
connections to allow only ``rsync`` for file transfer, and allowing job
execution only via a local batch system that sees the job hosts [1]_ .
execution only via a local :term:`job runner` that sees the job hosts [1]_ .
We are investigating the feasibility of SSH-free job management when a local
batch system is available, but this is not yet possible unless your suite
job runner is available, but this is not yet possible unless your suite
and job hosts also share a filesystem, which allows Cylc to treat jobs as
entirely local [2]_ .

@@ -25,14 +25,14 @@ Cylc does not have persistent agent processes running on job hosts to act on
instructions received over the network [3]_ so instead we execute job
management commands directly on job hosts over SSH. Reasons for this include:

- It works equally for batch system and background jobs.
- SSH is *required* for background jobs, and for batch jobs if the
batch system is not available on the suite host.
- Querying the batch system alone is not sufficient for full job
- It works equally for :term:`job runner` and background jobs.
- SSH is *required* for background jobs, and for jobs in other job runners if the
job runner is not available on the suite host.
- Querying the job runner alone is not sufficient for full job
polling functionality.

- This is because jobs can complete (and then be forgotten by
the batch system) while the network, suite host, or :term:`scheduler` is
the job runner) while the network, suite host, or :term:`scheduler` is
down (e.g. between suite shutdown and restart).
- To handle this we get the automatic job wrapper code to write
job messages and exit status to *job status files* that are
@@ -41,7 +41,7 @@ management commands directly on job hosts over SSH. Reasons for this include:
- Job status files reside on the job host, so the interrogation
is done over SSH.

- Job status files also hold batch system name and job ID; this is
- Job status files also hold job runner name and job ID; this is
written by the job submit command, and read by job poll and kill commands


@@ -56,10 +56,10 @@ Other Cases Where Cylc Uses SSH Directly


.. [1] A malicious script could be ``rsync``'d and run from a batch
job, but batch jobs are considered easier to audit.
job, but jobs in job runners are considered easier to audit.
.. [2] The job ID must also be valid to query and kill the job via the local
batch system. This is not the case for Slurm, unless the ``--cluster``
option is explicitly used in job query and kill commands, otherwise
the job ID is not recognized by the local Slurm instance.
:term:`job runner`. This is not the case for Slurm, unless the
``--cluster`` option is explicitly used in job query and kill commands,
otherwise the job ID is not recognized by the local Slurm instance.
.. [3] This would be a more complex solution, in terms of implementation,
administration, and security.
22 changes: 12 additions & 10 deletions src/user-guide/running-suites.rst
@@ -3,6 +3,8 @@
Running Suites
==============

.. TODO - platformise

This chapter currently features a diverse collection of topics related
to running suites.

@@ -203,7 +205,7 @@ not automatically resubmitted at restart in case the underlying problem has not
been addressed yet.

Tasks recorded in the submitted or running states are automatically polled on
restart, to see if they are still waiting in a batch queue, still running, or
restart, to see if they are still waiting in a :term:`job runner` queue, still running, or
if they succeeded or failed while the suite was down. The suite state will be
updated automatically according to the poll results.

@@ -256,9 +258,9 @@ Authentication Files
Cylc uses `CurveZMQ <http://curvezmq.org/page:read-the-docs/>`_ to ensure that
any data, sent between the :term:`scheduler <scheduler>` and the client,
remains protected during transmission. Public keys are used to encrypt the
data, private keys for decryption.
data, private keys for decryption.

Authentication files will be created in your
Authentication files will be created in your
``$HOME/cylc-run/WORKFLOW/.service/`` directory at start-up. You can expect to
find one client public key per file system for remote jobs.

@@ -304,7 +306,7 @@ outage prevents task success or failure messages getting through, or if the
:term:`scheduler` itself is down when tasks finish execution.

To poll a task job the :term:`scheduler` interrogates the
batch system, and the ``job.status`` file, on the job host. This
:term:`job runner`, and the ``job.status`` file, on the job host. This
information is enough to determine the final task status even if the
job finished while the :term:`scheduler` was down or unreachable on
the network.
@@ -457,7 +459,7 @@ As a suite runs, its task proxies may pass through the following states:
- **ready** - ready to run (prerequisites satisfied) and
handed to cylc's job submission sub-system.
- **submitted** - submitted to run, but not executing yet
(could be waiting in an external batch scheduler queue).
(could be waiting in an external :term:`job runner` queue).
- **submit-failed** - job submission failed *or*
submitted job killed (cancelled) before commencing execution.
- **submit-retrying** - job submission failed, but a submission retry
@@ -838,11 +840,11 @@ started running, and they still appear in the resource manager queue).
Loadleveler jobs that are preempted by kill-and-requeue ("job vacation") are
automatically returned to the submitted state by Cylc. This is possible
because Loadleveler sends the SIGUSR1 signal before SIGKILL for preemption.
Other batch schedulers just send SIGTERM before SIGKILL as normal, so Cylc
Other :term:`job runners <job runner>` just send SIGTERM before SIGKILL as normal, so Cylc
cannot distinguish a preemption job kill from a normal job kill. After this the
job will poll as failed (correctly, because it was killed, and the job status
file records that). To handle this kind of preemption automatically you could
use a task failed or retry event handler that queries the batch scheduler queue
use a task failed or retry event handler that queries the job runner queue
(after an appropriate delay if necessary) and then, if the job has been
requeued, uses ``cylc reset`` to reset the task to the submitted state.

@@ -1052,10 +1054,10 @@ run lengths.
Limitations Of Suite Simulation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Dummy mode ignores batch scheduler settings because Cylc does not know which
Dummy mode ignores :term:`job runner` settings because Cylc does not know which
job resource directives (requested memory, number of compute nodes, etc.) would
need to be changed for the dummy jobs. If you need to dummy-run jobs on a
batch scheduler manually comment out ``script`` items and modify
job runner, manually comment out ``script`` items and modify
directives in your live suite, or else use a custom live mode test suite.

.. note::
@@ -1108,7 +1110,7 @@ a cylc upgrade will not break your own complex
suites - the triggering check will catch any bug that causes a task to
run when it shouldn't, for instance; even in a dummy mode reference
test the full task job script (sans ``script`` items) executes on the
proper task host by the proper batch system.
proper task host by the proper :term:`job runner`.

Reference tests can be configured with the following settings:
