feat(controller): Workflow-level retryStrategy/resubmit pending pods by default. Closes #3918 #3965

Merged 17 commits on Sep 21, 2020
12 changes: 8 additions & 4 deletions api/openapi-spec/swagger.json
Original file line number Diff line number Diff line change
Expand Up @@ -3769,10 +3769,6 @@
"description": "Resource template subtype which can run k8s resources",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.ResourceTemplate"
},
"resubmitPendingPods": {
"description": "ResubmitPendingPods is a flag to enable resubmitting pods that remain Pending after initial submission",
"type": "boolean"
},
"retryStrategy": {
"description": "RetryStrategy describes how to retry a template when it fails",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.RetryStrategy"
Expand Down Expand Up @@ -4412,6 +4408,10 @@
"type": "integer",
"format": "int32"
},
"retryStrategy": {
"description": "RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.RetryStrategy"
},
"schedulerName": {
"description": "Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.",
"type": "string"
Expand Down Expand Up @@ -4867,6 +4867,10 @@
"type": "integer",
"format": "int32"
},
"retryStrategy": {
"description": "RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.",
"$ref": "#/definitions/io.argoproj.workflow.v1alpha1.RetryStrategy"
},
"schedulerName": {
"description": "Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.",
"type": "string"
Expand Down
123 changes: 68 additions & 55 deletions docs/fields.md
Original file line number Diff line number Diff line change
Expand Up @@ -635,6 +635,7 @@ WorkflowSpec is the specification of a Workflow.
|`podPriorityClassName`|`string`|PriorityClassName to apply to workflow pods.|
|`podSpecPatch`|`string`|PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits).|
|`priority`|`int32`|Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first.|
|`retryStrategy`|[`RetryStrategy`](#retrystrategy)|RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.|
|`schedulerName`|`string`|Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.|
|`securityContext`|[`PodSecurityContext`](#podsecuritycontext)|SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field.|
|`serviceAccountName`|`string`|ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as.|
Expand Down Expand Up @@ -1287,6 +1288,7 @@ WorkflowTemplateSpec is a spec of WorkflowTemplate.
|`podPriorityClassName`|`string`|PriorityClassName to apply to workflow pods.|
|`podSpecPatch`|`string`|PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits).|
|`priority`|`int32`|Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first.|
|`retryStrategy`|[`RetryStrategy`](#retrystrategy)|RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1.|
|`schedulerName`|`string`|Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified.|
|`securityContext`|[`PodSecurityContext`](#podsecuritycontext)|SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field.|
|`serviceAccountName`|`string`|ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as.|
Expand Down Expand Up @@ -1522,6 +1524,40 @@ PodGC describes how to delete completed pods as they complete
|:----------:|:----------:|---------------|
|`strategy`|`string`|Strategy is the strategy to use. One of "OnPodCompletion", "OnPodSuccess", "OnWorkflowCompletion", "OnWorkflowSuccess"|

## RetryStrategy

RetryStrategy provides controls on how to retry a workflow step

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`clustertemplates.yaml`](https://github.com/argoproj/argo/blob/master/examples/cluster-workflow-template/clustertemplates.yaml)

- [`dag-disable-failFast.yaml`](https://github.com/argoproj/argo/blob/master/examples/dag-disable-failFast.yaml)

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)

- [`retry-container-to-completion.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container-to-completion.yaml)

- [`retry-container.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container.yaml)

- [`retry-on-error.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-on-error.yaml)

- [`retry-script.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-script.yaml)

- [`retry-with-steps.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-with-steps.yaml)

- [`templates.yaml`](https://github.com/argoproj/argo/blob/master/examples/workflow-template/templates.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`backoff`|[`Backoff`](#backoff)|Backoff is a backoff strategy|
|`limit`|[`IntOrString`](#intorstring)|Limit is the maximum number of attempts when retrying a container|
|`retryPolicy`|`string`|RetryPolicy is a policy of NodePhase statuses that will be retried|
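
For illustration, a `retryStrategy` fragment that sets all three fields might look like the sketch below. The values are arbitrary and chosen only for this example; they are not recommended defaults.

```yaml
retryStrategy:
  limit: 3                # maximum number of retry attempts
  retryPolicy: OnError    # which node outcomes are retried (here: errored nodes)
  backoff:
    duration: "10"        # amount to back off; a bare number is read as seconds
    factor: 2             # multiplies the base duration after each failed retry
    maxDuration: "5m"     # maximum time allowed for the backoff strategy
```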

## Synchronization

Synchronization holds synchronization lock configuration
Expand Down Expand Up @@ -1844,7 +1880,6 @@ Template is a reusable and composable unit of execution in a workflow
|`priority`|`int32`|Priority to apply to workflow pods.|
|`priorityClassName`|`string`|PriorityClassName to apply to workflow pods.|
|`resource`|[`ResourceTemplate`](#resourcetemplate)|Resource template subtype which can run k8s resources|
|`resubmitPendingPods`|`boolean`|ResubmitPendingPods is a flag to enable resubmitting pods that remain Pending after initial submission|
|`retryStrategy`|[`RetryStrategy`](#retrystrategy)|RetryStrategy describes how to retry a template when it fails|
|`schedulerName`|`string`|If specified, the pod will be dispatched by specified scheduler. Or it will be dispatched by workflow scope scheduler if specified. If neither specified, the pod will be dispatched by default scheduler.|
|`script`|[`ScriptTemplate`](#scripttemplate)|Script runs a portion of code against an interpreter|
Expand Down Expand Up @@ -2305,6 +2340,24 @@ Prometheus is a prometheus metric to be emitted
|`name`|`string`|Name is the name of the metric|
|`when`|`string`|When is a conditional statement that decides when to emit the metric|

## Backoff

Backoff is a backoff strategy to use within retryStrategy

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`duration`|`string`|Duration is the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")|
|`factor`|[`IntOrString`](#intorstring)|Factor is a factor to multiply the base duration after each failed retry|
|`maxDuration`|`string`|MaxDuration is the maximum amount of time allowed for the backoff strategy|

## Mutex

Mutex holds Mutex configuration
Expand Down Expand Up @@ -2956,40 +3009,6 @@ ResourceTemplate is a template subtype to manipulate kubernetes resources
|`setOwnerReference`|`boolean`|SetOwnerReference sets the reference to the workflow on the OwnerReference of generated resource.|
|`successCondition`|`string`|SuccessCondition is a label selector expression which describes the conditions of the k8s resource in which it is acceptable to proceed to the following step|

## RetryStrategy

RetryStrategy provides controls on how to retry a workflow step

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`clustertemplates.yaml`](https://github.com/argoproj/argo/blob/master/examples/cluster-workflow-template/clustertemplates.yaml)

- [`dag-disable-failFast.yaml`](https://github.com/argoproj/argo/blob/master/examples/dag-disable-failFast.yaml)

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)

- [`retry-container-to-completion.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container-to-completion.yaml)

- [`retry-container.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container.yaml)

- [`retry-on-error.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-on-error.yaml)

- [`retry-script.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-script.yaml)

- [`retry-with-steps.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-with-steps.yaml)

- [`templates.yaml`](https://github.com/argoproj/argo/blob/master/examples/workflow-template/templates.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`backoff`|[`Backoff`](#backoff)|Backoff is a backoff strategy|
|`limit`|[`IntOrString`](#intorstring)|Limit is the maximum number of attempts when retrying a container|
|`retryPolicy`|`string`|RetryPolicy is a policy of NodePhase statuses that will be retried|

## ScriptTemplate

ScriptTemplate is a template subtype to enable scripting through code steps
Expand Down Expand Up @@ -3739,24 +3758,6 @@ _No description available_
|:----------:|:----------:|---------------|
|`configMap`|[`ConfigMapKeySelector`](#configmapkeyselector)|_No description available_|

## Backoff

Backoff is a backoff strategy to use within retryStrategy

<details>
<summary>Examples with this field (click to open)</summary>
<br>

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)
</details>

### Fields
| Field Name | Field Type | Description |
|:----------:|:----------:|---------------|
|`duration`|`string`|Duration is the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")|
|`factor`|[`IntOrString`](#intorstring)|Factor is a factor to multiply the base duration after each failed retry|
|`maxDuration`|`string`|MaxDuration is the maximum amount of time allowed for the backoff strategy|

## ContinueOn

ContinueOn defines if a workflow should continue even if a task or step fails/errors. It can be specified if the workflow should continue when the pod errors, fails or both.
Expand Down Expand Up @@ -4447,9 +4448,21 @@ IntOrString is a type that can hold an int32 or a string. When used in JSON or
<summary>Examples with this field (click to open)</summary>
<br>

- [`timeouts-step.yaml`](https://github.com/argoproj/argo/blob/master/examples/timeouts-step.yaml)
- [`clustertemplates.yaml`](https://github.com/argoproj/argo/blob/master/examples/cluster-workflow-template/clustertemplates.yaml)

- [`timeouts-workflow.yaml`](https://github.com/argoproj/argo/blob/master/examples/timeouts-workflow.yaml)
- [`dag-disable-failFast.yaml`](https://github.com/argoproj/argo/blob/master/examples/dag-disable-failFast.yaml)

- [`retry-backoff.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-backoff.yaml)

- [`retry-container.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-container.yaml)

- [`retry-on-error.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-on-error.yaml)

- [`retry-script.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-script.yaml)

- [`retry-with-steps.yaml`](https://github.com/argoproj/argo/blob/master/examples/retry-with-steps.yaml)

- [`templates.yaml`](https://github.com/argoproj/argo/blob/master/examples/workflow-template/templates.yaml)
</details>

## Container
Expand Down
3 changes: 2 additions & 1 deletion docs/swagger.md
Original file line number Diff line number Diff line change
Expand Up @@ -1541,7 +1541,6 @@ Template is a reusable and composable unit of execution in a workflow
| priority | integer | Priority to apply to workflow pods. | No |
| priorityClassName | string | PriorityClassName to apply to workflow pods. | No |
| resource | [io.argoproj.workflow.v1alpha1.ResourceTemplate](#io.argoproj.workflow.v1alpha1.resourcetemplate) | Resource template subtype which can run k8s resources | No |
| resubmitPendingPods | boolean | ResubmitPendingPods is a flag to enable resubmitting pods that remain Pending after initial submission | No |
| retryStrategy | [io.argoproj.workflow.v1alpha1.RetryStrategy](#io.argoproj.workflow.v1alpha1.retrystrategy) | RetryStrategy describes how to retry a template when it fails | No |
| schedulerName | string | If specified, the pod will be dispatched by specified scheduler. Or it will be dispatched by workflow scope scheduler if specified. If neither specified, the pod will be dispatched by default scheduler. | No |
| script | [io.argoproj.workflow.v1alpha1.ScriptTemplate](#io.argoproj.workflow.v1alpha1.scripttemplate) | Script runs a portion of code against an interpreter | No |
Expand Down Expand Up @@ -1769,6 +1768,7 @@ WorkflowSpec is the specification of a Workflow.
| podPriorityClassName | string | PriorityClassName to apply to workflow pods. | No |
| podSpecPatch | string | PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits). | No |
| priority | integer | Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first. | No |
| retryStrategy | [io.argoproj.workflow.v1alpha1.RetryStrategy](#io.argoproj.workflow.v1alpha1.retrystrategy) | RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1. | No |
| schedulerName | string | Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified. | No |
| securityContext | [io.k8s.api.core.v1.PodSecurityContext](#io.k8s.api.core.v1.podsecuritycontext) | SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field. | No |
| serviceAccountName | string | ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as. | No |
Expand Down Expand Up @@ -1928,6 +1928,7 @@ WorkflowTemplateSpec is a spec of WorkflowTemplate.
| podPriorityClassName | string | PriorityClassName to apply to workflow pods. | No |
| podSpecPatch | string | PodSpecPatch holds strategic merge patch to apply against the pod spec. Allows parameterization of container fields which are not strings (e.g. resource limits). | No |
| priority | integer | Priority is used if controller is configured to process limited number of workflows in parallel. Workflows with higher priority are processed first. | No |
| retryStrategy | [io.argoproj.workflow.v1alpha1.RetryStrategy](#io.argoproj.workflow.v1alpha1.retrystrategy) | RetryStrategy for all templates in the io.argoproj.workflow.v1alpha1. | No |
| schedulerName | string | Set scheduler name for all pods. Will be overridden if container/script template's scheduler name is set. Default scheduler will be used if neither specified. | No |
| securityContext | [io.k8s.api.core.v1.PodSecurityContext](#io.k8s.api.core.v1.podsecuritycontext) | SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field. | No |
| serviceAccountName | string | ServiceAccountName is the name of the ServiceAccount to run all pods of the workflow as. | No |
Expand Down
38 changes: 38 additions & 0 deletions docs/tolerating-pod-deletion.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Tolerating Pod Deletion

> v2.12 and after

In Kubernetes, pods are cattle and can be deleted at any time. Deletion could be manual (via `kubectl delete pod`), happen during a node drain, or occur for other reasons.

This can be very inconvenient: your workflow will error, but for reasons outside of your control.

A [pod disruption budget](examples/default-pdb-support.yaml) can reduce the likelihood of this happening, but it cannot entirely prevent it.

To retry pods that were deleted, set `retryStrategy.retryPolicy: OnError`.

This can be set at the workflow level, at the template level, or globally (using [workflow defaults](default-workflow-specs.md)).
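
As a rough sketch of the template-level and global variants: the workflow name `template-level-example` is hypothetical, and the global form assumes the controller reads defaults from a `workflowDefaults` key in `workflow-controller-configmap`. Check the workflow defaults page for the exact form your version expects.

```yaml
# Template-level: only the `main` template is retried on error.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: template-level-example   # hypothetical name
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        retryPolicy: OnError
        limit: 1
      container:
        image: docker/whalesay:latest
        command:
          - sleep
          - 30s
---
# Global: assumed shape of workflow defaults in the controller ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  workflowDefaults: |
    spec:
      retryStrategy:
        retryPolicy: OnError
        limit: 1
```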

## Example

Run the following workflow (which will sleep for 30s):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: example
spec:
  retryStrategy:
    retryPolicy: OnError
    limit: 1
  entrypoint: main
  templates:
    - name: main
      container:
        image: docker/whalesay:latest
        command:
          - sleep
          - 30s
```

Then execute `kubectl delete pod example`. You'll see that the errored node is automatically retried.
24 changes: 22 additions & 2 deletions manifests/base/crds/full/argoproj.io_clusterworkflowtemplates.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -792,6 +792,28 @@ spec:
priority:
  format: int32
  type: integer
retryStrategy:
  properties:
    backoff:
      properties:
        duration:
          type: string
        factor:
          anyOf:
          - type: integer
          - type: string
          x-kubernetes-int-or-string: true
        maxDuration:
          type: string
      type: object
    limit:
      anyOf:
      - type: integer
      - type: string
      x-kubernetes-int-or-string: true
    retryPolicy:
      type: string
  type: object
schedulerName:
  type: string
securityContext:
Expand Down Expand Up @@ -3999,8 +4021,6 @@ spec:
  required:
  - action
  type: object
resubmitPendingPods:
  type: boolean
retryStrategy:
  properties:
    backoff:
Expand Down
24 changes: 22 additions & 2 deletions manifests/base/crds/full/argoproj.io_cronworkflows.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -813,6 +813,28 @@ spec:
priority:
  format: int32
  type: integer
retryStrategy:
  properties:
    backoff:
      properties:
        duration:
          type: string
        factor:
          anyOf:
          - type: integer
          - type: string
          x-kubernetes-int-or-string: true
        maxDuration:
          type: string
      type: object
    limit:
      anyOf:
      - type: integer
      - type: string
      x-kubernetes-int-or-string: true
    retryPolicy:
      type: string
  type: object
schedulerName:
  type: string
securityContext:
Expand Down Expand Up @@ -4020,8 +4042,6 @@ spec:
  required:
  - action
  type: object
resubmitPendingPods:
  type: boolean
retryStrategy:
  properties:
    backoff:
Expand Down