diff --git a/teps/0069-support-retries-for-custom-task-in-a-pipeline.md b/teps/0069-support-retries-for-custom-task-in-a-pipeline.md index 45273a2a5..d66d1aaaf 100644 --- a/teps/0069-support-retries-for-custom-task-in-a-pipeline.md +++ b/teps/0069-support-retries-for-custom-task-in-a-pipeline.md @@ -29,6 +29,7 @@ authors: - [Alternatives](#alternatives) - [Infrastructure Needed (optional)](#infrastructure-needed-optional) - [Upgrade & Migration Strategy (optional)](#upgrade--migration-strategy-optional) +- [Implementation Pull request(s)](#implementation-pull-request-s) - [References (optional)](#references-optional) @@ -67,8 +68,8 @@ stub code to make it easy how to support it. ### Goals * Support propagating `pipelineSpec.task.retries` count information to custom-task controllers. -* Support signalling a `retry` to custom task controller, for a specific run. A - custom controller may optionally support it. +* Support updating the status of retry history to `tektoncd pipeline` + controller. * Gracefully handle the case where, custom controller does not support retry and yet the `PipelineRun` happens to be configured with retry. This also implies, an existing controller should not mis-behave if it is _not_ upgraded @@ -108,74 +109,17 @@ None. Requesting API changes: -1. Add field `Retries` to `RunSpec`, an integer count which acts as a FYI to +1. Add field `Retries` to `RunSpec`, an integer count which is communicated to custom task controller. -2. Add a new `RunRetry`, in addition to `RunCancelled` status to `RunSpecStatus` - i.e. `v1alpha1.RunSpecStatusRetry` -3. Add a field `RetriesStatus` to `RunStatusFields`, to maintain the retry +2. Add a field `RetriesStatus` to `RunStatusFields`, to maintain the retry history for a `Run`, similar to `v1beta1.TaskRunStatusFields.RetriesStatus` + This field is updated by the custom task controller. -Proposed algorithm for performing a retry for custom task. - -- Step 1. A `pipelineTask` consisting of a custom task X, is configured with - `retries` count. - -- Step 2. On failure of task X, `pipelinerun` controller sees a request for a - retry. It then communicates the same to custom task `Run` by patching - `/spec/status` with a `v1alpha1.RunSpecStatusRetry` i.e. `RunRetry`. Similar - to request a custom task to cancel. - -- Step 3. In addition to patching the `pipelinerun` controller also enqueue a timer - `time.After(30*time.Second)` (configurable). On completion of timeout - (i.e. 30s), it checks if `/spec/status` is `RunRetry`, then it assumes that - custom task does not support retry. - - a) if custom task does not supports retry as above, It sets no. of `retry done` - to the `retries` count configured - i.e. exhaust all retries. - - b) if custom task does support retry, update retry history. - -- Step 4. The custom task that wants to support the retry, has to update - - a) `status.conditions` to indicate it is `Running`. - - b) clear `/spec/status` if it is `RunRetry`. - -_A task may retry and immediately fail, so controller cannot fully rely on -`status.conditions`._ +A pipeline task may be configured with a timeout, and the timeout includes time +required to perform all the retries. ### Notes/Caveats (optional) - -Q. A Custom task does not support retry, and is configured to run with retry. - How to gracefully handle this case? - - _Approach proposed:_ The `pipelineRun` waits for a configurable shorter timeout - (say 30s), and if the custom task controller does not signal that it has begun - to retry, assume it does not support `retry`. - -Other options: - -* Option 1: The `pipelineRun` should wait till the timeout and fail. The - downside of this approach is, it may wait for a very long period of time. -* Option 2: It should have a way of knowing custom task does not support a - retry. e.g. Custom controllers declaring that they support retry, somehow - (not sure how this can be done). - -Q. In a rare scenario, what if there is race between "RunRetry" and - "RunCancelled", i.e. tektoncd controller asks the custom controller to retry - and soon after decides to cancel (or user invoked cancel). Meanwhile, custom - controller detects, that it has been asked to retry and begins by clearing - its status. This may cause, custom controller to miss the cancel update. - -```go -patches := []jsonpatch.JsonPatchOperation{{ - Operation: "test", - Path: "/spec/status", - Value: v1alpha1.RunSpecStatusRetry, - }, { - Operation: "remove", - Path: "/spec/status", - }} -``` - -A patch as above can be used i.e. test if `/spec/status` -has `v1alpha1.RunSpecStatusRetry` then clear it, else fail. +None. ### Risks and Mitigations @@ -220,32 +164,18 @@ Add an optional `Retries` field of type `int` to `RunSpec`. Add optional `RetriesStatus` field to `RunStatusFields` of type `[]RunStatus`. -Add a config map entry (default-short-timeout-seconds) to `config-defaults` in -order to make short timeout configurable. - -```yaml -# default-short-timeout-seconds contains the default number of -# seconds to wait for custom task to respond, on timeout it is assumed -# custom task does not support the feature. Currently, it is used to -# quickly timeout a retry in a custom-task. -default-short-timeout-seconds: "30" # 30 seconds -``` - -Introduce a new status `RunSpecStatusRetry RunSpecStatus = "RunRetry"` for -`/spec/status` of a `Run`. - -Algorithm for performing a retry for a custom task is same as proposal section -[Proposal](#proposal). +A custom task controller can optionally support retry, and can honor the retries +count, and update the `RetriesStatus` on each retry. ## Test Plan Add unit tests and e2e integration tests for following two cases. -1. If the custom task, does not support a retry, we wait until the configured shorter timeout - (30 seconds by default) and exhaust all retries. +1. If the custom task, does not support a retry, we wait until the configured + pipeline task timeout and fail. -2. If the custom task *does support* a retry i.e., it does clear its 'spec.status', - on each retry. Verify it performs the correct number of retries. +2. If the custom task *does support* a retry i.e., it does update the `RetriesStatus`. + Verify it performs the correct number of retries. ## Design Evaluation