Run steps sequentially / retry when parallel steps exceed memory quota #721

Closed
moreginger opened this issue Feb 6, 2018 · 8 comments
Labels
type/feature Feature request

Comments

@moreginger

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened:

We set a requests.memory quota for the workflows namespace. Then we ran a workflow with parallel steps which together (but not individually) exceeded the quota. The workflow failed these steps with the following error in the workflow-controller:

time="2018-02-06T14:51:34Z" level=info msg="Updated phase Running -> Error" namespace=argo-workflows workflow=steps-6kgn6
time="2018-02-06T14:51:34Z" level=info msg="Updated message  -> pods \"steps-6kgn6-2822283003\" is forbidden: exceeded quota: argo-workflows-quota, requested: requests.memory=8256Mi, used: requests.memory=8256Mi, limited: requests.memory=12Gi" namespace=argo-workflows workflow=steps-6kgn6

What you expected to happen:

The steps should have been run sequentially and succeeded. If separate processes (e.g. other Argo workflows) are causing the quota to be exceeded, Argo should be able to retry.

How to reproduce it (as minimally and precisely as possible):

Set a requests.memory quota for the workflow namespace...

apiVersion: v1
kind: ResourceQuota
metadata:
  name: argo-workflows-quota
spec:
  hard:
    requests.memory: 12Gi

Submit the following workflow...

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello1"
    - - name: hello2a
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2a"
      - name: hello2b
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2b"

  - name: whalesay
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
      resources:
        requests:
          memory: 8Gi

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version

argo: v2.0.0-beta1
BuildDate: 2018-01-18T22:06:03Z
GitCommit: 549870c
GitTreeState: clean
GitTag: v2.0.0-beta1
GoVersion: go1.9.1
Compiler: gc
Platform: linux/amd64

  • Kubernetes version:
$ kubectl version -o yaml

clientVersion:
  buildDate: 2018-01-18T10:09:24Z
  compiler: gc
  gitCommit: 5fa2db2bd46ac79e5e00a4e6ed24191080aa463b
  gitTreeState: clean
  gitVersion: v1.9.2
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64
serverVersion:
  buildDate: 2017-12-15T20:55:30Z
  compiler: gc
  gitCommit: 925c127ec6b946659ad0fd596fa959be43f0cc05
  gitTreeState: clean
  gitVersion: v1.9.0
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64

@jessesuen
Member

There is a retryStrategy field which will allow container steps to be retried in the event of a failure (unfortunately this does not have a backoff policy yet). Also, in 2.1, there will be a parallelism feature to limit the parallelism of a workflow and/or template. Will these satisfy your use case?

I don't think scheduling algorithms based on resources belong in the controller, because this is supposed to be the responsibility of an admission controller.
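
For reference, a minimal sketch of where those two fields sit in a workflow spec (field placement is assumed from Argo Workflows documentation of that era, not taken from this thread; the whalesay template is reused from the reproduction above):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  parallelism: 1            # workflow-level cap on concurrently running pods (2.1+)
  templates:
  - name: hello-hello-hello
    parallelism: 1          # can also be set per steps/dag template
    steps:
    - - name: hello2a
        template: whalesay
      - name: hello2b
        template: whalesay
  - name: whalesay
    retryStrategy:
      limit: 3              # retry a failed step up to 3 times (no backoff policy yet)
    container:
      image: docker/whalesay
      command: [cowsay]
      resources:
        requests:
          memory: 8Gi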

@moreginger
Author

Thanks, hmmmm. I didn't know about retryStrategy or k8s Admission Controllers. From a quick read, I don't think an Admission Controller would help much here. What we're trying to do is limit the resources that a particular namespace can use at one time, without blocking a workflow unless a single step claims too many resources. This sounds more like the job a (custom?) scheduler should be doing, perhaps in combination with an Admission Controller that stops you from submitting a pod which could never run given the scheduler quotas.

@mlarrousse

My team is in a similar boat right now too.

Jobs, StatefulSets, Deployments, etc. all check the CPU and memory limit quotas and retry until pods can be scheduled.

Parallelism doesn't satisfy our use case, as we want to limit the total sum of work that a user has running rather than limiting single workflows to a specified amount.

@jessesuen jessesuen added the type/feature Feature request label Apr 19, 2019
@jessesuen
Member

I believe PR #1096 is trying to solve this. Will try to revive this PR based on current master.

@DylanBowden

Hi, PR #1096 is closed, but not because it was merged. I'm guessing the idea of limiting parallel executions didn't take off? Thanks

@alexec
Contributor

alexec commented Dec 23, 2019

I closed that PR because it was inactive for a year. I believe you can use the parallelism feature to achieve the same goal?

@DylanBowden

DylanBowden commented Dec 26, 2019

Indeed you can; it just took me a while to figure out how.

For anyone else searching:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-workflow-
spec:
  parallelism: 2
  ...

@simster7
Member

simster7 commented Jun 4, 2020

Closed by #2385

@simster7 simster7 closed this as completed Jun 4, 2020