Run steps sequentially / retry when parallel steps exceed memory quota #721

Closed
moreginger opened this issue Feb 6, 2018 · 8 comments
Labels
type/feature Feature request

Comments

@moreginger

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened:

We set a requests.memory quota for the workflows namespace. Then we ran a workflow with parallel steps which together (but not individually) exceeded the quota. The workflow failed these steps with the following error in the workflow-controller:

time="2018-02-06T14:51:34Z" level=info msg="Updated phase Running -> Error" namespace=argo-workflows workflow=steps-6kgn6
time="2018-02-06T14:51:34Z" level=info msg="Updated message  -> pods \"steps-6kgn6-2822283003\" is forbidden: exceeded quota: argo-workflows-quota, requested: requests.memory=8256Mi, used: requests.memory=8256Mi, limited: requests.memory=12Gi" namespace=argo-workflows workflow=steps-6kgn6

What you expected to happen:

The steps should have been run sequentially and succeeded. If separate processes (e.g. other Argo workflows) are causing the quota to be exceeded, Argo should be able to retry.

How to reproduce it (as minimally and precisely as possible):

Set a requests.memory quota for the workflow namespace...

apiVersion: v1
kind: ResourceQuota
metadata:
  name: argo-workflows-quota
spec:
  hard:
    requests.memory: 12Gi

Submit the following workflow...

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  templates:
  - name: hello-hello-hello
    steps:
    - - name: hello1
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello1"
    - - name: hello2a
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2a"
      - name: hello2b
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello2b"

  - name: whalesay
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
      resources:
        requests:
          memory: 8Gi

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version

argo: v2.0.0-beta1
BuildDate: 2018-01-18T22:06:03Z
GitCommit: 549870c
GitTreeState: clean
GitTag: v2.0.0-beta1
GoVersion: go1.9.1
Compiler: gc
Platform: linux/amd64

  • Kubernetes version:
$ kubectl version -o yaml

clientVersion:
  buildDate: 2018-01-18T10:09:24Z
  compiler: gc
  gitCommit: 5fa2db2bd46ac79e5e00a4e6ed24191080aa463b
  gitTreeState: clean
  gitVersion: v1.9.2
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64
serverVersion:
  buildDate: 2017-12-15T20:55:30Z
  compiler: gc
  gitCommit: 925c127ec6b946659ad0fd596fa959be43f0cc05
  gitTreeState: clean
  gitVersion: v1.9.0
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64

@jessesuen
Member

There is a retryStrategy field which will allow container steps to be retried in the event of a failure (unfortunately this does not have a backoff policy yet). Also, in 2.1, there will be a parallelism feature to limit the parallelism of a workflow and/or template. Will these satisfy your use case?

I don't think scheduling algorithms based on resources belong in the controller, because this is supposed to be the responsibility of an admission controller.
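
For reference, a minimal sketch of where those two fields sit in a workflow spec (field placement is assumed from Argo Workflows documentation of that era, not taken from this thread; the whalesay template is reused from the reproduction above):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  parallelism: 1            # workflow-level cap on concurrently running pods (2.1+)
  templates:
  - name: hello-hello-hello
    parallelism: 1          # can also be set per steps/dag template
    steps:
    - - name: hello2a
        template: whalesay
      - name: hello2b
        template: whalesay
  - name: whalesay
    retryStrategy:
      limit: 3              # retry a failed step up to 3 times (no backoff policy yet)
    container:
      image: docker/whalesay
      command: [cowsay]
      resources:
        requests:
          memory: 8Gi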

@moreginger
Author

Thanks, hmmmm. I didn't know about retryStrategy or k8s Admission Controllers. From a quick read, I don't think an Admission Controller would help much here. What we're trying to do is limit the resources that a particular namespace can use at one time, without blocking a workflow unless a single step claims too many resources. This sounds more like the job a (custom?) scheduler should be doing, perhaps in combination with an Admission Controller that stops you from submitting a pod which could never run given the scheduler quotas.

@mlarrousse

My team is in a similar boat right now too.

Jobs, StatefulSets, Deployments, etc. all check the CPU and memory limit quotas and retry until pods can be scheduled.

Parallelism doesn't satisfy our use case, as we want to limit the total sum of work that a user has running rather than limiting single workflows to a specified amount.

@jessesuen jessesuen added the type/feature Feature request label Apr 19, 2019
@jessesuen
Member

I believe PR #1096 is trying to solve this. Will try to revive this PR based on current master.

@DylanBowden

Hi, PR #1096 is closed, but not because it was merged. I'm guessing the idea of limiting parallel executions didn't take off? Thanks

@alexec
Contributor

alexec commented Dec 23, 2019

I closed that PR because it was inactive for a year. I believe you can use the parallelism feature to achieve the same goal?

@DylanBowden

DylanBowden commented Dec 26, 2019

Indeed you can; it just took me a while to figure out how.

For anyone else searching:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-workflow-
spec:
  parallelism: 2
  ...

@simster7
Member

simster7 commented Jun 4, 2020

Closed by #2385

@simster7 simster7 closed this as completed Jun 4, 2020