
Operation cannot be fulfilled on resourcequotas "gke-resource-quotas": the object has been modified; please apply your changes to the latest version and try again #3217

Closed
bryanlarsen opened this issue Jun 11, 2020 · 14 comments · Fixed by #3853

@bryanlarsen

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:

Argo workflows failed with the error "Operation cannot be fulfilled on resourcequotas "gke-resource-quotas": the object has been modified; please apply your changes to the latest version and try again"

What you expected to happen:

Workflows complete.

How to reproduce it (as minimally and precisely as possible):

Run two workflows simultaneously, each with a parallelism of roughly 300 or more.

Sample workflows attached.

Anything else we need to know?:

This appears to be a Kubernetes bug, but since we're running on GKE, a workaround in Argo would be much more within our control.

kubernetes/kubernetes#67761

Environment:

  • Argo version: 2.8.1
  • Kubernetes version: v1.14.10-gke.36

Other debugging information (if applicable):

  • workflow-controller logs:

time="2020-06-11T18:14:20Z" level=info msg="node &NodeStatus{ID:loops-param-result-b-1634545795,Name:loops-param-result-b[1].sleep(322:2),DisplayName:sleep(322:2),Type:Pod,TemplateName:sleep-n-sec,TemplateRef:nil,Phase:Error,BoundaryID:loops-param-result-b,Message:,StartedAt:2020-06-11 18:14:20.364568768 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,PodIP:,Daemoned:nil,Inputs:&Inputs{Parameters:[]Parameter{Parameter{Name:seconds,Default:nil,Value:*2,ValueFrom:nil,GlobalName:,},},Artifacts:[]Artifact{},},Outputs:nil,Children:[],OutboundNodes:[],StoredTemplateID:,WorkflowTemplateName:,TemplateScope:local/loops-param-result-b,ResourcesDuration:ResourcesDuration{},HostNodeName:,} message: Operation cannot be fulfilled on resourcequotas \"gke-resource-quotas\": the object has been modified; please apply your changes to the latest version and try again" namespace=argo workflow=loops-param-result-b

  • sample workflows that reproduce the issue:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: loops-param-result-a
spec:
  entrypoint: loop-param-result-example
  serviceAccountName: argo
  templates:
  - name: loop-param-result-example
    parallelism: 300
    steps:
    - - name: generate
        template: gen-number-list
    # Iterate over the list of numbers generated by the generate step above
    - - name: sleep
        template: sleep-n-sec
        arguments:
          parameters:
          - name: seconds
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"

  # Generate a list of numbers in JSON format
  - name: gen-number-list
    script:
      image: alpine:latest
      command: [sh, -c]
      args:
        - echo "[$(for i in $(seq 100) ; do seq 0 9 | tr '\n' ','; done )10]"

  - name: sleep-n-sec
    inputs:
      parameters:
      - name: seconds
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo sleeping for {{inputs.parameters.seconds}} seconds; sleep {{inputs.parameters.seconds}}; echo done"]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: loops-param-result-b
spec:
  entrypoint: loop-param-result-example
  serviceAccountName: argo
  templates:
  - name: loop-param-result-example
    parallelism: 300
    steps:
    - - name: generate
        template: gen-number-list
    # Iterate over the list of numbers generated by the generate step above
    - - name: sleep
        template: sleep-n-sec
        arguments:
          parameters:
          - name: seconds
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"

  # Generate a list of numbers in JSON format
  - name: gen-number-list
    script:
      image: alpine:latest
      command: [sh, -c]
      args:
        - echo "[$(for i in $(seq 100) ; do seq 0 9 | tr '\n' ','; done )10]"

  - name: sleep-n-sec
    inputs:
      parameters:
      - name: seconds
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo sleeping for {{inputs.parameters.seconds}} seconds; sleep {{inputs.parameters.seconds}}; echo done"]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: loops-param-result-c
spec:
  entrypoint: loop-param-result-example
  serviceAccountName: argo
  templates:
  - name: loop-param-result-example
    parallelism: 300
    steps:
    - - name: generate
        template: gen-number-list
    # Iterate over the list of numbers generated by the generate step above
    - - name: sleep
        template: sleep-n-sec
        arguments:
          parameters:
          - name: seconds
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"

  # Generate a list of numbers in JSON format
  - name: gen-number-list
    script:
      image: alpine:latest
      command: [sh, -c]
      args:
        - echo "[$(for i in $(seq 100) ; do seq 0 9 | tr '\n' ','; done )10]"

  - name: sleep-n-sec
    inputs:
      parameters:
      - name: seconds
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo sleeping for {{inputs.parameters.seconds}} seconds; sleep {{inputs.parameters.seconds}}; echo done"]

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@alexec
Contributor

alexec commented Jul 13, 2020

Argo Workflows never changes resource quotas directly, so unless the quota is somehow modified as a side-effect of creating pods, we are not sure what is happening here.

@bryanlarsen
Author

It's a Kubernetes bug. kubernetes/kubernetes#67761

But it's a bug that seems fairly straightforward to work around in Argo; just retry if this particular failure condition is hit.
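For illustration, a minimal sketch of such a retry (not Argo's actual code; the package and function names are placeholders) using client-go's retry helper to keep retrying while the API server returns a Conflict:

// Sketch only: retry pod creation while the API server returns a 409
// Conflict, which is what the "gke-resource-quotas" error above is.
package quotaretry

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

// createPodWithRetry retries fn with the default backoff for as long as the
// returned error is a Conflict; any other error is returned immediately.
func createPodWithRetry(fn func() error) error {
	return retry.OnError(retry.DefaultBackoff, apierrors.IsConflict, fn)
}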

@Ziyang2go I think you said you had a hack fix, can you attach it?

@alexec
Contributor

alexec commented Jul 16, 2020

@bryanlarsen are you able to provide more information about the fix please? E.g. some code?

@alexec
Contributor

alexec commented Aug 10, 2020

time="2020-08-08T12:35:05Z" level=info msg="Failed to create pod procsk01deleteme-testpipeline-c72l7[0].procsk01deleteme[1].create-spark-app-for-a-processor (procsk01deleteme-testpipeline-c72l7-3405202402): Operation cannot be fulfilled on resourcequotas \"resource-quota\": the object has been modified; please apply your changes to the latest version and try again" namespace=data-datalake-processor19-deleteme-usw2-ppd-e2e workflow=procsk01deleteme-testpipeline-c72l7
time="2020-08-08T12:35:05Z" level=error msg="Mark error node" error="Operation cannot be fulfilled on resourcequotas \"resource-quota\": the object has been modified; please apply your changes to the latest version and try again" namespace=data-datalake-processor19-deleteme-usw2-ppd-e2e nodeName="procsk01deleteme-testpipeline-c72l7[0].procsk01deleteme[1].create-spark-app-for-a-processor" workflow=procsk01deleteme-testpipeline-c72l7

@alexec
Contributor

alexec commented Aug 10, 2020

The fix must match on the error message: Operation cannot be fulfilled on resourcequota
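A minimal sketch of such a match, assuming a plain substring check (the package and helper names are hypothetical, not Argo's code):

// Sketch only: treat the resource-quota conflict as transient by matching
// its message, so the node can be retried instead of marked as Error.
package quotaretry

import "strings"

func isResourceQuotaConflict(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), "Operation cannot be fulfilled on resourcequota")
}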

@Ziyang2go
Contributor

> It's a Kubernetes bug. kubernetes/kubernetes#67761
>
> But it's a bug that seems fairly straightforward to work around in Argo; just retry if this particular failure condition is hit.
>
> @Ziyang2go I think you said you had a hack fix, can you attach it?

@bryanlarsen The workaround we used is to requeue the workflow when pod creation fails, rather than failing the workflow. That works for us because we use pre-defined workflow templates, this is the only pod-creation error we've seen, and it resolves itself on retry.

We are on Argo version 2.4.3, and here is the patch we made:

--- a/workflow/controller/operator.go
+++ b/workflow/controller/operator.go
@@ -1437,10 +1437,20 @@ func (woc *wfOperationCtx) checkParallelism(tmpl *wfv1.Template, node *wfv1.Node
        return nil
 }
 
+func (woc *wfOperationCtx) markNodePending(nodeName string, err error) *wfv1.NodeStatus {
+       woc.log.Infof("Mark node %s as Pending, due to: %+v", nodeName, err)
+       node := woc.getNodeByName(nodeName)
+       return woc.markNodePhase(nodeName, wfv1.NodePending, fmt.Sprintf("Pending %s", time.Since(node.StartedAt.Time)))
+}
+
 func (woc *wfOperationCtx) executeContainer(nodeName string, tmpl *wfv1.Template, boundaryID string) error {
        woc.log.Debugf("Executing node %s with container template: %v\n", nodeName, tmpl)
        _, err := woc.createWorkflowPod(nodeName, *tmpl.Container, tmpl, false)
-       return err
+       if err != nil {
+               woc.requeue()
+               woc.markNodePending(nodeName, err)
+       }
+       return nil
 }
 

@alexec
Contributor

alexec commented Aug 12, 2020

@Ziyang2go @bryanlarsen if I create a development build of the workflow controller - could you please test it? Thank you.

@iMoses

iMoses commented Aug 15, 2020

I'm getting the exact same error when trying to use MinIO as an artifact repository.
Is there any way to work around it without changing code? On the cloud provider side, maybe? I'm using AKS.

@alexec I'd be happy to test any such solution

@alexec
Contributor

alexec commented Aug 17, 2020

@iMoses I have pushed argoproj/workflow-controller:fix-3217 for testing. There is no workaround AFAIK.

@iMoses

iMoses commented Aug 18, 2020

I'm still getting the same error with the patched image:

time="2020-08-18T10:33:23Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"elasticdump-backup-qlqpr\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=stage0 workflow=elasticdump-backup-qlqpr
Containers:
  controller:
    Container ID:  docker://67f4ddfda090923240c8309af0b407f56660dfef44418f90267484e9ccf4b6d7
    Image:         argoproj/workflow-controller:fix-3217
    Image ID:      docker-pullable://argoproj/workflow-controller@sha256:0b47d257f35acfecc84b3781811243270ac849e68b3b8d50754b3d30e32dd10b
    Port:          <none>
    Host Port:     <none>
    Command:
      workflow-controller

@alexec
Contributor

alexec commented Aug 18, 2020

Thank you @iMoses - can you try it with resubmitPendingPods please?

https://github.com/argoproj/argo/blob/master/docs/fields.md#fields-14
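For reference, a sketch of where that field sits in a workflow spec; everything below except resubmitPendingPods is a placeholder, not taken from this issue:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-
spec:
  resubmitPendingPods: true  # resubmit pods that are stuck in Pending
  entrypoint: main
  templates:
  - name: main
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo hello"]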

@alexec
Contributor

alexec commented Aug 18, 2020

I've created a new image for testing if you would like to try it: argoproj/workflow-controller:fix-3791.

@alexec
Contributor

alexec commented Aug 22, 2020

I've created another test image: argoproj/workflow-controller:fix-3791.

Can you please try it out to confirm it fixes your problem?

@alexec
Contributor

alexec commented Sep 2, 2020

Available for testing in v2.11.0-rc1.
