Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend workflow controller to handle creating pods in namespaces with a resource quota and limit range #1096

Closed
wants to merge 11 commits into from

Conversation

gaigepr
Copy link

@gaigepr gaigepr commented Nov 16, 2018

This PR aims to fix an existing issue with argo: creating workflow pods in namespaces with kubernetes resource quotas.

Here is an example resource quota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: q1
  namespace: sg1
spec:
  hard:
    limits.cpu: 1000m
    limits.memory: 1000Mi

As of now, creating a workflow in a namespace with a resource quota (and no limit range) would give an error like:

time="2018-11-15T23:51:35Z" level=info msg="Failed to create pod pods \"hello-world-l69s7\" is forbidden: failed quota: q1: must specify limits.cpu,limits.memory,requests.cpu,requests.memory" namespace=sg1 workflow=hello-world-l69s7

The message says, "failed quota" because the container(s) argo injects into a workflow do not spec those resources and if the user doesn't add resources to their container specs in the flow. If we specify a limit range like:

apiVersion: v1
kind: LimitRange
metadata:
  name: q1
  namespace: sg1
spec:
  limits:
  - type: Container
    default:
      cpu: 200m
      memory: 200Mi
    defaultRequest:
      cpu: 50m
      memory: 200Mi

we solve a part of the problem. The workflow will now fail when there are not enough resources to create a pod. The error looks like:

time="2018-11-16T00:14:50Z" level=info msg="Failed to create pod hello-world-2sqg5 (hello-world-2sqg5): pods \"hello-world-2sqg5\" is forbidden: exceeded quota: q1, requested: limits.cpu=250m,limits.memory=210Mi, used: limits.cpu=1,limits.memory=840Mi, limited: limits.cpu=1,limits.memory=1000Mi" namespace=sg1 workflow=hello-world-2sqg5

To get (infinite) retries to create a workflow pod, some code needed to be added. With this PR, workflows can be dispatched to a namespace with a resource quota and a limit range. They will eventually succeed.

A downside to the current implementation is there is no rate limiting for trying to create a pod that previously failed because of resource constraints. This may overload the k8s master and generally just cost a lot of cpu.

Copy link
Author

@gaigepr gaigepr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs better comments.

@gaigepr
Copy link
Author

gaigepr commented Nov 16, 2018

Perhaps throttle retries? if skipNodeInitialization { sleep ... }

@gaigepr
Copy link
Author

gaigepr commented Nov 16, 2018

Should woc implement a MarkForUpdate method in some other pr?

_, err := woc.createWorkflowPod(nodeName, *tmpl.Container, tmpl)
if err != nil {
return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodeError, err.Error())
if strings.Contains(err.Error(), exceededQuota) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the strings.Contains call into a function called ExceededQuota or something?

_, err := woc.createWorkflowPod(nodeName, mainCtr, tmpl)
if err != nil {
return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodeError, err.Error())
if strings.Contains(err.Error(), exceededQuota) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the strings.Contains call into a function called ExceededQuota or something?

return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodeError, err.Error())
}
}
if skipNodeInitialization {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this branch is only reached when a node exists but the pod failed to create because of an exceededQuota error. Should a sleep be added here to avoid slamming the master with Pod create requests?

return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodeError, err.Error())
}
}
if skipNodeInitialization {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this branch is only reached when a node exists but the pod failed to create because of an exceededQuota error. Should a sleep be added here to avoid slamming the master with Pod create requests?

}
return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodePending)
if skipNodeInitialization {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this branch is only reached when a node exists but the pod failed to create because of an exceededQuota error. Should a sleep be added here to avoid slamming the master with Pod create requests?

@alexec
Copy link
Contributor

alexec commented Dec 19, 2019

This PR has had no activity in 1 year. Closing.

@alexec alexec closed this Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants