Extend workflow controller to handle creating pods in namespaces with a resource quota and limit range #1096
Merge upstream master to my fork
…ation in that case.
…s to createWorkflowPod.
…s but has a failed quota error.
Needs better comments.
Perhaps throttle retries?
Should
workflow/controller/operator.go
Outdated
_, err := woc.createWorkflowPod(nodeName, *tmpl.Container, tmpl)
if err != nil {
return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodeError, err.Error())
if strings.Contains(err.Error(), exceededQuota) {
Move the strings.Contains call into a function called ExceededQuota or something?
workflow/controller/operator.go
Outdated
_, err := woc.createWorkflowPod(nodeName, mainCtr, tmpl)
if err != nil {
return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodeError, err.Error())
if strings.Contains(err.Error(), exceededQuota) {
return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodeError, err.Error())
}
}
if skipNodeInitialization {
This branch is only reached when a node exists but the pod failed to create because of an exceededQuota error. Should a sleep be added here to avoid slamming the master with Pod create requests?
}
return woc.initializeNode(nodeName, wfv1.NodeTypePod, tmpl.Name, boundaryID, wfv1.NodePending)
if skipNodeInitialization {
This PR has had no activity in 1 year. Closing.
This PR aims to fix an existing issue with Argo: creating workflow pods in namespaces with Kubernetes resource quotas.
Here is an example resource quota:
As of now, creating a workflow in a namespace with a resource quota (and no limit range) would give an error like:
The message says "failed quota" because the container(s) Argo injects into a workflow do not specify those resources, and neither do the user's containers unless the user adds resources to their container specs in the workflow. If we specify a limit range like:
we solve part of the problem. The workflow will now fail when there are not enough resources to create a pod. The error looks like:
To get (infinite) retries to create a workflow pod, some code needed to be added. With this PR, workflows can be dispatched to a namespace with a resource quota and a limit range. They will eventually succeed.
A downside of the current implementation is that there is no rate limiting on retries for a pod that previously failed to create because of resource constraints. This may overload the Kubernetes master and waste a lot of CPU.