[bug] workflow resource leaking when no run resource found in KFP DB #6189
Found logs: 2021-07-29 06:30:02.824 HKT time="2021-07-28T22:30:02Z" level=error msg="Transient failure while syncing resource (kubeflow/add-pipeline-28qf9): CustomError (code: 0): Syncing Workflow (add-pipeline-28qf9): transient failure: CustomError (code: 0): Error while reporting workflow resource (code: Internal, message: Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run): rpc error: code = Internal desc = Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run, <deleted-the-entire-workflow-spec>: rpc error: code = Internal desc = Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run"
I cannot find any other logs about this workflow in the persistence agent or the API server. I propose mitigating this problem as follows: when reporting a workflow, if the corresponding run is not in the KFP DB, we should delete the workflow to avoid leaking resources.
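The mitigation above can be sketched in Go. This is illustrative only: `RunStore`, `WorkflowClient`, and `ErrRunNotFound` are hypothetical stand-ins, not the actual interfaces in `backend/src/apiserver/resource/resource_manager.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRunNotFound stands in for the "Row not found" error from the KFP DB.
var ErrRunNotFound = errors.New("run not found")

// RunStore is a hypothetical view of the KFP DB.
type RunStore interface {
	UpdateRun(runID, manifest string) error
}

// WorkflowClient is a hypothetical view of the Argo workflow API.
type WorkflowClient interface {
	DeleteWorkflow(namespace, name string) error
}

// ReportWorkflow persists a workflow's latest state. If the corresponding
// run row no longer exists in the DB, nothing will ever GC the workflow,
// so we delete it immediately instead of leaking it.
func ReportWorkflow(store RunStore, wf WorkflowClient, runID, namespace, name, manifest string) error {
	err := store.UpdateRun(runID, manifest)
	if errors.Is(err, ErrRunNotFound) {
		if delErr := wf.DeleteWorkflow(namespace, name); delErr != nil {
			return fmt.Errorf("deleting orphaned workflow %s/%s: %w", namespace, name, delErr)
		}
		return nil
	}
	return err
}

// fakeStore always returns a fixed error from UpdateRun.
type fakeStore struct{ err error }

func (s fakeStore) UpdateRun(runID, manifest string) error { return s.err }

// fakeWorkflows records which workflows were deleted.
type fakeWorkflows struct{ deleted []string }

func (w *fakeWorkflows) DeleteWorkflow(ns, name string) error {
	w.deleted = append(w.deleted, ns+"/"+name)
	return nil
}

func main() {
	wf := &fakeWorkflows{}
	err := ReportWorkflow(fakeStore{err: ErrRunNotFound}, wf, "074cc620", "kubeflow", "add-pipeline-28qf9", "{}")
	fmt.Println(err, wf.deleted) // <nil> [kubeflow/add-pipeline-28qf9]
}
```

The key design point is that the "run row missing" case is treated as terminal rather than transient, so the sync loop stops retrying and the orphaned workflow is cleaned up.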
Meanwhile, it's important to keep debugging why these workflows were leaked; there might be other bugs hidden somewhere.
With the mitigation, remaining workflow count is reduced to:
I further found a different cause of leaked workflows: #6192
#6189 (#6190)
* fix(backend): argo workflow not found in KFP DB should be GCed
* Update backend/src/apiserver/resource/resource_manager.go
* address feedback
* tidy

Co-authored-by: Niklas Hansson <niklas.sven.hansson@gmail.com>
FWIW, I ran into this issue twice in the past week, running KFP 1.7.1. I didn't notice any other interesting log messages around that time. Let me know if I can provide more debug info. Log message from
It was just confusing, and caused some delay for me. I launched a workflow, then came back to check on it hours later and found that it hadn't run. The run details page wouldn't load (a constant spinner over the DAG part of the page). On the experiment page my run was there, but with a grey question-mark status icon. When I looked at Workflow objects on k8s, I couldn't find anything.
This is happening more and more frequently for my team, multiple times a week now. It's disruptive for us because oncall needs to look up which workflow was lost and notify users that their run is never going to complete. Are there any workarounds we could try?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
What steps did you take:

1. Connect to the kfp-standalone-1 cluster in the kfp-ci project.
2. Count current workflows: `kubectl get workflow | wc -l` returns 1252.
3. Confirm current workflow ages: `kubectl get workflow | less`
What happened:
I found many workflows with an age greater than 1 day, our configured workflow GC time.
Because of this issue, there are too many Pods on each node, which crashes the GKE metrics server.
What did you expect to happen:
Workflows should be GCed after being persisted to KFP DB.
Environment:
Labels
/area backend
/area testing
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.