[bug] workflow resource leaking when no run resource found in KFP DB #6189

Bobgy · 2021-07-29T03:37:51Z

What steps did you take

Connect to the kfp-standalone-1 cluster in the kfp-ci project.
Count the current workflows (1252):
```
kubectl get workflow | wc -l
    1252
```
confirm current workflow ages:
```
kubectl get workflow | less
```
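To count only the workflows that outlived the GC window instead of paging through `less`, the check above can be scripted. This is a minimal sketch, not part of KFP: the function names are hypothetical, it assumes the input is the output of `kubectl get workflow -o json`, and it measures age from `metadata.creationTimestamp` (the same basis as the AGE column), with 1d as the assumed GC time.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import List


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes RFC 3339 timestamp such as 2021-07-28T22:30:02Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_workflows(
    workflow_list_json: str,
    now: datetime,
    max_age: timedelta = timedelta(days=1),  # assumption: GC time is 1d
) -> List[str]:
    """Return names of workflows whose creation age exceeds max_age."""
    items = json.loads(workflow_list_json).get("items", [])
    stale = []
    for item in items:
        meta = item["metadata"]
        age = now - parse_k8s_time(meta["creationTimestamp"])
        if age > max_age:
            stale.append(meta["name"])
    return stale
```

One way to use it is to capture `kubectl get workflow -o json` into a string and pass it in together with `datetime.now(timezone.utc)`.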

What happened:

I found many workflows older than 1d, our configured workflow GC time.
As a result, there are too many Pods on each node, which is crashing the GKE metrics server.

What did you expect to happen:

Workflows should be GCed after being persisted to KFP DB.

Environment:

How do you deploy Kubeflow Pipelines (KFP)? standalone

KFP version: 1.7.0-rc.2

Labels

/area backend

/area testing

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

Bobgy · 2021-07-29T03:43:23Z

Found logs:

2021-07-29 06:30:02.824 HKT time="2021-07-28T22:30:02Z" level=error msg="Transient failure while syncing resource (kubeflow/add-pipeline-28qf9): CustomError (code: 0): Syncing Workflow (add-pipeline-28qf9): transient failure: CustomError (code: 0): Error while reporting workflow resource (code: Internal, message: Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run): rpc error: code = Internal desc = Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run, <deleted-the-entire-workflow-spec>: rpc error: code = Internal desc = Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run"

Bobgy · 2021-07-29T06:07:47Z

I cannot find other logs in persistence agent and api server about this workflow.

I propose to mitigate this problem by:

when reporting a workflow, if the workflow is not in KFP DB, we should delete it to avoid leaking resources.
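The proposed control flow can be sketched as follows. This is an illustration only, written in Python with in-memory stand-ins; the actual backend is Go, and `FakeRunStore`, `FakeWorkflowClient`, and `report_workflow` are hypothetical names, not KFP APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FakeRunStore:
    """Stand-in for the KFP DB run table: run_id -> last reported status."""
    runs: Dict[str, str] = field(default_factory=dict)

    def update_run(self, run_id: str, status: str) -> None:
        # Mirrors the "Row not found" failure seen in the logs.
        if run_id not in self.runs:
            raise KeyError(run_id)
        self.runs[run_id] = status


@dataclass
class FakeWorkflowClient:
    """Stand-in for the cluster: the set of existing workflow names."""
    workflows: Set[str] = field(default_factory=set)

    def delete(self, name: str) -> None:
        self.workflows.discard(name)


def report_workflow(store: FakeRunStore, client: FakeWorkflowClient,
                    name: str, run_id: str, status: str) -> bool:
    """Persist the workflow status; delete the workflow if its run is unknown.

    Returns True if the run was updated, False if the workflow was deleted.
    """
    try:
        store.update_run(run_id, status)
    except KeyError:
        # No matching run in the DB: retrying can never succeed, so remove
        # the workflow instead of leaking it.
        client.delete(name)
        return False
    return True
```

The trade-off is that a workflow whose run row is missing for any reason is deleted rather than retried, which is why the root cause still needs investigation.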

Bobgy · 2021-07-29T06:08:23Z

Meanwhile, it's important to keep debugging why these workflows were leaked; there might be other bugs hidden somewhere.

Bobgy · 2021-07-29T09:01:39Z

With the mitigation, remaining workflow count is reduced to:

kubectl get wf  | wc -l
     377

I also found a different cause of leaked workflows: #6192

#6189 (#6190)
* fix(backend): argo workflow not found in KFP DB should be GCed
* Update backend/src/apiserver/resource/resource_manager.go
  Co-authored-by: Niklas Hansson <niklas.sven.hansson@gmail.com>
* address feedback
* tidy
* Update backend/src/apiserver/resource/resource_manager.go
  Co-authored-by: Niklas Hansson <niklas.sven.hansson@gmail.com>

Co-authored-by: Niklas Hansson <niklas.sven.hansson@gmail.com>

jli · 2022-02-17T23:18:07Z

FWIW, I ran into this issue twice in the past week. Running KFP 1.7.1. I didn't notice any other interesting log messages around this time. Let me know if I can provide more debug info.

Log message from ml-pipeline:

"Cannot find reported workflow name="backtesting-96h6b" namespace="kubeflow" runId="c7901d62-971e-4071-b9ce-56d56ddbda37" in run store. Deleting the workflow to avoid resource leaking. This can be caused by installing two KFP instances that try to manage the same workflows or an unknown bug. If you encounter this, recommend reporting more details in #6189."
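For anyone who needs to find out which runs were affected, the identifiers can be pulled out of such log lines with a small parser. This is a rough sketch that assumes the message format is exactly as quoted above (it may differ between KFP versions); `parse_lost_workflow` is a hypothetical helper, not part of KFP.

```python
import re
from typing import Dict, Optional

# Matches the key="value" fields of the "Cannot find reported workflow" message.
LOST_WORKFLOW_RE = re.compile(
    r'Cannot find reported workflow '
    r'name="(?P<name>[^"]+)" '
    r'namespace="(?P<namespace>[^"]+)" '
    r'runId="(?P<run_id>[^"]+)"'
)


def parse_lost_workflow(line: str) -> Optional[Dict[str, str]]:
    """Return name, namespace and run_id from a matching log line, else None."""
    match = LOST_WORKFLOW_RE.search(line)
    return match.groupdict() if match else None
```

Feeding each line of the API server log through this function yields the run IDs to look up in the KFP UI.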

Bobgy · 2022-02-17T23:56:13Z

Thank you! Did this cause any problems for you @jli ?

/cc @chensun

jli · 2022-02-18T00:01:28Z

It was just confusing, and caused some delay for me.

I launched a workflow, then came to check on it hours later, but found that it hadn't run.

The run details page wouldn't load (constant spinner over the dag part of the page). I checked the experiment page, and my run was there but with a grey question mark status icon. When I looked at workflow objects on k8s, I couldn't find anything.

jli · 2022-06-27T19:58:30Z

This is happening more and more frequently for my team - multiple times a week now. It's disruptive for us, because oncall needs to look up what workflow was lost and notify users that their run is never going to work.

Are there any workarounds we could try?

github-actions · 2024-06-19T07:41:45Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions · 2024-07-10T07:41:49Z

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Bobgy added the kind/bug label Jul 29, 2021

google-oss-robot added area/backend area/testing labels Jul 29, 2021

Bobgy mentioned this issue Jul 29, 2021: [release] 1.7.0 tracker #5779 (Closed, 23 tasks)

Bobgy mentioned this issue Jul 29, 2021: chore(v2): v2 launcher with basic publisher. Fixes #6147 #6182 (Merged, 1 task)

Bobgy changed the title ~~[bug] workflow resource leaking on kfp-ci project~~ [bug] workflow resource leaking when no run resource found in KFP DB Jul 29, 2021

Bobgy mentioned this issue Jul 29, 2021: fix(backend): argo workflow not found in KFP DB should be GCed. Part of #6189 #6190 (Merged, 1 task)

zijianjoy assigned Bobgy Aug 13, 2021

Bobgy removed their assignment Feb 17, 2022

HumairAK mentioned this issue Nov 16, 2023: Identify upstream/pipelines issues to work on opendatahub-io/data-science-pipelines-operator#423 (Open)

github-actions bot added the lifecycle/stale label Jun 19, 2024

github-actions bot closed this as completed Jul 10, 2024

github-project-automation bot added this to KFP Runtime Triage Aug 29, 2024

github-project-automation bot moved this to Closed in KFP Runtime Triage Aug 29, 2024
