[bug] workflow resource leaking when no run resource found in KFP DB #6189

Closed
Bobgy opened this issue Jul 29, 2021 · 10 comments
Labels
area/backend, area/testing, kind/bug, lifecycle/stale

Comments


Bobgy commented Jul 29, 2021

What steps did you take

  1. Connect to the kfp-standalone-1 cluster in the kfp-ci project.

  2. Count the current workflows (1252):

    kubectl get workflow | wc -l
        1252

  3. Check the ages of the current workflows:

    kubectl get workflow | less
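
To list only the workflows that have outlived the GC window instead of paging through `less`, something like the following could be run over the output of `kubectl get workflow -o json`. This is a minimal sketch: the function name `stale_workflows` and the 1-day default are illustrative assumptions, and it relies only on the standard `metadata.creationTimestamp` field of Kubernetes objects.

```python
from datetime import datetime, timedelta, timezone


def stale_workflows(workflow_list, now, max_age=timedelta(days=1)):
    """Return the names of workflows older than max_age.

    workflow_list: parsed output of `kubectl get workflow -o json`.
    now: timezone-aware datetime to compare against.
    max_age: assumed GC window (1d in this issue).
    """
    stale = []
    for item in workflow_list.get("items", []):
        meta = item["metadata"]
        # Kubernetes timestamps are RFC 3339 in UTC, e.g. 2021-07-28T22:30:02Z
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stale.append(meta["name"])
    return stale
```

The parsed input can come from `json.load(sys.stdin)` when the script is fed by `kubectl get workflow -o json`, with `datetime.now(timezone.utc)` passed as `now`.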

What happened:

I found many workflows older than 1d, which is our configured workflow GC time.
As a result, there are too many Pods on each node, which is crashing the GKE metrics server.

What did you expect to happen:

Workflows should be GCed after being persisted to KFP DB.

Environment:

  • How do you deploy Kubeflow Pipelines (KFP)? standalone
  • KFP version: 1.7.0-rc.2

Labels

/area backend

/area testing


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.


Bobgy commented Jul 29, 2021

Found logs:

2021-07-29 06:30:02.824 HKT time="2021-07-28T22:30:02Z" level=error msg="Transient failure while syncing resource (kubeflow/add-pipeline-28qf9): CustomError (code: 0): Syncing Workflow (add-pipeline-28qf9): transient failure: CustomError (code: 0): Error while reporting workflow resource (code: Internal, message: Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run): rpc error: code = Internal desc = Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run, <deleted-the-entire-workflow-spec>: rpc error: code = Internal desc = Report workflow failed.: Failed to update the run.: InternalServerError: Failed to update run 074cc620-ad81-4c2a-bff2-ad6afd7eb077. Row not found: Failed to update run"


Bobgy commented Jul 29, 2021

I cannot find any other logs about this workflow in the persistence agent or the API server.

I propose to mitigate this problem as follows:

When reporting a workflow, if its run is not in the KFP DB, we should delete the workflow to avoid leaking resources.
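
A rough sketch of that control flow, for illustration only. This is not the Go code in `backend/src/apiserver/resource/resource_manager.go`; `RunNotFound`, `InMemoryRunStore`, `report_workflow` and the `delete_workflow` callback are all hypothetical names standing in for the run store and the Kubernetes client.

```python
class RunNotFound(Exception):
    """Raised when a run ID has no row in the (hypothetical) run store."""


class InMemoryRunStore:
    """Stand-in for the KFP DB run table; hypothetical, for illustration."""

    def __init__(self, runs=None):
        self.runs = dict(runs or {})

    def update_run(self, run_id, status):
        if run_id not in self.runs:
            raise RunNotFound(run_id)
        self.runs[run_id] = status


def report_workflow(workflow, run_store, delete_workflow):
    """Persist the workflow status; delete the workflow if its run is unknown.

    Returns True if the status was persisted, False if the workflow was
    deleted because no matching run exists.
    """
    try:
        run_store.update_run(workflow["run_id"], workflow["status"])
        return True
    except RunNotFound:
        # Without a run row the report can never succeed, so retrying
        # only keeps the workflow (and its Pods) around indefinitely.
        delete_workflow(workflow["namespace"], workflow["name"])
        return False
```

The point of the sketch is that "row not found" is treated as a terminal condition that triggers cleanup, rather than as a transient failure to be retried.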


Bobgy commented Jul 29, 2021

Meanwhile, it's important to keep debugging why these workflows were leaked, because there might be other bugs hidden somewhere.

Bobgy changed the title from "[bug] workflow resource leaking on kfp-ci project" to "[bug] workflow resource leaking when no run resource found in KFP DB" on Jul 29, 2021

Bobgy commented Jul 29, 2021

With the mitigation, the remaining workflow count is reduced to:

kubectl get wf  | wc -l
     377

I also found a different cause of leaked workflows: #6192

google-oss-robot pushed a commit that referenced this issue Aug 5, 2021
#6189 (#6190)

* fix(backend): argo workflow not found in KFP DB should be GCed

* Update backend/src/apiserver/resource/resource_manager.go

Co-authored-by: Niklas Hansson <niklas.sven.hansson@gmail.com>

* address feedback

* tidy

* Update backend/src/apiserver/resource/resource_manager.go

Co-authored-by: Niklas Hansson <niklas.sven.hansson@gmail.com>

Co-authored-by: Niklas Hansson <niklas.sven.hansson@gmail.com>

jli commented Feb 17, 2022

FWIW, I ran into this issue twice in the past week. Running KFP 1.7.1. I didn't notice any other interesting log messages around this time. Let me know if I can provide more debug info.

Log message from ml-pipeline:

"Cannot find reported workflow name="backtesting-96h6b" namespace="kubeflow" runId="c7901d62-971e-4071-b9ce-56d56ddbda37" in run store. Deleting the workflow to avoid resource leaking. This can be caused by installing two KFP instances that try to manage the same workflows or an unknown bug. If you encounter this, recommend reporting more details in #6189."

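For anyone who needs to work out which workflows were deleted this way, the message can be parsed out of the ml-pipeline logs. A minimal sketch, assuming the wording matches the message quoted above (it may differ between KFP versions); `LOST_WORKFLOW_RE` and `parse_lost_workflow` are illustrative names.

```python
import re

# Pattern built from the log message quoted in this thread; the exact
# wording is an assumption and may change between KFP versions.
LOST_WORKFLOW_RE = re.compile(
    r'Cannot find reported workflow name="(?P<name>[^"]+)" '
    r'namespace="(?P<namespace>[^"]+)" '
    r'runId="(?P<run_id>[^"]+)" in run store'
)


def parse_lost_workflow(line):
    """Return a dict with name, namespace and run_id, or None if no match."""
    match = LOST_WORKFLOW_RE.search(line)
    return match.groupdict() if match else None
```

Running this over each log line yields the run IDs whose workflows were removed, which can then be matched against runs shown in the UI.
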

Bobgy commented Feb 17, 2022

Thank you! Did this cause any problems for you, @jli?

/cc @chensun

Bobgy removed their assignment on Feb 17, 2022

jli commented Feb 18, 2022

It was just confusing, and caused some delay for me.

I launched a workflow, then came to check on it hours later, but found that it hadn't run.

The run details page wouldn't load (constant spinner over the DAG part of the page). I checked the experiment page, and my run was there but with a grey question mark status icon. When I looked at workflow objects on k8s, I couldn't find anything.


jli commented Jun 27, 2022

This is happening more and more frequently for my team - multiple times a week now. It's disruptive for us, because oncall needs to look up what workflow was lost and notify users that their run is never going to work.

Are there any workarounds we could try?


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Jun 19, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
