Commit
fix: set job's backoffLimit to 0 and report job's active/ready status in workspace (#575)

For a Job workload, backoffLimit specifies the number of retries before the job is marked as failed. It defaults to 6. In Kaito fine-tuning, there is little benefit in recreating pods six times when tuning fails, because the failures usually require user intervention.

This change also reports the job's ready pod count in the workspace CR, so users do not need to check the job object frequently for status. While the tuning job is working, the ready count is 1. If the tuning container completes but the docker sidecar container fails with infinite retries, the ready count becomes 0.

Add another tip to the troubleshooting guide.
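The behavior described above can be illustrated with a minimal sketch. It is not Kaito's controller code: it uses plain Python dictionaries shaped like a `batch/v1` Job, and the helper names (`build_tuning_job`, `ready_pod_count`), the container name, the image string, and `restartPolicy: Never` are assumptions made for illustration. Only `backoffLimit: 0` and the ready counts of 1 and 0 come from the commit message.

```python
def build_tuning_job(name, image):
    """Return a minimal batch/v1 Job manifest with retries disabled.

    Hypothetical helper: the name, image, and restartPolicy are illustrative.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # 0 retries: a failed tuning pod marks the Job failed right away
            # instead of being recreated up to the default of 6 times.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Assumption: Jobs require Never or OnFailure here.
                    "restartPolicy": "Never",
                    "containers": [{"name": "tuning", "image": image}],
                }
            },
        },
    }


def ready_pod_count(job):
    """Read the ready pod count from a Job's status, treating a missing value as 0."""
    status = job.get("status") or {}
    return status.get("ready") or 0


job = build_tuning_job("workspace-tuning", "example.invalid/tuning:latest")

# While the tuning container is running, the Job reports one ready pod.
job["status"] = {"active": 1, "ready": 1}
running_ready = ready_pod_count(job)

# Tuning container finished but a sidecar keeps failing: the pod is no longer ready.
job["status"] = {"active": 1, "ready": 0}
stuck_ready = ready_pod_count(job)
```

A ready count that stays at 0 while the active count stays at 1 is the signal the commit message points to: the pod still exists, but it is not making progress and likely needs user intervention.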
Showing 3 changed files with 8 additions and 2 deletions.