Commit d4db63d

fix: set job's backoffLimit to 0 and report job's active/ready status in workspace (#575)

For the Job workload, backoffLimit specifies the number of retries before the job is marked failed; it defaults to 6. In Kaito fine-tuning there is little benefit in recreating pods 6 times when tuning fails, because the failures usually require user intervention.
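
A minimal sketch of the pattern, using the standard `k8s.io/api/batch/v1` types (names other than `BackoffLimit` are illustrative, not the exact Kaito code):

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// tuningJobSkeleton keeps only the fields relevant to this change; the real
// manifest built by GenerateTuningJobManifest also sets labels, containers, etc.
func tuningJobSkeleton() *batchv1.Job {
	var backoff int32 // zero value; without this field, Kubernetes defaults backoffLimit to 6
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-tuning-job"}, // hypothetical name
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff, // the field is *int32, hence the named variable
			Template:     corev1.PodTemplateSpec{}, // tuning and sidecar containers elided
		},
	}
}
```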

This change also reports the job's ready pod count in the workspace CR so that users do not need to check the job object frequently for status. While the tuning job is running normally, the ready count is 1. If the tuning container completes but the docker sidecar containers fail with infinite retries, the ready count drops to 0.
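
For reference, a minimal sketch of the nil-safe read behind that ready count (`Job.Status.Ready` is an optional `*int32` and may be unset on clusters that do not report it):

```go
package example

import batchv1 "k8s.io/api/batch/v1"

// readyCount treats a missing Ready field as zero ready pods.
func readyCount(job *batchv1.Job) int32 {
	var ready int32
	if job.Status.Ready != nil {
		ready = *job.Status.Ready
	}
	return ready
}
```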

Add another tip to the troubleshooting guide.
Fei-Guo authored Aug 22, 2024
1 parent 5f2f531 commit d4db63d
Showing 3 changed files with 8 additions and 2 deletions.
2 changes: 1 addition & 1 deletion docs/tuning/README.md
@@ -136,6 +136,6 @@ The training job can take a long time depending on the size of the input dataset
```
total steps = number of epochs * (number of samples in dataset / batch size)
```
-where `number of epochs` and `batch size` can be customized in the tuning configmap. However, if the `max_steps` parameter is also specified in the configmap, training will stop after reaching the max steps, even if the specified epochs have not been completed.
+where `number of epochs` and `batch size` can be customized in the tuning configmap. However, if the `max_steps` parameter is also specified in the configmap, training will stop after reaching the max steps, even if the specified epochs have not been completed. Users can track the tuning progress in the job pod's log, reported by the number of steps completed out of the total.

Please file issues if you experience abnormal slowness of the training job.
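
As a hypothetical illustration of the formula above (the numbers are made up, not taken from the Kaito docs): with 2 epochs, 1,000 training samples, and a batch size of 8,

```
total steps = 2 * (1000 / 8) = 250
```
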
6 changes: 5 additions & 1 deletion pkg/controllers/workspace_controller.go
@@ -161,8 +161,12 @@ func (c *WorkspaceReconciler) addOrUpdateWorkspace(ctx context.Context, wObj *ka
return reconcile.Result{}, updateErr
}
} else { // The job is still running
+var readyPod int32
+if job.Status.Ready != nil {
+	readyPod = *job.Status.Ready
+}
if updateErr := c.updateStatusConditionIfNotMatch(ctx, wObj, kaitov1alpha1.WorkspaceConditionTypeSucceeded, metav1.ConditionFalse,
-	"workspacePending", "workspace has not completed"); updateErr != nil {
+	"workspacePending", fmt.Sprintf("workspace has not completed, tuning job has %d active pod, %d ready pod", job.Status.Active, readyPod)); updateErr != nil {
klog.ErrorS(updateErr, "failed to update workspace status", "workspace", klog.KObj(wObj))
return reconcile.Result{}, updateErr
}
2 changes: 2 additions & 0 deletions pkg/resources/manifests.go
@@ -215,6 +215,7 @@ func GenerateTuningJobManifest(ctx context.Context, wObj *kaitov1alpha1.Workspac
},
}, sidecarContainers...)

+var numBackoff int32
return &batchv1.Job{
TypeMeta: v1.TypeMeta{
APIVersion: "batch/v1",
@@ -235,6 +236,7 @@ func GenerateTuningJobManifest(ctx context.Context, wObj *kaitov1alpha1.Workspac
},
},
Spec: batchv1.JobSpec{
+BackoffLimit: &numBackoff, // default is 6. A failed tuning job is unlikely to be self-recoverable, no need to recreate the pod.
Template: corev1.PodTemplateSpec{
ObjectMeta: v1.ObjectMeta{
Labels: labels,
