Commit d4db63d

fix: set job's backoffLimit to 0 and report job's active/ready status in workspace (#575)

For the Job workload, backoffLimit specifies the number of retries before the job is marked failed; it defaults to 6. In Kaito fine-tuning there is little benefit in recreating pods 6 times when tuning fails, because the failures usually require user intervention.
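
A minimal sketch of the pattern, using the standard `k8s.io/api/batch/v1` types (names other than `BackoffLimit` are illustrative, not the exact Kaito code):

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// tuningJobSkeleton keeps only the fields relevant to this change; the real
// manifest built by GenerateTuningJobManifest also sets labels, containers, etc.
func tuningJobSkeleton() *batchv1.Job {
	var backoff int32 // zero value; without this field, Kubernetes defaults backoffLimit to 6
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-tuning-job"}, // hypothetical name
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff, // the field is *int32, hence the named variable
			Template:     corev1.PodTemplateSpec{}, // tuning and sidecar containers elided
		},
	}
}
```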

This change also reports the job's ready pod count in the workspace CR so that users do not need to check the job object frequently for status. While the tuning job is running normally, the ready count is 1. If the tuning container completes but the docker sidecar containers fail with infinite retries, the ready count drops to 0.
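
For reference, a minimal sketch of the nil-safe read behind that ready count (`Job.Status.Ready` is an optional `*int32` and may be unset on clusters that do not report it):

```go
package example

import batchv1 "k8s.io/api/batch/v1"

// readyCount treats a missing Ready field as zero ready pods.
func readyCount(job *batchv1.Job) int32 {
	var ready int32
	if job.Status.Ready != nil {
		ready = *job.Status.Ready
	}
	return ready
}
```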

Add another tip to the troubleshooting guide.
Fei-Guo authored Aug 22, 2024
1 parent 5f2f531 commit d4db63d
Showing 3 changed files with 8 additions and 2 deletions.
2 changes: 1 addition & 1 deletion docs/tuning/README.md
@@ -136,6 +136,6 @@ The training job can take a long time depending on the size of the input dataset
```
total steps = number of epochs * (number of samples in dataset / batch size)
```
-where `number of epochs` and `batch size` can be customized in the tuning configmap. However, if the `max_steps` parameter is also specified in the configmap, training will stop after reaching the max steps, even if the specified epochs have not been completed.
+where `number of epochs` and `batch size` can be customized in the tuning configmap. However, if the `max_steps` parameter is also specified in the configmap, training will stop after reaching the max steps, even if the specified epochs have not been completed. Users can track the tuning progress in the job pod's log, reported by the number of steps completed out of the total.

Please file issues if you experience abnormal slowness of the training job.
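
As a hypothetical illustration of the formula above (the numbers are made up, not taken from the Kaito docs): with 2 epochs, 1,000 training samples, and a batch size of 8,

```
total steps = 2 * (1000 / 8) = 250
```
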
6 changes: 5 additions & 1 deletion pkg/controllers/workspace_controller.go
@@ -161,8 +161,12 @@ func (c *WorkspaceReconciler) addOrUpdateWorkspace(ctx context.Context, wObj *ka
return reconcile.Result{}, updateErr
}
} else { // The job is still running
+var readyPod int32
+if job.Status.Ready != nil {
+	readyPod = *job.Status.Ready
+}
if updateErr := c.updateStatusConditionIfNotMatch(ctx, wObj, kaitov1alpha1.WorkspaceConditionTypeSucceeded, metav1.ConditionFalse,
-	"workspacePending", "workspace has not completed"); updateErr != nil {
+	"workspacePending", fmt.Sprintf("workspace has not completed, tuning job has %d active pod, %d ready pod", job.Status.Active, readyPod)); updateErr != nil {
klog.ErrorS(updateErr, "failed to update workspace status", "workspace", klog.KObj(wObj))
return reconcile.Result{}, updateErr
}
2 changes: 2 additions & 0 deletions pkg/resources/manifests.go
@@ -215,6 +215,7 @@ func GenerateTuningJobManifest(ctx context.Context, wObj *kaitov1alpha1.Workspac
},
}, sidecarContainers...)

+var numBackoff int32
return &batchv1.Job{
TypeMeta: v1.TypeMeta{
APIVersion: "batch/v1",
@@ -235,6 +236,7 @@ func GenerateTuningJobManifest(ctx context.Context, wObj *kaitov1alpha1.Workspac
},
},
Spec: batchv1.JobSpec{
+BackoffLimit: &numBackoff, // default is 6. A failed tuning job is unlikely to be self-recoverable, no need to recreate the pod.
Template: corev1.PodTemplateSpec{
ObjectMeta: v1.ObjectMeta{
Labels: labels,
