Skip to content

Commit

Permalink
feat: Ensure modelmesh container comes last in list (#202)
Browse files Browse the repository at this point in the history
#### Motivation

Kubernetes starts containers sequentially in order, which can mean the start of later containers is held up if their images have to be pulled. For large model server images this can cause a problem because model-mesh waits for a limited amount of time at startup for the runtime to become ready before returning ready from its own probe.

Making the `mm` container last in the list ensures that no image pulling time will be included in this startup polling time and avoids unnecessary timeouts / extended readiness probe failures.

#### Modifications

- Add `ensureMMContainerIsLast` function to runtime deployment reconciliation logic which runs after other modifications to the pod spec.
- Update unit test fixtures accordingly

#### Result

More stable deployments in cases where the model server image takes a long time to pull.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
  • Loading branch information
njhill authored Aug 3, 2022
1 parent c79fa0a commit d879588
Show file tree
Hide file tree
Showing 2 changed files with 495 additions and 745 deletions.
19 changes: 19 additions & 0 deletions controllers/modelmesh/modelmesh.go
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,7 @@ func (m *Deployment) Apply(ctx context.Context) error {
m.configureMMDeploymentForTLSSecret,
m.configureRuntimePodSpecAnnotations,
m.configureRuntimePodSpecLabels,
m.ensureMMContainerIsLast,
); tErr != nil {
return tErr
}
Expand Down Expand Up @@ -156,6 +157,24 @@ func (m *Deployment) Apply(ctx context.Context) error {
return nil
}

// Kubernetes starts containers sequentially in order, which can mean the start of later
// containers is held up if their images have to be pulled. For large model server images
// this can cause a problem because model-mesh waits for a limited amount of time at
// startup for the runtime to become ready before returning ready from its own probe.
// Making the mm container last in the list ensures that no image pulling time will be
// included in this startup polling time and avoids unnecessary timeouts.
func (m *Deployment) ensureMMContainerIsLast(deployment *appsv1.Deployment) error {
if i, _ := findContainer(ModelMeshContainerName, deployment); i >= 0 {
last := len(deployment.Spec.Template.Spec.Containers) - 1
if i != last {
c := deployment.Spec.Template.Spec.Containers[last]
deployment.Spec.Template.Spec.Containers[last] = deployment.Spec.Template.Spec.Containers[i]
deployment.Spec.Template.Spec.Containers[i] = c
}
}
return nil
}

func (m *Deployment) Delete(ctx context.Context, client client.Client) error {
m.Log.Info("Deleting modelmesh deployment ", "name", m.Name, "namespace", m.Namespace)
return config.Delete(client, m.Owner, "config/internal/base/deployment.yaml.tmpl", m)
Expand Down
Loading

0 comments on commit d879588

Please sign in to comment.