
KPA scaling results in heterogeneous GPU #507

Closed
nwangfw opened this issue Dec 9, 2024 · 3 comments · Fixed by #513

Comments

@nwangfw
Collaborator

nwangfw commented Dec 9, 2024

🐛 Describe the bug

I deployed both A100 and A40 GPUs but only included profiling for the A100 in the GPU optimizer. The optimizer’s output appears correct, as it only increases the replica number for the A100. However, the A40 unexpectedly scaled up as well.


Steps to Reproduce

Use the configuration in development/app/config/heterogeneous to deploy the A100 and A40 heterogeneous scenario.
Run "make debug" in python/aibrix/aibrix/gpu_optimizer/Makefile to generate requests.

Expected behavior

No response

Environment

No response

@zhangjyr
Collaborator

zhangjyr commented Dec 9, 2024

I debugged the code and found that the way the podautoscaler gets the pod count may be out of date:

func (r *PodAutoscalerReconciler) computeReplicasForMetrics(...) {
	...
	labelsSelector, err := extractLabelSelector(scale)
	...
	originalReadyPodsCount, err := scaler.GetReadyPodsCount(ctx, r.Client, pa.Namespace, labelsSelector)
}
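
For context, here is a minimal sketch (my own illustration, not the actual extractLabelSelector implementation) of how a selector would typically be read from the unstructured scale object; when spec.selector is absent, the result degenerates into a match-everything selector:

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
)

// selectorFromScale is illustrative only. It reads spec.selector from the
// unstructured scale object and converts it into a labels.Selector.
func selectorFromScale(scale *unstructured.Unstructured) (labels.Selector, error) {
	raw, found, err := unstructured.NestedMap(scale.Object, "spec", "selector")
	if err != nil {
		return nil, err
	}
	if !found {
		// No selector populated: labels.Everything() matches every pod.
		return labels.Everything(), nil
	}
	ls := &metav1.LabelSelector{}
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(raw, ls); err != nil {
		return nil, err
	}
	return metav1.LabelSelectorAsSelector(ls)
}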

Meanwhile, scale refers to the scaleTargetRef and does not carry a label selector:

 gvk := schema.GroupVersionKind{
	Group:   mapping.GroupVersionKind.Group,
	Version: mapping.GroupVersionKind.Version,
	Kind:    mapping.GroupVersionKind.Kind,
}
scale := &unstructured.Unstructured{}
scale.SetGroupVersionKind(gvk)
scale.SetNamespace(namespace)
scale.SetName(name)

Since no labelsSelector is set, originalReadyPodsCount now counts all pods in the namespace. I would like to understand the reason for this design and, if possible, either keep the count accurate or at least support the way we use scaleTargetRef now. The following is how I use it:

scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mock-llama2-7b

Note: after further inspection, the problem here is that scale may not populate spec.selector when the scale target is a Deployment and labelsSelector is empty. We'll need a generic fix for this.
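
To make the over-counting concrete, here is a rough sketch (my illustration, not the actual GetReadyPodsCount) of how ready pods are typically counted in a controller; with a match-everything selector, the List call returns every pod in the namespace:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// countReadyPods is illustrative only. It lists pods in the namespace that
// match the selector and counts those that are Running and Ready.
func countReadyPods(ctx context.Context, c client.Client, namespace string, selector labels.Selector) (int32, error) {
	podList := &corev1.PodList{}
	if err := c.List(ctx, podList,
		client.InNamespace(namespace),
		client.MatchingLabelsSelector{Selector: selector}, // labels.Everything() matches all pods
	); err != nil {
		return 0, err
	}
	var ready int32
	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready, nil
}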

@zhangjyr
Collaborator

Update: I found that spec.selector is populated. However, pods are counted by model instead of by deployment, which contradicts the scaleTargetRef definition.

@zhangjyr zhangjyr assigned zhangjyr and unassigned kr11 Dec 10, 2024
@zhangjyr
Collaborator

I finally identified that the problem was due to a misconfiguration of my deployments. Previously, spec.selector.matchLabels contained only model info and could not identify the deployment itself. With the configuration corrected, no extra code modification is needed besides #508.
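
For anyone hitting the same issue, here is an illustrative before/after of the selector misconfiguration (the label keys are hypothetical, not necessarily the ones our configs use); the key point is that matchLabels must uniquely identify this deployment's pods, not just the model they serve:

# Before (misconfigured): the selector only identifies the model, so pods from
# every deployment serving llama2-7b (A100 and A40 alike) are selected.
spec:
  selector:
    matchLabels:
      model.aibrix.ai/name: llama2-7b

# After (corrected): an additional deployment-specific label narrows the
# selection to this deployment's pods only. The pod template labels must
# carry the same labels.
spec:
  selector:
    matchLabels:
      model.aibrix.ai/name: llama2-7b
      app: mock-llama2-7b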

I double-checked the pod count calculation logic. Using spec.selector.matchLabels is common practice, and there is no problem with the design.
