
KPA scaling results in heterogeneous GPU #507

Closed
nwangfw opened this issue Dec 9, 2024 · 3 comments · Fixed by #513

Comments

@nwangfw
Collaborator

nwangfw commented Dec 9, 2024

🐛 Describe the bug

I deployed both A100 and A40 GPUs but only included profiling for the A100 in the GPU optimizer. The optimizer’s output appears correct, as it only increases the replica number for the A100. However, the A40 unexpectedly scaled up as well.


Steps to Reproduce

Use the configuration in development/app/config/heterogeneous to deploy the A100 and A40 heterogeneous scenario.
Run "make debug" in python/aibrix/aibrix/gpu_optimizer/Makefile to generate requests.

Expected behavior

No response

Environment

No response

@zhangjyr
Collaborator

zhangjyr commented Dec 9, 2024

I debugged the code and found that the way the podautoscaler gets the pod count may be out of date:

func (r *PodAutoscalerReconciler) computeReplicasForMetrics(...) {
	...
	labelsSelector, err := extractLabelSelector(scale)
	...
	originalReadyPodsCount, err := scaler.GetReadyPodsCount(ctx, r.Client, pa.Namespace, labelsSelector)
}
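
For context, here is a minimal sketch (my own illustration, not the actual extractLabelSelector implementation) of how a selector would typically be read from the unstructured scale object; when spec.selector is absent, the result degenerates into a match-everything selector:

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
)

// selectorFromScale is illustrative only. It reads spec.selector from the
// unstructured scale object and converts it into a labels.Selector.
func selectorFromScale(scale *unstructured.Unstructured) (labels.Selector, error) {
	raw, found, err := unstructured.NestedMap(scale.Object, "spec", "selector")
	if err != nil {
		return nil, err
	}
	if !found {
		// No selector populated: labels.Everything() matches every pod.
		return labels.Everything(), nil
	}
	ls := &metav1.LabelSelector{}
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(raw, ls); err != nil {
		return nil, err
	}
	return metav1.LabelSelectorAsSelector(ls)
}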

Meanwhile, scale refers to the scaleTargetRef and does not carry a label selector:

 gvk := schema.GroupVersionKind{
	Group:   mapping.GroupVersionKind.Group,
	Version: mapping.GroupVersionKind.Version,
	Kind:    mapping.GroupVersionKind.Kind,
}
scale := &unstructured.Unstructured{}
scale.SetGroupVersionKind(gvk)
scale.SetNamespace(namespace)
scale.SetName(name)

Since no labelsSelector is set, originalReadyPodsCount now counts all pods in the namespace. I would like to understand the reason for this design and, if possible, either keep the count accurate or at least support the way we use scaleTargetRef now. The following is how I use it:

scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mock-llama2-7b

Note: after further inspection, the problem here is that scale may not populate spec.selector when the scale target is a Deployment and labelsSelector is empty. We'll need a generic fix for this.
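
To make the over-counting concrete, here is a rough sketch (my illustration, not the actual GetReadyPodsCount) of how ready pods are typically counted in a controller; with a match-everything selector, the List call returns every pod in the namespace:

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// countReadyPods is illustrative only. It lists pods in the namespace that
// match the selector and counts those that are Running and Ready.
func countReadyPods(ctx context.Context, c client.Client, namespace string, selector labels.Selector) (int32, error) {
	podList := &corev1.PodList{}
	if err := c.List(ctx, podList,
		client.InNamespace(namespace),
		client.MatchingLabelsSelector{Selector: selector}, // labels.Everything() matches all pods
	); err != nil {
		return 0, err
	}
	var ready int32
	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready, nil
}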

@zhangjyr
Collaborator

Update: I found that spec.selector is populated. However, pods are counted by model instead of by deployment, which contradicts the scaleTargetRef definition.

@zhangjyr zhangjyr assigned zhangjyr and unassigned kr11 Dec 10, 2024
@zhangjyr
Collaborator

I finally identified that the problem was due to a misconfiguration of my deployments. Previously, spec.selector.matchLabels contained only model info and could not identify the deployment itself. With the configuration corrected, no extra code modification is needed besides #508.
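
For anyone hitting the same issue, here is an illustrative before/after of the selector misconfiguration (the label keys are hypothetical, not necessarily the ones our configs use); the key point is that matchLabels must uniquely identify this deployment's pods, not just the model they serve:

# Before (misconfigured): the selector only identifies the model, so pods from
# every deployment serving llama2-7b (A100 and A40 alike) are selected.
spec:
  selector:
    matchLabels:
      model.aibrix.ai/name: llama2-7b

# After (corrected): an additional deployment-specific label narrows the
# selection to this deployment's pods only. The pod template labels must
# carry the same labels.
spec:
  selector:
    matchLabels:
      model.aibrix.ai/name: llama2-7b
      app: mock-llama2-7b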

I double-checked the pod count calculation logic. Using spec.selector.matchLabels is common practice, and there is no problem with the design.
