🐛 Describe the bug

I deployed both A100 and A40 GPUs but included profiling only for the A100 in the GPU optimizer. The optimizer's output appears correct, as it increases the replica count only for the A100. However, the A40 deployment unexpectedly scaled up as well.
Steps to Reproduce
1. Use the configuration in development/app/config/heterogeneous to deploy the A100 and A40 heterogeneous scenario.
2. Run "make debug" from python/aibrix/aibrix/gpu_optimizer/Makefile to generate requests.
Expected behavior
No response
Environment
No response
Since no labelsSelector was set, originalReadyPodsCount now counts all pods in the namespace. I would like to understand the reason for this design and, if possible, have it updated, or at least have it support the way we currently use scaleTargetRef. The following is how I use it:
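(The original manifest was not captured in this thread. Below is a minimal, hypothetical sketch of the kind of setup being described: an autoscaler resource whose scaleTargetRef points at a single Deployment while labelsSelector is left empty. The apiVersion, kind, resource names, and min/max fields are assumptions for illustration, not the actual configuration from this issue.)

```yaml
# Hypothetical sketch only: field names follow the terms used in this thread
# (scaleTargetRef, labelsSelector); names and versions are assumed.
apiVersion: autoscaling.aibrix.ai/v1alpha1   # assumed apiVersion
kind: PodAutoscaler                          # assumed kind
metadata:
  name: llama-2-7b-a100-scaler               # assumed name
  namespace: default
spec:
  scaleTargetRef:                # intended to scope scaling to one Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: llama-2-7b-a100        # assumed Deployment name
  # labelsSelector intentionally omitted here, which is what led to the
  # namespace-wide originalReadyPodsCount described above.
  minReplicas: 1
  maxReplicas: 8
```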
Note: after further inspection, the problem here is that the scale subresource might not populate spec.selector if the scale target is a Deployment and labelsSelector is empty. We'll need a generic fix for this.
Update: I found that spec.selector is populated. However, pods are counted by model instead of by deployment, which contradicts the scaleTargetRef definition.
I finally identified that the problem was due to my misconfiguration of the deployments. Previously, spec.selector.matchLabels contained only model info and could not identify the deployment itself. With a correct configuration, no extra code modification is needed beyond #508.
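For reference, here is a minimal sketch of the kind of selector misconfiguration and fix described above. The label keys, deployment names, and container image are illustrative assumptions, not the actual manifests from this deployment.

```yaml
# Hypothetical "before": matchLabels identifies only the model, so pods from
# every deployment serving this model (A100 and A40 alike) are counted together.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-2-7b-a40                      # assumed name
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: llama-2-7b      # assumed label key; model info only
  template:
    metadata:
      labels:
        model.aibrix.ai/name: llama-2-7b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest    # placeholder image
---
# Hypothetical "after": a deployment-identifying label is added to both
# matchLabels and the pod template, so the ready-pod count is scoped to this
# Deployment only and the A40 no longer scales along with the A100.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-2-7b-a40
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: llama-2-7b
      app.kubernetes.io/instance: llama-2-7b-a40   # deployment-specific label (assumed)
  template:
    metadata:
      labels:
        model.aibrix.ai/name: llama-2-7b
        app.kubernetes.io/instance: llama-2-7b-a40
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
```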
I double-checked the pod count calculation logic. Using spec.selector.matchLabels is common practice, and there is no problem with the design.