[RFC]: Cost-efficient LLM Serving with GPU Heterogeneity #435
Labels
area/autoscaling
area/gateway
kind/enhancement
kind/feature
priority/critical-urgent
Summary
As demand for large-scale model serving grows, ensuring consistent GPU availability has become a challenge, particularly in regions where identical GPU types are often unavailable due to capacity constraints. This shortage necessitates mixing heterogeneous GPUs within the same model deployment, for example pairing L40 GPUs with A800s. Additionally, for latency-insensitive workloads, users may want to incorporate lower-cost, lower-performance GPUs to reduce expenses. However, this approach introduces complexity in determining the optimal GPU combinations and routing strategies that balance performance and cost.
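To make the cost/performance trade-off concrete, here is a minimal sketch of one possible routing policy over heterogeneous GPU pools: latency-sensitive requests go to the highest-throughput pool with spare capacity, while latency-insensitive requests go to the pool with the lowest cost per token. The pool names, prices, and throughput numbers below are purely illustrative assumptions, not part of this proposal; a real implementation would also account for queueing, batching, and autoscaling signals.

```python
from dataclasses import dataclass

@dataclass
class GpuPool:
    name: str
    hourly_cost: float      # USD per GPU-hour (illustrative figure)
    throughput_tps: float   # tokens/sec one replica sustains for this model
    free_slots: int         # replicas with spare capacity

    @property
    def cost_per_mtok(self) -> float:
        # USD per million tokens = hourly cost / tokens generated per hour
        return self.hourly_cost / (self.throughput_tps * 3600) * 1e6

def route(pools: list[GpuPool], latency_sensitive: bool) -> GpuPool:
    """Pick a pool: fastest for latency-sensitive traffic, cheapest otherwise."""
    candidates = [p for p in pools if p.free_slots > 0]
    if not candidates:
        raise RuntimeError("no capacity in any pool")
    if latency_sensitive:
        return max(candidates, key=lambda p: p.throughput_tps)
    return min(candidates, key=lambda p: p.cost_per_mtok)

# Hypothetical pools for the L40/A800 mix mentioned above.
pools = [
    GpuPool("a800-pool", hourly_cost=12.0, throughput_tps=900.0, free_slots=2),
    GpuPool("l40-pool", hourly_cost=2.5, throughput_tps=350.0, free_slots=4),
]
print(route(pools, latency_sensitive=True).name)   # a800-pool (fastest)
print(route(pools, latency_sensitive=False).name)  # l40-pool (cheapest per token)
```

This single-objective greedy choice is only a starting point; the RFC's open question is precisely how to pick GPU combinations and split traffic when both objectives matter at once.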
Key Features:
Motivation
No response
Proposed Change
No response
Alternatives Considered
No response