
[RFC]: Cost-efficient LLM Serving with GPU Heterogeneity #435

Closed
Jeffwan opened this issue Nov 26, 2024 · 0 comments · Fixed by #430
Assignees
Labels
area/autoscaling area/gateway kind/enhancement New feature or request kind/feature Categorizes issue or PR as related to a new feature. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Milestone

Comments


Jeffwan commented Nov 26, 2024

Summary

As demand for large-scale model serving grows, ensuring consistent GPU availability has become a challenge, particularly within regions where identical GPU types are often unavailable due to capacity constraints. This shortage necessitates serving the same model deployment on heterogeneous GPUs, such as mixing L40 GPUs with A800s. Additionally, for latency-insensitive applications, users may want to incorporate lower-cost, lower-performance GPUs to reduce expenses. However, this approach introduces complexity in determining the optimal GPU combinations and routing strategies that balance performance and cost.

Key Features:

  1. Model Profiling for Performance Estimation: A profiling framework to predict model performance across various GPU types, enabling informed decisions about resource allocation.
  2. Custom Autoscaler for Heterogeneous Scaling: A tailored autoscaler that manages scaling across deployments and GPU types, ensuring efficient utilization of heterogeneous resources.
  3. Heterogeneous Load-Balancing Algorithm: A load-balancing strategy that distributes requests across GPUs with differing computational capabilities, maximizing overall efficiency while keeping each pool proportionally loaded.
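
To make the intent of features 1 and 3 concrete, here is a minimal sketch of throughput-weighted routing across heterogeneous GPU pools. The GPU names, throughput numbers, and costs are illustrative placeholders (not measured profiles), and this is not the RFC's proposed algorithm, just one simple instance of the idea: profile each GPU type, then weight request routing by profiled capacity.

```python
import random

# Hypothetical profiling output: tokens/s and hourly cost per GPU type.
# These numbers are illustrative assumptions, not real measurements.
GPU_PROFILE = {
    "A800": {"throughput": 5000, "cost_per_hour": 4.0},
    "L40":  {"throughput": 2000, "cost_per_hour": 1.5},
}

def routing_weights(profile):
    """Weight each GPU pool by its profiled throughput so that faster
    GPUs receive proportionally more traffic."""
    total = sum(p["throughput"] for p in profile.values())
    return {gpu: p["throughput"] / total for gpu, p in profile.items()}

def pick_gpu(weights, rng=random.random):
    """Sample a GPU pool according to the routing weights."""
    r = rng()
    cumulative = 0.0
    for gpu, w in weights.items():
        cumulative += w
        if r < cumulative:
            return gpu
    return gpu  # fallback for floating-point rounding at r ~ 1.0
```

A real implementation would also fold in cost (e.g. maximize tokens/s per dollar under a latency SLO) and live queue depth, but the weighting step above captures the basic shape of heterogeneity-aware routing.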

Motivation

No response

Proposed Change

No response

Alternatives Considered

No response

@Jeffwan Jeffwan added kind/enhancement New feature or request area/autoscaling area/gateway priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/feature Categorizes issue or PR as related to a new feature. labels Nov 26, 2024
@Jeffwan Jeffwan added this to the v0.2.0 milestone Nov 26, 2024