
[RFC]: Cost-efficient LLM Serving with GPU Heterogeneity #435

Closed
Jeffwan opened this issue Nov 26, 2024 · 0 comments · Fixed by #430
Assignees
Labels
area/autoscaling area/gateway kind/enhancement New feature or request kind/feature Categorizes issue or PR as related to a new feature. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Milestone

Comments


Jeffwan commented Nov 26, 2024

Summary

As demand for large-scale model serving grows, ensuring consistent GPU availability has become a challenge, particularly within regions where identical GPU types are often unavailable due to capacity constraints. This shortage necessitates serving the same model deployment on heterogeneous GPUs, such as mixing L40 GPUs with A800s. Additionally, for latency-insensitive applications, users may want to incorporate lower-cost, lower-performance GPUs to reduce expenses. However, this approach introduces complexity in determining the optimal GPU combinations and routing strategies that balance performance and cost.

Key Features:

  1. Model Profiling for Performance Estimation: A profiling framework to predict model performance across various GPU types, enabling informed decisions about resource allocation.
  2. Custom Autoscaler for Heterogeneous Scaling: A tailored autoscaler that manages scaling across deployments and GPU types, ensuring efficient utilization of heterogeneous resources.
  3. Heterogeneous Load-Balancing Algorithm: A load-balancing strategy that distributes requests across GPUs with differing computational capabilities, maximizing overall efficiency while keeping each pool proportionally loaded.
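
To make the intent of features 1 and 3 concrete, here is a minimal sketch of throughput-weighted routing across heterogeneous GPU pools. The GPU names, throughput numbers, and costs are illustrative placeholders (not measured profiles), and this is not the RFC's proposed algorithm, just one simple instance of the idea: profile each GPU type, then weight request routing by profiled capacity.

```python
import random

# Hypothetical profiling output: tokens/s and hourly cost per GPU type.
# These numbers are illustrative assumptions, not real measurements.
GPU_PROFILE = {
    "A800": {"throughput": 5000, "cost_per_hour": 4.0},
    "L40":  {"throughput": 2000, "cost_per_hour": 1.5},
}

def routing_weights(profile):
    """Weight each GPU pool by its profiled throughput so that faster
    GPUs receive proportionally more traffic."""
    total = sum(p["throughput"] for p in profile.values())
    return {gpu: p["throughput"] / total for gpu, p in profile.items()}

def pick_gpu(weights, rng=random.random):
    """Sample a GPU pool according to the routing weights."""
    r = rng()
    cumulative = 0.0
    for gpu, w in weights.items():
        cumulative += w
        if r < cumulative:
            return gpu
    return gpu  # fallback for floating-point rounding at r ~ 1.0
```

A real implementation would also fold in cost (e.g. maximize tokens/s per dollar under a latency SLO) and live queue depth, but the weighting step above captures the basic shape of heterogeneity-aware routing.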

Motivation

No response

Proposed Change

No response

Alternatives Considered

No response

@Jeffwan Jeffwan added kind/enhancement New feature or request area/autoscaling area/gateway priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/feature Categorizes issue or PR as related to a new feature. labels Nov 26, 2024
@Jeffwan Jeffwan added this to the v0.2.0 milestone Nov 26, 2024