
add request routers - least kv cache, least expected latency #543

Merged
Aspirin96 merged 17 commits into main from binbin/router on Jan 3, 2025

Conversation

Aspirin96 (Collaborator)

Pull Request Description

  1. Implement a least-KV-cache router, which directs requests to the pod with the least KV cache usage.
  2. Implement a least-expected-latency router, which directs requests to the pod with the least expected total latency (queue + prefill + decode); see the sketch below.
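
A minimal Go sketch of the two policies, assuming a hypothetical podStats struct with precomputed per-pod estimates; the actual routers read these signals from the aibrix metrics cache, whose API differs.

package router

import "math"

// podStats bundles the per-pod signals each policy reads (hypothetical field
// names; the real routers pull these from the aibrix metrics cache).
type podStats struct {
	name         string
	kvCacheUsage float64 // fraction of KV cache blocks in use
	queueSec     float64 // estimated queueing delay for a new request
	prefillSec   float64 // estimated prefill time for the predicted prompt length
	decodeSec    float64 // estimated decode time for the predicted output length
}

// argmin returns the name of the pod minimizing score.
func argmin(pods []podStats, score func(podStats) float64) string {
	best, bestScore := "", math.MaxFloat64
	for _, p := range pods {
		if s := score(p); s < bestScore {
			best, bestScore = p.name, s
		}
	}
	return best
}

// leastKVCache implements policy 1: route to the pod with the least KV cache usage.
func leastKVCache(pods []podStats) string {
	return argmin(pods, func(p podStats) float64 { return p.kvCacheUsage })
}

// leastExpectedLatency implements policy 2: route to the pod with the least
// expected total latency (queue + prefill + decode).
func leastExpectedLatency(pods []podStats) string {
	return argmin(pods, func(p podStats) float64 {
		return p.queueSec + p.prefillSec + p.decodeSec
	})
}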

Related Issues

Resolves: #303


MetricType: MetricType{
	Query: PromQL,
},
PromQL: `increase(vllm:request_prompt_tokens_sum{instance="${instance}", model_name="${model_name}", job="pods"}[1d]) / increase(vllm:request_prompt_tokens_count{instance="${instance}", model_name="${model_name}", job="pods"}[1d])`,
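
For reference: this query divides the one-day increase of the prompt-token sum by the one-day increase of the request count, i.e. the mean prompt tokens per request over the trailing day. The generation-side average presumably uses the analogous vllm:request_generation_tokens_sum and _count series.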

Collaborator: How was 1d decided?

Aspirin96 (Collaborator Author): The metrics AvgPromptToksPerReq and AvgGenerationToksPerReq are used for token-length prediction; they should be stable and reflect the overall distribution, so we chose a long time window of 1d.

Jeffwan (Collaborator), Dec 30, 2024: Makes sense.

cntGeneration += 1
}
}
guessPromptTokens := 10.0
Collaborator: I am curious about the 10.0 and the following 100.0. Are these numbers meaningful, or just for initialization?

Aspirin96 (Collaborator Author): Just for initialization; they need to be tuned through experiments.
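
A minimal sketch of how such fallback defaults might be applied; only guessPromptTokens and the 10.0/100.0 values appear in the diff, and the surrounding names are hypothetical.

package router

// guessTokenCounts estimates the per-request prompt and generation lengths
// from accumulated history, falling back to the untuned defaults (10.0 and
// 100.0) when no samples have been observed yet.
func guessTokenCounts(sumPrompt, sumGeneration float64, cntPrompt, cntGeneration int) (float64, float64) {
	guessPromptTokens := 10.0      // placeholder default, to be tuned by experiments
	guessGenerationTokens := 100.0 // placeholder default, to be tuned by experiments
	if cntPrompt > 0 {
		guessPromptTokens = sumPrompt / float64(cntPrompt)
	}
	if cntGeneration > 0 {
		guessGenerationTokens = sumGeneration / float64(cntGeneration)
	}
	return guessPromptTokens, guessGenerationTokens
}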

continue
}

busyTimeRatio, err := r.cache.GetPodMetric(pod.Name, "gpu_busy_time_ratio") // todo: replace mock
Collaborator: What metric do you expect here? GPU utilization?

Aspirin96 (Collaborator Author): Yes, GPU utilization measured by busy-time ratio. We've discussed this with @brosoul, and the metric will be added later.

Collaborator: Sounds good. For metrics that are not yet mature, we can add // TODOs and not enable the policy at initialization, which means we can still merge the PR without waiting.
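
A hypothetical sketch of that pattern: the lookup stays behind a TODO, pods without the metric are skipped, and the caller can fall back to a mature policy. getPodMetric stands in for the cache lookup here and is not aibrix's actual API.

package router

import "math"

// leastBusyTime routes to the pod with the lowest GPU busy-time ratio. It
// returns false when no pod reported the metric, so the caller can fall back
// to a mature policy instead of guessing.
func leastBusyTime(pods []string, getPodMetric func(pod, metric string) (float64, error)) (string, bool) {
	best, bestRatio := "", math.MaxFloat64
	for _, pod := range pods {
		// TODO: replace mock once gpu_busy_time_ratio is exported for real.
		ratio, err := getPodMetric(pod, "gpu_busy_time_ratio")
		if err != nil {
			continue // metric not available on this pod yet
		}
		if ratio < bestRatio {
			best, bestRatio = pod, ratio
		}
	}
	return best, best != ""
}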

Aspirin96 changed the title from "[DO NOT MERGE] add 2 new routers" to "[DO NOT MERGE] add request routers - least kv cache, least expected latency" on Dec 30, 2024
Aspirin96 changed the title from "[DO NOT MERGE] add request routers - least kv cache, least expected latency" to "add request routers - least kv cache, least expected latency" on Jan 3, 2025
Aspirin96 merged commit ca7b372 into main on Jan 3, 2025
10 checks passed
Aspirin96 deleted the binbin/router branch on Jan 3, 2025
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* Add random adapter scheduler

* Add leastExpectedLatency request router

* Add least latency scheduler

* Add least kv cache router

* Add bin packing scheduler (first-fit as example)

* Add least utilization scheduler (RPM, TPM, kv_cache, busy_time as utilization)

* Add least busy time (or least gpu utilization) router

* Add weighted round robin router

* Add metrics needed by scheduling (#486)

* add scheduler metrics

* add metrics into mock app

* refactor CacheUsagePerc of CPU and GPU

* add instance label into promQL

* Adapt the metrics interface

Change-Id: Icc2a017cb2db445fb760ced2c0034a65f9b37fa8

* add .vscode to gitignore

Change-Id: I36a0f54ca1c8a3c16b89c0077df77a119440bed3

* fix mock cpu_cache_usage_perc metrics

* feat: add least kv cache into route strategy

* add 2 new routers

* rm stateful router: weighted round robin

* rm scheduler changes

---------

Co-authored-by: chenbinbin <chenbinbin.1996@bytedance.com>
Co-authored-by: chenzuzhi <chenzuzhi@bytedance.com>
Co-authored-by: brosoul <brosoul@126.com>
Successfully merging this pull request may close: Support Lora base routing algorithm (#303).