
add request routers - least kv cache, least expected latency #543

Merged
Aspirin96 merged 17 commits into main from binbin/router on Jan 3, 2025

Conversation

Aspirin96 (Collaborator)

Pull Request Description

  1. Implement a least-KV-cache router, which directs requests to the pod with the least KV cache usage.
  2. Implement a least-expected-latency router, which directs requests to the pod with the least expected total latency (queue + prefill + decode); see the sketch below.
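
A minimal Go sketch of the two policies, assuming a hypothetical podStats struct with precomputed per-pod estimates; the actual routers read these signals from the aibrix metrics cache, whose API differs.

package router

import "math"

// podStats bundles the per-pod signals each policy reads (hypothetical field
// names; the real routers pull these from the aibrix metrics cache).
type podStats struct {
	name         string
	kvCacheUsage float64 // fraction of KV cache blocks in use
	queueSec     float64 // estimated queueing delay for a new request
	prefillSec   float64 // estimated prefill time for the predicted prompt length
	decodeSec    float64 // estimated decode time for the predicted output length
}

// argmin returns the name of the pod minimizing score.
func argmin(pods []podStats, score func(podStats) float64) string {
	best, bestScore := "", math.MaxFloat64
	for _, p := range pods {
		if s := score(p); s < bestScore {
			best, bestScore = p.name, s
		}
	}
	return best
}

// leastKVCache implements policy 1: route to the pod with the least KV cache usage.
func leastKVCache(pods []podStats) string {
	return argmin(pods, func(p podStats) float64 { return p.kvCacheUsage })
}

// leastExpectedLatency implements policy 2: route to the pod with the least
// expected total latency (queue + prefill + decode).
func leastExpectedLatency(pods []podStats) string {
	return argmin(pods, func(p podStats) float64 {
		return p.queueSec + p.prefillSec + p.decodeSec
	})
}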

Related Issues

Resolves: #303


MetricType: MetricType{
	Query: PromQL,
},
PromQL: `increase(vllm:request_prompt_tokens_sum{instance="${instance}", model_name="${model_name}", job="pods"}[1d]) / increase(vllm:request_prompt_tokens_count{instance="${instance}", model_name="${model_name}", job="pods"}[1d])`,
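
For reference: this query divides the one-day increase of the prompt-token sum by the one-day increase of the request count, i.e. the mean prompt tokens per request over the trailing day. The generation-side average presumably uses the analogous vllm:request_generation_tokens_sum and _count series.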

Collaborator: How was 1d decided?

Aspirin96 (Collaborator Author): The metrics AvgPromptToksPerReq and AvgGenerationToksPerReq are used for token-length prediction; they should be stable and reflect the overall distribution, so we chose a long time window of 1d.

Jeffwan (Collaborator), Dec 30, 2024: Makes sense.

cntGeneration += 1
}
}
guessPromptTokens := 10.0
Collaborator: I am curious about the 10.0 and the following 100.0. Are these numbers meaningful, or just for initialization?

Aspirin96 (Collaborator Author): Just for initialization; they need to be tuned through experiments.
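
A minimal sketch of how such fallback defaults might be applied; only guessPromptTokens and the 10.0/100.0 values appear in the diff, and the surrounding names are hypothetical.

package router

// guessTokenCounts estimates the per-request prompt and generation lengths
// from accumulated history, falling back to the untuned defaults (10.0 and
// 100.0) when no samples have been observed yet.
func guessTokenCounts(sumPrompt, sumGeneration float64, cntPrompt, cntGeneration int) (float64, float64) {
	guessPromptTokens := 10.0      // placeholder default, to be tuned by experiments
	guessGenerationTokens := 100.0 // placeholder default, to be tuned by experiments
	if cntPrompt > 0 {
		guessPromptTokens = sumPrompt / float64(cntPrompt)
	}
	if cntGeneration > 0 {
		guessGenerationTokens = sumGeneration / float64(cntGeneration)
	}
	return guessPromptTokens, guessGenerationTokens
}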

continue
}

busyTimeRatio, err := r.cache.GetPodMetric(pod.Name, "gpu_busy_time_ratio") // todo: replace mock
Collaborator: What metric do you expect here? GPU utilization?

Aspirin96 (Collaborator Author): Yes, GPU utilization measured by busy-time ratio. We've discussed this with @brosoul, and the metric will be added later.

Collaborator: Sounds good. For metrics that are not yet mature, we can add // TODOs and not enable the policy at initialization, which means we can still merge the PR without waiting.
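
A hypothetical sketch of that pattern: the lookup stays behind a TODO, pods without the metric are skipped, and the caller can fall back to a mature policy. getPodMetric stands in for the cache lookup here and is not aibrix's actual API.

package router

import "math"

// leastBusyTime routes to the pod with the lowest GPU busy-time ratio. It
// returns false when no pod reported the metric, so the caller can fall back
// to a mature policy instead of guessing.
func leastBusyTime(pods []string, getPodMetric func(pod, metric string) (float64, error)) (string, bool) {
	best, bestRatio := "", math.MaxFloat64
	for _, pod := range pods {
		// TODO: replace mock once gpu_busy_time_ratio is exported for real.
		ratio, err := getPodMetric(pod, "gpu_busy_time_ratio")
		if err != nil {
			continue // metric not available on this pod yet
		}
		if ratio < bestRatio {
			best, bestRatio = pod, ratio
		}
	}
	return best, best != ""
}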

Aspirin96 changed the title from "[DO NOT MERGE] add 2 new routers" to "[DO NOT MERGE] add request routers - least kv cache, least expected latency" on Dec 30, 2024
Aspirin96 changed the title from "[DO NOT MERGE] add request routers - least kv cache, least expected latency" to "add request routers - least kv cache, least expected latency" on Jan 3, 2025
Aspirin96 merged commit ca7b372 into main on Jan 3, 2025
10 checks passed
Aspirin96 deleted the binbin/router branch on Jan 3, 2025
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* Add random adapter scheduler

* Add leastExpectedLatency request router

* Add least latency scheduler

* Add least kv cache router

* Add bin packing scheduler (first-fit as example)

* Add least utilization scheduler (RPM, TPM, kv_cache, busy_time as utilization)

* Add least busy time (or least gpu utilization) router

* Add weighted round robin router

* Add metrics needed by scheduling (#486)

* add scheduler metrics

* add metrics into mock app

* refactor CacheUsagePerc of CPU and GPU

* add instance label into promQL

* Adapt the metrics interface

Change-Id: Icc2a017cb2db445fb760ced2c0034a65f9b37fa8

* add .vscode to gitignore

Change-Id: I36a0f54ca1c8a3c16b89c0077df77a119440bed3

* fix mock cpu_cache_usage_perc metrics

* feat: add least kv cache into route strategy

* add 2 new routers

* rm stateful router: weighted round robin

* rm scheduler changes

---------

Co-authored-by: chenbinbin <chenbinbin.1996@bytedance.com>
Co-authored-by: chenzuzhi <chenzuzhi@bytedance.com>
Co-authored-by: brosoul <brosoul@126.com>
Successfully merging this pull request may close: Support Lora base routing algorithm (#303).