scheduling changes for lora affinity load balancing #423

Open

wants to merge 7 commits into base: main

Conversation

kaushikmitr
Contributor

@kaushikmitr kaushikmitr commented Feb 27, 2025

This pull request includes several changes to the deployment configuration, metrics collection, and scheduling logic. The most important changes include updating metrics collection to include waiting adapters, and implementing a new pod selection strategy that balances load while considering model affinity.

Scheduling Logic Enhancements:

  • pkg/epp/scheduling/filter.go: Replaced the loRAAffinityPredicate function with a new loRASoftAffinityPredicate function that prioritizes pods with existing model affinity while still allowing load balancing through randomization, as long as there is room to fit another adapter on the pod (a sketch of this selection logic follows the list below).
  • pkg/epp/scheduling/scheduler.go: Updated the scheduling configuration to use the new loRASoftAffinityPredicate function and increased the queueingThresholdLoRA value from 50 to 128. Added a new loraAffinityThreshold constant to indicate the probability of preferring pods with model affinity.
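
To make the new selection strategy concrete, below is a minimal, self-contained Go sketch of the soft-affinity idea. The podMetrics type, the softAffinityFilter name, and the example data are simplified assumptions for illustration; they do not mirror the repository's actual filter signature or datastore types.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// podMetrics is a simplified stand-in for the scheduler's pod metrics type.
type podMetrics struct {
	Name            string
	ActiveModels    map[string]struct{} // adapters currently running or waiting on the pod
	MaxActiveModels int                 // maximum number of adapters the pod can hold
}

// loraAffinityThreshold is the probability of preferring pods that already
// have the requested adapter over pods that merely have room to load it.
const loraAffinityThreshold = 0.999

// softAffinityFilter categorizes pods into those with existing affinity for
// the requested adapter and those with spare adapter slots, then returns one
// of the two groups based on a weighted random choice.
func softAffinityFilter(targetModel string, pods []*podMetrics) []*podMetrics {
	var affinity, available []*podMetrics
	for _, pod := range pods {
		if _, ok := pod.ActiveModels[targetModel]; ok {
			affinity = append(affinity, pod)
		} else if len(pod.ActiveModels) < pod.MaxActiveModels {
			available = append(available, pod)
		}
	}

	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	if len(affinity) > 0 {
		// Mostly stick with affinity pods; occasionally spill over to
		// available pods so a hot adapter can spread to more replicas.
		if len(available) == 0 || rng.Float64() < loraAffinityThreshold {
			return affinity
		}
		return available
	}
	return available
}

func main() {
	pods := []*podMetrics{
		{Name: "pod-a", ActiveModels: map[string]struct{}{"tweet-summary-0": {}}, MaxActiveModels: 4},
		{Name: "pod-b", ActiveModels: map[string]struct{}{}, MaxActiveModels: 4},
	}
	for _, p := range softAffinityFilter("tweet-summary-0", pods) {
		fmt.Println("candidate:", p.Name)
	}
}
```

With the threshold close to 1.0, pods that already serve the adapter are almost always preferred, while the small spill-over probability lets a heavily loaded adapter gradually expand onto additional pods.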

Deployment Configuration Changes:

Metrics Collection Updates:

  • pkg/epp/backend/vllm/metrics.go: Added a new metric LoraRequestInfoWaitingAdaptersMetricName and updated the promToPodMetrics and getLatestLoraMetric functions to handle waiting adapters. When there are no current running or waiting adapters, the previously observed running + waiting adapters are used instead (see the sketch below).
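
As a rough illustration of the fallback described above (not the repository's actual promToPodMetrics or getLatestLoraMetric code), here is a small self-contained Go sketch; the activeAdapters helper and its signature are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// activeAdapters merges the running and waiting adapter label values from the
// vllm:lora_requests_info metric into one set. If both labels are empty (no
// traffic on the pod), it falls back to the previously observed set, which
// approximates what is still loaded on the server.
func activeAdapters(running, waiting string, previous map[string]struct{}) map[string]struct{} {
	if running == "" && waiting == "" {
		return previous
	}
	adapters := make(map[string]struct{})
	for _, label := range []string{running, waiting} {
		for _, name := range strings.Split(label, ",") {
			if name = strings.TrimSpace(name); name != "" {
				adapters[name] = struct{}{}
			}
		}
	}
	return adapters
}

func main() {
	prev := map[string]struct{}{"tweet-summary-0": {}}
	fmt.Println(activeAdapters("tweet-summary-1", "", prev)) // uses the current labels
	fmt.Println(activeAdapters("", "", prev))                // falls back to the previous set
}
```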

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 27, 2025
@k8s-ci-robot
Contributor

Hi @kaushikmitr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 27, 2025

netlify bot commented Feb 27, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 41ec5b8
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67c27358596edb00081c730f
😎 Deploy Preview: https://deploy-preview-423--gateway-api-inference-extension.netlify.app

@ahg-g
Contributor

ahg-g commented Feb 27, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 27, 2025
Contributor

@ahg-g ahg-g left a comment

I didn't look at the algorithm change yet, left a couple of quick comments.

@@ -24,15 +24,23 @@ spec:
- "1"
- "--port"
- "8000"
- "--compilation-config"
Contributor

what does this do?

Contributor Author

we may not need this if using V0. It outputs the CUDA graph for optimization.

- "--lora-modules"
- '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}'
- '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}'
env:
- name: VLLM_USE_V1
value: "1"
Contributor

@ahg-g ahg-g Feb 27, 2025

The released vLLM version doesn't support our metrics yet, right? If so, then we can't use it now.

Contributor Author

Yes, that is why the tests are failing. I will switch back to V0

Contributor

I don't think that's it; the integration test doesn't use this deployment yaml.

I think the test is failing because this PR introduces some randomness to the selection.

Contributor Author


// Ignore metrics with both labels empty.
if running == "" && waiting == "" {
// continue
Contributor

commented out code

Contributor Author

this was a bug.

Contributor

@ahg-g ahg-g left a comment

The algorithm is not using the waiting_lora_adapters metric, right?

@@ -37,6 +37,7 @@ import (
const (
LoraRequestInfoMetricName = "vllm:lora_requests_info"
LoraRequestInfoRunningAdaptersMetricName = "running_lora_adapters"
LoraRequestInfoWaitingAdaptersMetricName = "waiting_lora_adapters"
LoraRequestInfoMaxAdaptersMetricName = "max_lora"
// TODO: Replace these with the num_tokens_running/waiting below once we add those to the fork.
Contributor

Can you clean up the TODOs and the metrics that are not currently used?

Contributor Author

I think the TODOs are still relevant. I will remove max token in KV Cache because it's not being used.

// The value of 50 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
Contributor

I think we should make this configurable perhaps via a flag for now. Different environments will likely need different thresholds.

Contributor Author

I would rather leverage this to make this configurable. #16

Contributor

I don't think we have time to do an API change for the next release. Given we already had to change it on different accelerator types, it's important to have this knob configurable. Exposing it as a flag seems straightforward and gives us time to gather feedback before making an API change.

Contributor Author

I took a look; IIUC, adding this flag is not straightforward, the way the scheduler is written. If it's needed for the next release, I would rather have it in another PR.

Contributor

Defining a flag for each parameter is tedious; we can use a versioned configuration file instead. This is called ComponentConfig; ideally we do that for #383.

Here is JobSet's config file as an example: https://github.com/kubernetes-sigs/jobset/tree/main/api/config/v1alpha1
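
For comparison, a ComponentConfig-style approach could look roughly like the Go sketch below; the package, type, and field names are hypothetical illustrations, not an API proposed in this PR.

```go
// Package v1alpha1 sketches a hypothetical versioned configuration ("ComponentConfig")
// for the endpoint picker; none of these names are part of this PR.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EndpointPickerConfiguration is an illustrative ComponentConfig-style object
// that would let operators tune the scheduler knobs discussed in this thread.
type EndpointPickerConfiguration struct {
	metav1.TypeMeta `json:",inline"`

	// QueueingThresholdLoRA mirrors the queueingThresholdLoRA constant tuned in this PR.
	QueueingThresholdLoRA int `json:"queueingThresholdLoRA,omitempty"`

	// LoRAAffinityThreshold mirrors the loraAffinityThreshold constant added in this PR.
	LoRAAffinityThreshold float64 `json:"loraAffinityThreshold,omitempty"`
}
```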

// Returns:
// - Filtered slice of pod metrics based on affinity and availability
// - Error if any issues occur during filtering
func loRASoftAffinityPredicate(logger logr.Logger, req *LLMRequest, pods []*datastore.PodMetrics) ([]*datastore.PodMetrics, error) {
Contributor

This is not a predicate; it is a filter, according to the current filter and predicate interface definitions.

Contributor Author

agreed

// Categorize pods based on affinity and availability
for _, pod := range pods {
if pod == nil {
continue
Contributor

pls add a warning log here and state that this should never happen

Contributor Author

removed this, as this scenario is captured already upstream


if _, exists := pod.ActiveModels[req.ResolvedTargetModel]; exists {
filtered_affinity = append(filtered_affinity, pod)
} else if len(pod.ActiveModels) < pod.MaxActiveModels {
Contributor

This is essentially the canAcceptNewLoraPredicate function below; are we still using canAcceptNewLoraPredicate?

Contributor Author

We are not using canAcceptNewLoraPredicate any more, but it would be good to keep, I think.

}

// Use crypto/rand for better randomization in production environments
randSource := rand.NewSource(time.Now().UnixNano())
Contributor

This can be a follow-up, but it sounds like we can extend the current filter framework to support such probability-based filtering. So instead of having one base filter, we have a list of filters with weights. This way we can keep each filter very focused and make them more reusable.
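
A rough sketch of what such weight-based composition could look like is below; the weightedFilter type, pickFilter helper, and the placeholder filters are purely hypothetical and not part of the existing filter framework.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pod and filterFunc are simplified stand-ins for the scheduler's types.
type pod struct{ Name string }

type filterFunc func(pods []*pod) []*pod

// weightedFilter pairs a filter with a selection weight.
type weightedFilter struct {
	weight float64
	filter filterFunc
}

// pickFilter chooses one filter from the list with probability proportional
// to its weight, so focused filters can be composed probabilistically.
func pickFilter(rng *rand.Rand, filters []weightedFilter) filterFunc {
	total := 0.0
	for _, wf := range filters {
		total += wf.weight
	}
	r := rng.Float64() * total
	for _, wf := range filters {
		if r < wf.weight {
			return wf.filter
		}
		r -= wf.weight
	}
	return filters[len(filters)-1].filter
}

func main() {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	affinityOnly := func(pods []*pod) []*pod { return pods[:1] }  // placeholder filter
	availableOnly := func(pods []*pod) []*pod { return pods[1:] } // placeholder filter
	chosen := pickFilter(rng, []weightedFilter{
		{weight: 0.999, filter: affinityOnly},
		{weight: 0.001, filter: availableOnly},
	})
	for _, p := range chosen([]*pod{{Name: "pod-a"}, {Name: "pod-b"}}) {
		fmt.Println("selected:", p.Name)
	}
}
```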

Contributor Author

agreed

queueingThresholdLoRA = 128
// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable.
// loraAffinityThreshold indicates the probability with which we prefer a pod with LoRA affinity over a pod without but having room to fit more LoRA adapters.
loraAffinityThreshold = 0.999
Contributor

Do you have some insights to show why this is needed and why this value was picked?

Contributor Author

I picked it after some trial and error. This value worked well when we had skewed traffic across different adapters; it helped spread out high-QPS adapters while keeping low-QPS adapters less spread out.

@@ -37,6 +37,7 @@ import (
const (
LoraRequestInfoMetricName = "vllm:lora_requests_info"
LoraRequestInfoRunningAdaptersMetricName = "running_lora_adapters"
LoraRequestInfoWaitingAdaptersMetricName = "waiting_lora_adapters"
Contributor

On one hand, I can see why considering waiting is useful, because waiting LoRAs are going to be served next. However, I have concerns about this weakening the LoRA affinity: running is bounded by max_lora, while waiting is not. If we enter an unstable state with a long waiting queue, we can lose the affinity benefit.

An improved algorithm could prioritize waiting over running; what do you think?

Contributor Author

@kaushikmitr kaushikmitr Feb 28, 2025

Using waiting + running for affinity is always superior to using just running, because adapters get served on a first-come, first-served basis, so we know that a waiting adapter, if not already loaded, will definitely get loaded. But yes, within waiting + running, prioritizing waiting over running makes sense, I think; it needs testing first.

@kaushikmitr
Contributor Author

The algorithm is not using the waiting_lora_adapters metric, right?

It is; we are now checking both waiting + running to determine affinity.

@kaushikmitr
Contributor Author

/retest

@@ -47,3 +47,5 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
requested adapter. Example: `"max_lora": "8"`.
* `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
* `waiting_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
memory and ready to serve requests. Example: `"waiting_lora_adapters": "adapter1, adapter2"`
Contributor

Update the docs; this reads exactly the same as the running one.


// Ignore metrics with both labels empty.
if running == "" && waiting == "" {
continue
Contributor

Does this happen in practice? In what cases?

Contributor Author

When there are no running requests, that is, QPS = 0 for the pod. In this case it's good to use the last adapters that ran to determine affinity (as a proxy for the LoRA metric of what's already loaded).

}
}

// Ignore metrics with both labels empty.
Contributor

Suggested change
// Ignore metrics with both labels empty.
// Ignore metrics with both labels empty. This happens when there are no running or waiting requests on
// the server, in this case it is best to use the last set of active adapters.

@ahg-g
Contributor

ahg-g commented Mar 1, 2025

/approve
/hold

Thanks a lot, this is great, leaving it to Cong to lgtm.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kaushikmitr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2025