[MISC] Add lora requests to metrics #9477

coolkp · 2024-10-17T22:39:30Z

This PR adds LoRA request information to the logged metrics. The metric is logged only when LoRA is enabled. Here is an example:

# HELP vllm:lora_requests_info Running stats on lora requests waiting and under process.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="1",running_adapters="",waiting_adapters=""} 1.0

We plan to leverage this information for routing decisions in https://github.com/kubernetes-sigs/llm-instance-gateway
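As a rough illustration of the routing use case, a consumer could scrape the metrics endpoint and read the adapter labels from the sample line shown above. This is a minimal sketch, not the gateway's actual code: the helper name `parse_lora_info` is hypothetical, the label names are taken from the example output in this description, and it assumes adapter names contain no commas, quotes, or braces.

```python
import re
from typing import Dict, List, Optional, Union

# Matches one sample line of the gauge; label names follow the example above.
_SAMPLE_RE = re.compile(
    r'^vllm:lora_requests_info\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def _split_adapters(value: str) -> List[str]:
    # An empty label value means "no adapters", not one adapter named "".
    return [name for name in value.split(",") if name]


def parse_lora_info(
        metrics_text: str) -> Optional[Dict[str, Union[int, List[str]]]]:
    """Return the LoRA labels of the first matching sample line, or None."""
    for line in metrics_text.splitlines():
        match = _SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # skips '# HELP', '# TYPE', and unrelated metrics
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        return {
            "max_lora": int(labels.get("max_lora", "0")),
            "running": _split_adapters(labels.get("running_adapters", "")),
            "waiting": _split_adapters(labels.get("waiting_adapters", "")),
        }
    return None
```

A router could then prefer servers whose `running` list already contains the requested adapter, or whose adapter count is below `max_lora`.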

github-actions · 2024-10-17T22:39:43Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

- Add ready label to the PR
- Enable auto-merge.

🚀

comaniac

Otherwise LGTM

vllm/engine/metrics.py

vllm/engine/llm_engine.py

comaniac · 2024-10-17T23:17:47Z

vllm/engine/llm_engine.py

+        max_lora_stat = "0"
+        if self.lora_config:
+            max_lora_stat = str(self.lora_config.max_loras)


This value seems to always be fixed? In that case, can we avoid dumping it?

Across multiple deployments it's hard to get this value, and it helps determine how many LoRAs can fit on the server. You are right, it's definitely static right now: it is initialised at runtime and that's it. I considered moving it to a separate info metric like the cache config info. But I think in the future there may be value in enabling dynamic adjustment of max lora, like base_model, which is static right now.
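To make the discussion concrete, here is a minimal sketch of how the three label values could be assembled, reusing the `max_lora_stat` fallback from the diff above. The names `LoRAConfigSketch` and `lora_info_labels` are hypothetical, and passing adapter names as plain lists is an assumption for illustration; this is not the code merged in this PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class LoRAConfigSketch:
    """Hypothetical stand-in for the engine's LoRA config."""
    max_loras: int


def lora_info_labels(lora_config: Optional[LoRAConfigSketch],
                     running: Sequence[str],
                     waiting: Sequence[str]) -> Dict[str, str]:
    # Same fallback as in the diff: "0" when LoRA is not configured.
    max_lora_stat = "0"
    if lora_config:
        max_lora_stat = str(lora_config.max_loras)
    return {
        "max_lora": max_lora_stat,
        # dict.fromkeys de-duplicates adapter names while keeping their order.
        "running_adapters": ",".join(dict.fromkeys(running)),
        "waiting_adapters": ",".join(dict.fromkeys(waiting)),
    }
```

Because `max_lora` only changes with the config, it stays constant across scrapes, which is the trade-off raised in the review comment.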

comaniac

LGTM

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: charlifu <charlifu@amd.com>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: Vinay Damodaran <vrdn@hey.com>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: Alvant <alvasian@yandex.ru>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: Amit Garg <mitgarg17495@gmail.com>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: qishuai <ferdinandzhong@gmail.com>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

The current `vllm:lora_requests_info` Gauge is somewhat similar to an Info metric (like cache_config_info) except the value is the current wall-clock time, and is updated every iteration.

The label names used are:

- running_lora_adapters: a per-adapter count of the number of requests running using that adapter, formatted as a comma-separated string.
- waiting_lora_adapters: similar, except counting requests that are waiting to be scheduled.
- max_lora - the static "max number of LoRAs in a single batch." configuration.

It looks like this:

```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

I can't really make much sense of this. Encoding a running/waiting status for multiple adapters in a comma-separated string seems quite misguided - we should use labels to distinguish between per-adapter counts instead:

```
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
```

This was added in vllm-project#9477 and there is at least one known user. If we revisit this design and deprecate the old metric, we should reduce the need for a significant deprecation period by making the change in v0 also and asking this project to move to the new metric.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

The current `vllm:lora_requests_info` Gauge is somewhat similar to an Info metric (like cache_config_info) except the value is the current wall-clock time, and is updated every iteration.

The label names used are:

- running_lora_adapters: a list of adapters with running requests, formatted as a comma-separated string.
- waiting_lora_adapters: similar, except listing adapters with requests waiting to be scheduled.
- max_lora - the static "max number of LoRAs in a single batch." configuration.

It looks like this:

```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```

I can't really make much sense of this. Encoding a running/waiting status for multiple adapters in a comma-separated string seems quite misguided - we should use labels to distinguish between per-adapter counts instead:

```
vllm:num_lora_requests_running{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
vllm:num_lora_requests_waiting{lora_name="test-lora",model_name="meta-llama/Llama-3.1-8B-Instruct"} 7.0
```

This was added in vllm-project#9477 and there is at least one known user. If we revisit this design and deprecate the old metric, we should reduce the need for a significant deprecation period by making the change in v0 also and asking this project to move to the new metric.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
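The per-adapter layout proposed in the commit message above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the implementation from the follow-up PR: the helper `render_lora_counts` is hypothetical, and it takes one adapter name per request as input.

```python
from collections import Counter
from typing import Iterable, List


def render_lora_counts(metric: str, model_name: str,
                       adapters: Iterable[str]) -> List[str]:
    """Render one exposition-style sample line per adapter.

    `adapters` holds one adapter name per request, so the Counter yields
    the number of requests using each adapter.
    """
    counts = Counter(adapters)
    return [
        f'{metric}{{lora_name="{name}",model_name="{model_name}"}} {float(count)}'
        for name, count in sorted(counts.items())
    ]
```

With one series per adapter, a consumer can filter or aggregate on `lora_name` directly instead of splitting a comma-separated label value.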

Kunjan Patel added 2 commits October 17, 2024 21:54

Add lora information metrics

80f57dc

Add lora information metrics

22758be

coolkp requested review from WoosukKwon, zhuohan123, youkaichao, alexm-redhat, comaniac and njhill as code owners October 17, 2024 22:39

Add lora information metrics formatting

1fae9e8

comaniac reviewed Oct 17, 2024


Kunjan Patel added 3 commits October 18, 2024 00:01

Add lora information metrics Resolve comments I

7541aa4

Add lora information metrics Resolve comments II

3b37cb4

Add lora information metrics Formatting

5e9418b

coolkp changed the title ~~[MISC] Add lora requestsd to metrics~~ [MISC] Add lora requests to metrics Oct 18, 2024

Formatting: sort imports

15e703b

coolkp requested a review from comaniac October 18, 2024 16:50

Kunjan Patel added 3 commits October 18, 2024 17:17

Add lora information metrics Formatting

5b63981

Add lora information metric, max-lora metric reason

c5ff381

Add lora information metric, max-lora metric reason

5bac26d

comaniac approved these changes Oct 18, 2024


comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 18, 2024

comaniac enabled auto-merge (squash) October 18, 2024 18:51

comaniac merged commit 9bb10a7 into vllm-project:main Oct 18, 2024
70 of 71 checks passed

charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024

[MISC] Add lora requests to metrics (vllm-project#9477)

2de931c

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal> Signed-off-by: charlifu <charlifu@amd.com>

liu-cong mentioned this pull request Nov 6, 2024

[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management #10086

Open

7 tasks

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[MISC] Add lora requests to metrics (vllm-project#9477)

a5360a3

Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>

markmc mentioned this pull request Feb 14, 2025

[WIP][Metrics] Re-work approach to LoRA metrics #13303

Draft

5 tasks

markmc mentioned this pull request Feb 18, 2025

Consider re-working the vLLM Gauge exposing the currently active LoRAs kubernetes-sigs/gateway-api-inference-extension#354

Open
