[Bug]: Later versions show a degradation based on the vllm:time_to_first_token_seconds_sum metric
#8819
Comments
Hey @oandreeva-nv! This looks strange. However, for your case I would suggest trying again with a more recent release. Hope it helps. 😎
We bisected the codebase and found that version 0.5.3 had an inaccurate metric calculation; that's why, once the bug was fixed, you observed a significant increase in the reported metric. This is the "culprit":
It seems there is a related bug; please see #6337. We also confirmed that the results from version 0.4.2 were similar to those after version 0.5.4. Could you please help verify whether this behavior aligns with your expectations? @oandreeva-nv
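For reference, this kind of regression can be narrowed down with a simple sweep over released versions. The sketch below is hypothetical (the version list, port, and wait time are assumptions, not the procedure actually used here):

```bash
# Hypothetical version sweep to locate where the metric value changed.
# Assumes the standard OpenAI-compatible entrypoint serving on port 8000.
for v in 0.5.2 0.5.3 0.5.3.post1 0.5.4; do
  pip install -q "vllm==${v}"
  python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
  pid=$!
  sleep 60  # crude wait for the server to come up
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 16}' > /dev/null
  printf 'vllm %s: ' "${v}"
  curl -s http://localhost:8000/metrics | grep '^vllm:time_to_first_token_seconds_sum'
  kill "${pid}"; wait "${pid}" 2>/dev/null
done
```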
Thanks @elfiegg for finding this. Yes, I believe I'm all set now, and I've closed the issue.
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
I've noticed a degradation after vLLM v0.5.3.post1. For example, take a simple model, facebook/opt-125m: start the server, send a request, and query the metrics (the exact curl commands are in [Edit 1] below).
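The exact launch command wasn't preserved in this report; a typical invocation (default port 8000 and no extra flags are assumptions) looks like:

```bash
# Hypothetical server launch via the OpenAI-compatible entrypoint.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
```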
Now, the same process with vLLM version 0.6.1.post2 gives these metrics. The slowdown is quite significant, in my understanding, at least according to vllm:time_to_first_token_seconds_sum: 9.322166442871094e-05 vs. 0.034735918045043945. Any recommendation on this?
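For context, the counter in question appears in the Prometheus /metrics output roughly as follows (the labels and count are illustrative; the sum is the value quoted above for 0.6.1.post2):

```
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_sum{model_name="facebook/opt-125m"} 0.034735918045043945
vllm:time_to_first_token_seconds_count{model_name="facebook/opt-125m"} 1.0
```

Since this is a histogram, the mean TTFT per request is the `_sum` divided by the matching `_count`.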
[Edit 1] To send a request I used curl, and likewise to query the metrics:
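The original commands weren't captured here; a minimal equivalent (the endpoint path and payload fields are assumptions) is:

```bash
# Send one completion request to the running server.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'

# Scrape the Prometheus endpoint and filter the TTFT histogram lines.
curl -s http://localhost:8000/metrics | grep 'vllm:time_to_first_token_seconds'
```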