[Feature]: Exit on failures #7633

pseudotensor · 2024-08-18T04:17:19Z

🚀 The feature, motivation and pitch

vLLM commonly crashes or fails at the server/engine level, after which no more requests can be served. In some cases the entire server stops, and (say, in Docker) one can auto-restart to recover from such rare failures. Many times, however, the failure leaves the server in a mixed state, and one cannot easily tell whether the server is healthy.

Sometimes even the health API says it's healthy, when it's not.

Alternatives

Per-user crafted solutions that restart the server after detecting unhealthy API behavior.
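A minimal sketch of what such a user-side check might look like, assuming the server exposes `/health` and `/v1/models` over HTTP. The helper names (`http_ok`, `server_is_healthy`, `should_restart`) and the failure threshold are hypothetical and not part of vLLM. Probing an API route in addition to `/health` is meant to catch the case where the health endpoint still answers while the API does not.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200 within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def server_is_healthy(
    base_url: str,
    probe: Callable[[str], bool] = http_ok,
    paths: Iterable[str] = ("/health", "/v1/models"),  # assumed endpoints
) -> bool:
    """Healthy only if every probed path answers; /health alone is not trusted."""
    base = base_url.rstrip("/")
    return all(probe(base + path) for path in paths)


def should_restart(results: Iterable[bool], max_failures: int = 3) -> bool:
    """True once `max_failures` consecutive health checks have failed."""
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= max_failures:
            return True
    return False
```

A supervisor loop could call `server_is_healthy` on an interval, feed the results to `should_restart`, and then exit non-zero or restart the container. If the server requires an API key, the `/v1/models` probe would need an authorization header, which this sketch omits.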

Additional context

Example of the server being left in an ambiguous state, where /v1 can't be reached but the server has not shut down: #7632

ywang96 · 2024-08-18T07:30:24Z

I think #6594 should have solved this issue. Have you observed this issue happen from the latest main?

pseudotensor · 2024-08-18T07:49:59Z

Oh cool, I didn't see that PR. I'll try.

joe-schwartz-certara · 2024-09-24T15:11:56Z

I'm still observing the async engine dead error without the server going down. I don't think #6594 fully fixed this. I'm using vllm 0.6.0 with a variety of models and GPU configurations. I still see the async engine dead error from time to time, and I have to manually inspect the logs and restart to restore the prod environments.

pseudotensor added the "feature request" label Aug 18, 2024

pseudotensor closed this as completed Aug 18, 2024
