fix: binary search for best context length avoiding oom #705

zhuangqh · 2024-11-19T04:55:06Z

Reason for Change:

fix: binary search for best context length avoiding oom

Issue Fixed:

find_max_available_seq_len runs out of memory (OOM) when running
on a V100 16GB GPU with a 128K context.

Notes for Reviewers:

In the worst case, it takes about 1 minute to find the best
length (running the phi3 medium model with a 128K search space).
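A minimal sketch of the search idea, assuming a hypothetical probe fits(seq_len) that returns True when the model can run at that context length without OOM, and assuming the probe is monotone (if a length fits, every shorter length fits too). The function name and the max_steps cap are illustrative assumptions, not the actual find_max_available_seq_len code.

```python
from typing import Callable


def find_max_seq_len(
    fits: Callable[[int], bool], lo: int, hi: int, max_steps: int = 64
) -> int:
    """Return the largest n in [lo, hi] for which fits(n) is True.

    Hypothetical helper. Assumes fits is monotone: True up to some
    threshold, False above it. Returns lo - 1 if no length fits.
    """
    best = lo - 1  # largest length verified to fit so far
    steps = 0
    while lo <= hi and steps < max_steps:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid      # mid is safe; try longer lengths
            lo = mid + 1
        else:
            hi = mid - 1    # mid would OOM; try shorter lengths
        steps += 1
    # If max_steps stops the search early, best is still a length that
    # was actually probed and fit, so the result errs on the safe side.
    return best
```

Over a 131072-token (128K) search space this needs at most about 18 probes (log2(131072) = 17), instead of one probe per candidate length; since each probe is expensive, the probe count is what bounds the total search time.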

We set the context length to a safe value to avoid OOM.
If the serving server receives a request whose token length exceeds
max_model_len, the server rejects the request.

Example error message: This model's maximum context length is 2 tokens. However, you requested 19 tokens (9 in the messages, 10 in the completion). Please reduce the length of the messages or completion.
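A sketch of the rejection rule described above, assuming the server compares prompt tokens plus requested completion tokens against max_model_len (which matches the numbers in the example: 9 + 10 = 19 > 2). The function name is hypothetical, and the message text only reproduces the example error; this is not vLLM's source.

```python
from typing import Optional


def check_context_length(
    prompt_tokens: int, max_tokens: int, max_model_len: int
) -> Optional[str]:
    """Hypothetical admission check: None if the request fits, else an error message."""
    requested = prompt_tokens + max_tokens
    if requested <= max_model_len:
        return None  # request fits within the configured context length
    return (
        f"This model's maximum context length is {max_model_len} tokens. "
        f"However, you requested {requested} tokens "
        f"({prompt_tokens} in the messages, {max_tokens} in the completion). "
        "Please reduce the length of the messages or completion."
    )
```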

Signed-off-by: jerryzhuang <zhuangqhc@gmail.com>

presets/inference/vllm/inference_api.py

presets/inference/vllm/tests/test_vllm_inference_api.py

presets/inference/vllm/inference_api.py

fix: binary search for best context len avoiding oom

7dfc9ca

zhuangqh requested review from Fei-Guo, helayoty and ishaansehgal99 as code owners November 19, 2024 04:55

zhuangqh commented Nov 19, 2024

presets/inference/vllm/inference_api.py

ishaansehgal99 approved these changes Nov 19, 2024

ishaansehgal99 reviewed Nov 19, 2024

presets/inference/vllm/inference_api.py

limited step

56064c4

zhuangqh had a problem deploying to unit-tests November 19, 2024 09:59 — with GitHub Actions Failure

fix test

ab7ca17

zhuangqh temporarily deployed to unit-tests November 19, 2024 12:34 — with GitHub Actions Inactive

polish log

c113f96

zhuangqh temporarily deployed to unit-tests November 19, 2024 22:32 — with GitHub Actions Inactive

ishaansehgal99 reviewed Nov 19, 2024

presets/inference/vllm/tests/test_vllm_inference_api.py

ishaansehgal99 reviewed Nov 19, 2024

presets/inference/vllm/inference_api.py

ishaansehgal99 approved these changes Nov 19, 2024

Fei-Guo approved these changes Nov 19, 2024

zhuangqh added 3 commits November 20, 2024 10:38

add document

ba975dd

nit

4eff17e

nit

4a7b8c7

zhuangqh temporarily deployed to unit-tests November 19, 2024 23:48 — with GitHub Actions Inactive

nit

6f2cbce

zhuangqh temporarily deployed to unit-tests November 19, 2024 23:58 — with GitHub Actions Inactive

zhuangqh requested a review from ishaansehgal99 November 20, 2024 00:06

ishaansehgal99 approved these changes Nov 20, 2024

zhuangqh merged commit 1517106 into kaito-project:main Nov 20, 2024
4 of 8 checks passed

zhuangqh mentioned this pull request Nov 26, 2024

Support vllm runtime #608

Closed

zhuangqh had a problem deploying to preset-env December 19, 2024 23:58 — with GitHub Actions Failure

zhuangqh had a problem deploying to unit-tests December 19, 2024 23:58 — with GitHub Actions Failure

zhuangqh had a problem deploying to e2e-test December 19, 2024 23:58 — with GitHub Actions Failure
