
fix: binary search for best context length avoiding oom #705

Merged: 8 commits into kaito-project:main on Nov 20, 2024

Conversation

@zhuangqh (Collaborator) commented on Nov 19, 2024

Reason for Change:

Use a binary search to find the best (largest safe) context length, avoiding OOM.

Issue Fixed:

`find_max_available_seq_len` runs out of memory (OOM) when running
on a V100 16GB GPU with a 128K context.

Notes for Reviewers:

In the worst case, it takes about one minute to find the best
length (running the Phi-3 medium model with a 128K search space).
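For context, the search is a standard monotonic binary search: once a length OOMs, every larger length OOMs as well, so a 128K search space needs at most ~17 probes, which matches the roughly one-minute worst case. Below is a minimal sketch of the idea; the `probe` callback, the default bounds, and the demo threshold are illustrative placeholders, not the actual check this PR performs (the real code has to load the model and detect OOM for each candidate length):

```python
def find_max_available_seq_len(probe, low: int = 1, high: int = 131072) -> int:
    """Return the largest length in [low, high] for which probe(length)
    is True, or 0 if none fits.

    Assumes probe is monotonic: once a length OOMs (probe returns
    False), every larger length OOMs as well.
    """
    best = 0
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid        # mid fits in memory; try a larger length
            low = mid + 1
        else:
            high = mid - 1    # mid OOMs; search smaller lengths
    return best


if __name__ == "__main__":
    # Fake probe for illustration: pretend anything above 12000 tokens OOMs.
    print(find_max_available_seq_len(lambda n: n <= 12000))  # -> 12000
```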

We set the context length to a safe value to avoid OOM.
If the serving server receives a request whose token length is longer
than `max_model_len`, the server will reject it.

Example error message: This model's maximum context length is 2 tokens. However, you requested 19 tokens (9 in the messages, 10 in the completion). Please reduce the length of the messages or completion.

Signed-off-by: jerryzhuang <zhuangqhc@gmail.com>
@zhuangqh merged commit 1517106 into kaito-project:main on Nov 20, 2024
4 of 8 checks passed
@zhuangqh mentioned this pull request on Nov 26, 2024