max_num_batched_tokens and max_num_seqs values (#2492)
I have drawn a diagram to better illustrate the scheduling workflow, and you can see that if you set the …

OK, thanks for the explanation.

Is there a definitive answer on how to set these variables?
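To make the scheduling workflow concrete, here is a simplified sketch of how one scheduler step could respect both limits. This is an illustration only, not vLLM's actual scheduler code; the names (`waiting`, `running`, `prompt_len`, `schedule_step`) are invented for the example.

```python
# Simplified illustration of a per-step scheduling budget.
# Not vLLM internals; names and structure are made up for clarity.

def schedule_step(waiting, running, max_num_batched_tokens, max_num_seqs):
    """Pick sequences for one engine step without exceeding either budget."""
    batch = []
    token_budget = max_num_batched_tokens
    seq_budget = max_num_seqs

    # Running sequences are decoding: roughly one new token each per step.
    for seq in running:
        if token_budget >= 1 and seq_budget >= 1:
            batch.append(seq)
            token_budget -= 1
            seq_budget -= 1

    # Waiting sequences are in prefill: their whole prompt counts against
    # the token budget, so long prompts consume it quickly.
    for seq in waiting:
        if token_budget >= seq.prompt_len and seq_budget >= 1:
            batch.append(seq)
            token_budget -= seq.prompt_len
            seq_budget -= 1

    return batch
```

Under this view, max_num_batched_tokens caps the total tokens processed in one step, while max_num_seqs caps how many sequences can be in that step at once; whichever budget runs out first limits the batch.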
You can set these variables when initializing the LLM object:

```python
from vllm import LLM

llm = LLM(
    model=LLM_PATH,
    max_num_batched_tokens=512 * 50,
    max_model_len=512 * 50,
    gpu_memory_utilization=0.3,
)
```
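For completeness, max_num_seqs can be passed in the same constructor call. A minimal sketch follows; LLM_PATH and the numeric values are placeholders for illustration, not recommendations.

```python
from vllm import LLM

llm = LLM(
    model=LLM_PATH,               # placeholder model path
    max_model_len=4096,           # per-request context limit
    max_num_batched_tokens=8192,  # token budget for one scheduler step
    max_num_seqs=256,             # cap on sequences scheduled per step
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may reserve
)
```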
Can we have some kind of guidance or rule of thumb? How do we decide, practically speaking, what values to set in order to maximize performance?
Hello, I am new to vLLM and want to know how to set the max_num_batched_tokens and max_num_seqs values in order to achieve maximum inference performance. What is the relationship between max_num_batched_tokens and max_num_seqs? Also, why does the total number of output tokens differ when I set different max_num_batched_tokens and max_num_seqs values? The totals can be inconsistent.