max_num_batched_tokens and max_num_seqs values #2492

Closed
isRambler opened this issue Jan 18, 2024 · 5 comments

@isRambler

Hello, I am new to vLLM and want to know how to set max_num_batched_tokens and max_num_seqs to achieve maximum inference performance. What is the relationship between max_num_batched_tokens and max_num_seqs? Also, why can the total number of output tokens differ when I set different values for max_num_batched_tokens and max_num_seqs?

@JasonZhu1313
Contributor

max_num_batched_tokens and max_num_seqs essentially determine the batch size at the prefill stage - the first forward pass in which the model processes the prompt to predict the next token of a sequence. vLLM uses continuous batching to achieve high throughput. At each step, the underlying scheduler decides which samples to batch for inference based on the state of GPU memory and the availability of KV cache blocks. The scheduler separates requests into three queues: the waiting queue (sequences at the prefill stage), the running queue (sequences at the decoding stage), and the swapped queue (if CPU/GPU swapping is enabled).

I have drawn a diagram to better illustrate the scheduling workflow; you can see that max_num_batched_tokens and max_num_seqs determine the size of the batch taken from the waiting queue:

[Diagram: vLLM scheduling workflow, showing how max_num_batched_tokens and max_num_seqs bound the batch pulled from the waiting queue]

If you set max_num_batched_tokens or max_num_seqs to a low value, the prefill batch size will be small (e.g., 1), which may or may not hurt performance depending on the workload. There is no one-size-fits-all recommendation; you can tweak the prefill batch size through these two knobs and use benchmark_serving.py in vLLM to determine which settings give the best performance.
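
For reference, a minimal sketch of such a sweep using the offline LLM API (rather than benchmark_serving.py itself); the model name, prompt set, and candidate values below are placeholders, not recommendations:

import time
from vllm import LLM, SamplingParams

MODEL = "facebook/opt-125m"            # placeholder model, swap in your own
prompts = ["Hello, my name is"] * 256  # placeholder workload
params = SamplingParams(max_tokens=128)

# Sweep a few (max_num_seqs, max_num_batched_tokens) pairs and compare generation throughput.
for max_seqs, max_tokens in [(64, 2048), (256, 4096), (256, 8192)]:
    llm = LLM(
        model=MODEL,
        max_num_seqs=max_seqs,              # cap on sequences scheduled per engine step
        max_num_batched_tokens=max_tokens,  # cap on tokens scheduled per engine step
    )
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_num_seqs={max_seqs}, max_num_batched_tokens={max_tokens}: "
          f"{generated / elapsed:.1f} generated tokens/s")
    del llm  # in practice you may prefer one configuration per process run to fully release GPU memory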

@isRambler
Author

OK, thanks for the explanation

@banyip

banyip commented Jun 13, 2024


Is there a final answer on how to set these variables?

@realJaydenCheng


You can set these variables when initializing the LLM object:

from vllm import LLM

llm = LLM(
    model=LLM_PATH,                   # path or Hugging Face repo id of your model
    max_num_batched_tokens=512 * 50,  # upper bound on tokens scheduled per engine step
    max_model_len=512 * 50,           # maximum context length per sequence
    gpu_memory_utilization=0.3,       # fraction of GPU memory vLLM is allowed to use
)
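
max_num_seqs can be passed to LLM the same way (e.g. max_num_seqs=256 alongside max_num_batched_tokens). If you are running the OpenAI-compatible server instead, the same engine arguments should be available as the --max-num-seqs and --max-num-batched-tokens command-line flags.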

@ehartford

Can we have some kind of guidance or rule of thumb? How do we decide, practically speaking, what values to set in order to maximize performance?
