max_num_batched_tokens and max_num_seqs values #2492

Closed
isRambler opened this issue Jan 18, 2024 · 5 comments

@isRambler

Hello, I am new to vLLM and want to know how to set max_num_batched_tokens and max_num_seqs to achieve maximum inference performance. What is the relationship between max_num_batched_tokens and max_num_seqs? Also, why can the total number of output tokens differ when I set different values for max_num_batched_tokens and max_num_seqs?

@JasonZhu1313
Contributor

max_num_batched_tokens and max_num_seqs essentially determine the batch size at the prefill stage - the first forward pass in which the model processes the prompt to predict the next token of a sequence. vLLM uses continuous batching to achieve high throughput. At each step, the underlying scheduler decides which samples to batch for inference based on the state of GPU memory and the availability of KV cache blocks. The scheduler separates requests into three queues: the waiting queue (sequences at the prefill stage), the running queue (sequences at the decoding stage), and the swapped queue (if CPU/GPU swapping is enabled).

I have drawn a diagram to better illustrate the scheduling workflow; you can see that max_num_batched_tokens and max_num_seqs determine the size of the batch taken from the waiting queue:

[Diagram: vLLM scheduling workflow, showing how max_num_batched_tokens and max_num_seqs bound the batch pulled from the waiting queue]

If you set max_num_batched_tokens or max_num_seqs to a low value, the prefill batch size will be small (e.g., 1), which may or may not hurt performance depending on the workload. There is no one-size-fits-all recommendation; you can tweak the prefill batch size through these two knobs and use benchmark_serving.py in vLLM to determine which settings give the best performance.
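
For reference, a minimal sketch of such a sweep using the offline LLM API (rather than benchmark_serving.py itself); the model name, prompt set, and candidate values below are placeholders, not recommendations:

import time
from vllm import LLM, SamplingParams

MODEL = "facebook/opt-125m"            # placeholder model, swap in your own
prompts = ["Hello, my name is"] * 256  # placeholder workload
params = SamplingParams(max_tokens=128)

# Sweep a few (max_num_seqs, max_num_batched_tokens) pairs and compare generation throughput.
for max_seqs, max_tokens in [(64, 2048), (256, 4096), (256, 8192)]:
    llm = LLM(
        model=MODEL,
        max_num_seqs=max_seqs,              # cap on sequences scheduled per engine step
        max_num_batched_tokens=max_tokens,  # cap on tokens scheduled per engine step
    )
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_num_seqs={max_seqs}, max_num_batched_tokens={max_tokens}: "
          f"{generated / elapsed:.1f} generated tokens/s")
    del llm  # in practice you may prefer one configuration per process run to fully release GPU memory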

@isRambler
Author

OK, thanks for the explanation

@banyip

banyip commented Jun 13, 2024


Is there a final answer on how to set these variables?

@realJaydenCheng


You can set these variables when initializing the LLM object:

from vllm import LLM

llm = LLM(
    model=LLM_PATH,                   # path or Hugging Face repo id of your model
    max_num_batched_tokens=512 * 50,  # upper bound on tokens scheduled per engine step
    max_model_len=512 * 50,           # maximum context length per sequence
    gpu_memory_utilization=0.3,       # fraction of GPU memory vLLM is allowed to use
)
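
max_num_seqs can be passed to LLM the same way (e.g. max_num_seqs=256 alongside max_num_batched_tokens). If you are running the OpenAI-compatible server instead, the same engine arguments should be available as the --max-num-seqs and --max-num-batched-tokens command-line flags.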

@ehartford

Can we have some kind of guidance or rule of thumb? How do we decide, practically speaking, what values to set in order to maximize performance?
