
How to Understand and Maximize the Use of max_num_batched_tokens #2509

Closed
zhanghx0905 opened this issue Jan 19, 2024 · 3 comments

@zhanghx0905
I am currently running the Yi34B model on four A10 (24G) GPUs, utilizing AWQ 4-bit quantization. My goal is to deploy the model in a production environment, maximizing memory utilization and throughput without overwhelming the system with high loads.

Through repeated experiments, I've found that setting max_num_batched_tokens seems to be the only way to meet my requirements.

When I set max_num_batched_tokens=50000, the number of GPU blocks is 758, and after the model has loaded, each GPU uses only about half of its available VRAM. If I increase this value any further, the model fails to load.

{
    "swap_space": 4,
    "tensor_parallel_size": 4,
    "max_num_batched_tokens": 50000
}
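
For reference, this is roughly how that configuration maps onto vLLM's offline LLM API (a sketch; the model path is a placeholder for my AWQ checkpoint):

from vllm import LLM, SamplingParams

# Sketch of the setup above; the model path is a placeholder for the
# AWQ-quantized Yi-34B checkpoint, the other arguments mirror the JSON config.
llm = LLM(
    model="path/to/Yi-34B-AWQ",    # placeholder checkpoint path
    quantization="awq",            # 4-bit AWQ weights
    tensor_parallel_size=4,        # shard across the four A10 GPUs
    swap_space=4,                  # GiB of CPU swap space per GPU
    max_num_batched_tokens=50000,  # cap on tokens scheduled per engine step
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)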

When I set max_num_batched_tokens=20480, the number of GPU blocks is 4485. That's confusing, and I wonder what max_num_batched_tokens actually means.
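
If I convert the reported block counts into KV-cache token capacity (a back-of-the-envelope sketch, assuming vLLM's default block_size of 16 tokens per block), the gap is striking:

# Rough conversion from GPU blocks to KV-cache token capacity,
# assuming vLLM's default block_size of 16 tokens per block.
BLOCK_SIZE = 16

for max_num_batched_tokens, gpu_blocks in [(50000, 758), (20480, 4485)]:
    kv_cache_tokens = gpu_blocks * BLOCK_SIZE
    print(f"max_num_batched_tokens={max_num_batched_tokens:>6}: "
          f"{gpu_blocks} blocks -> ~{kv_cache_tokens} tokens of KV cache")

# max_num_batched_tokens= 50000:  758 blocks -> ~12128 tokens of KV cache
# max_num_batched_tokens= 20480: 4485 blocks -> ~71760 tokens of KV cache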

I'm seeking advice on how to balance maximizing throughput while ensuring the stability of the model under these circumstances. What would be the best approach to achieve this?

I'm using the latest version, v0.2.7.

@zhanghx0905 zhanghx0905 changed the title from "Optimizing Yi34B Model Performance on A10 GPUs with AWQ 4-bit Quantization" to "How to Understand and Maximize the Use of max_num_batched_tokens" on Jan 20, 2024
@JasonZhu1313
Contributor

See my explanation in the other thread: #2492

@zhanghx0905
Author

See my explanation in the other thread: #2492

So the key point is balancing the prefill stage and the decoding stage? The larger the max_num_batched_tokens setting, the larger the batch during the prefill stage, but the fewer cache blocks are left for the decoding stage, is that right? So it seems this parameter is unrelated to the decoding stage itself?

I will try using a benchmark tool to test the performance of different options.
My application scenario is RAG, and the number of input tokens is generally large.
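
Something like the sketch below is what I have in mind: long synthetic prompts standing in for RAG inputs, with each candidate value run in a fresh process (the prompt lengths and candidate values are placeholders, not a real benchmark):

import time

from vllm import LLM, SamplingParams

# Sketch of a throughput comparison for long, RAG-like prompts.
# Prompt lengths and candidate values are illustrative placeholders;
# in practice, run each configuration in a fresh process.
MAX_NUM_BATCHED_TOKENS = 20480   # swap in each candidate value per run

prompts = ["some retrieved context " * 500] * 32   # 32 long synthetic prompts
sampling = SamplingParams(max_tokens=128)

llm = LLM(
    model="path/to/Yi-34B-AWQ",    # placeholder checkpoint path
    quantization="awq",
    tensor_parallel_size=4,
    max_num_batched_tokens=MAX_NUM_BATCHED_TOKENS,
)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max_num_batched_tokens={MAX_NUM_BATCHED_TOKENS}: "
      f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")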

@hmellor hmellor closed this as completed Apr 4, 2024
@sir3mat

sir3mat commented Dec 19, 2024

Have you solved your issue, @zhanghx0905?

