
How to Understand and Maximize the Use of max_num_batched_tokens #2509

Closed
zhanghx0905 opened this issue Jan 19, 2024 · 3 comments

@zhanghx0905
I am currently running the Yi34B model on four A10 (24G) GPUs, utilizing AWQ 4-bit quantization. My goal is to deploy the model in a production environment, maximizing memory utilization and throughput without overwhelming the system with high loads.

Through repeated experiments, I've found that setting max_num_batched_tokens seems to be the only way to meet my requirements.

When I set max_num_batched_tokens=50000, the number of GPU blocks is 758, and after the model has loaded, each GPU uses only about half of its available VRAM. If I increase this value any further, the model fails to load.

{
    "swap_space": 4,
    "tensor_parallel_size": 4,
    "max_num_batched_tokens": 50000
}
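
For reference, this is roughly how that configuration maps onto vLLM's offline LLM API (a sketch; the model path is a placeholder for my AWQ checkpoint):

from vllm import LLM, SamplingParams

# Sketch of the setup above; the model path is a placeholder for the
# AWQ-quantized Yi-34B checkpoint, the other arguments mirror the JSON config.
llm = LLM(
    model="path/to/Yi-34B-AWQ",    # placeholder checkpoint path
    quantization="awq",            # 4-bit AWQ weights
    tensor_parallel_size=4,        # shard across the four A10 GPUs
    swap_space=4,                  # GiB of CPU swap space per GPU
    max_num_batched_tokens=50000,  # cap on tokens scheduled per engine step
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)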

When I set max_num_batched_tokens=20480, the number of GPU blocks is 4485. That's confusing, and I wonder what max_num_batched_tokens actually means.
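
If I convert the reported block counts into KV-cache token capacity (a back-of-the-envelope sketch, assuming vLLM's default block_size of 16 tokens per block), the gap is striking:

# Rough conversion from GPU blocks to KV-cache token capacity,
# assuming vLLM's default block_size of 16 tokens per block.
BLOCK_SIZE = 16

for max_num_batched_tokens, gpu_blocks in [(50000, 758), (20480, 4485)]:
    kv_cache_tokens = gpu_blocks * BLOCK_SIZE
    print(f"max_num_batched_tokens={max_num_batched_tokens:>6}: "
          f"{gpu_blocks} blocks -> ~{kv_cache_tokens} tokens of KV cache")

# max_num_batched_tokens= 50000:  758 blocks -> ~12128 tokens of KV cache
# max_num_batched_tokens= 20480: 4485 blocks -> ~71760 tokens of KV cache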

I'm seeking advice on how to balance maximizing throughput while ensuring the stability of the model under these circumstances. What would be the best approach to achieve this?

I'm using the latest version, v0.2.7.

@zhanghx0905 zhanghx0905 changed the title from "Optimizing Yi34B Model Performance on A10 GPUs with AWQ 4-bit Quantization" to "How to Understand and Maximize the Use of max_num_batched_tokens" on Jan 20, 2024
@JasonZhu1313
Contributor

See my explanation in the other thread: #2492

@zhanghx0905
Author

See my explanation in the other thread: #2492

So the key point is balancing the prefill stage and the decoding stage? The larger the max_num_batched_tokens setting, the larger the batch during the prefill stage, but the fewer cache blocks are left for the decoding stage, is that right? So it seems this parameter is unrelated to the decoding stage itself?

I will try using a benchmark tool to test the performance of different options.
My application scenario is RAG, and the number of input tokens is generally large.
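
Something like the sketch below is what I have in mind: long synthetic prompts standing in for RAG inputs, with each candidate value run in a fresh process (the prompt lengths and candidate values are placeholders, not a real benchmark):

import time

from vllm import LLM, SamplingParams

# Sketch of a throughput comparison for long, RAG-like prompts.
# Prompt lengths and candidate values are illustrative placeholders;
# in practice, run each configuration in a fresh process.
MAX_NUM_BATCHED_TOKENS = 20480   # swap in each candidate value per run

prompts = ["some retrieved context " * 500] * 32   # 32 long synthetic prompts
sampling = SamplingParams(max_tokens=128)

llm = LLM(
    model="path/to/Yi-34B-AWQ",    # placeholder checkpoint path
    quantization="awq",
    tensor_parallel_size=4,
    max_num_batched_tokens=MAX_NUM_BATCHED_TOKENS,
)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max_num_batched_tokens={MAX_NUM_BATCHED_TOKENS}: "
      f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")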

@hmellor hmellor closed this as completed Apr 4, 2024
@sir3mat

sir3mat commented Dec 19, 2024

Have you solved your issue, @zhanghx0905?

