How to Understand and Maximize the Use of max_num_batched_tokens #2509
See my explanation in the other thread: #2492
So the key point is balancing the prefill stage and the decoding stage? The larger the max_num_batched_tokens setting, the larger the batch during the prefill stage, but the fewer cache blocks are left for the decoding stage. Is that right? Or is this parameter unrelated to the decoding stage? I will try a benchmarking tool to test the performance of different settings.
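For reference, here is a minimal offline benchmarking sketch along those lines, assuming vLLM's Python `LLM`/`SamplingParams` API; the model name, prompts, and tested values are placeholders rather than recommendations.

```python
# Minimal sketch: measure generation throughput for a few candidate
# max_num_batched_tokens values. Model, prompts, and values are placeholders;
# in practice run each setting in a separate process, since GPU memory is
# not fully released between LLM instances.
import time

from vllm import LLM, SamplingParams

PROMPTS = ["Explain how paged attention manages the KV cache."] * 256
PARAMS = SamplingParams(temperature=0.0, max_tokens=128)

for max_batched in (4096, 8192, 20480):
    llm = LLM(
        model="facebook/opt-1.3b",           # placeholder model
        max_num_batched_tokens=max_batched,  # value under test
        gpu_memory_utilization=0.90,
    )
    start = time.time()
    outputs = llm.generate(PROMPTS, PARAMS)
    elapsed = time.time() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_num_batched_tokens={max_batched}: {tokens / elapsed:.1f} tok/s")
```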
Have you solved your issue? @zhanghx0905
I am currently running the Yi34B model on four A10 (24 GB) GPUs with AWQ 4-bit quantization. My goal is to deploy the model in a production environment, maximizing memory utilization and throughput without overwhelming the system under high load.

Through repeated experiments, I've found that setting max_num_batched_tokens seems to be the only way to meet my requirements. When I set max_num_batched_tokens=50000, there are 758 GPU blocks, and after the model is loaded each GPU uses only about half of its available VRAM. If I increase this value further, the model fails to load. When I set max_num_batched_tokens=20480, there are 4485 GPU blocks. That's confusing, and I wonder what the exact meaning of max_num_batched_tokens is.

I'm seeking advice on how to balance maximizing throughput with keeping the model stable under these circumstances. What would be the best approach?

I'm using the latest version, v0.2.7.
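For context, below is a sketch of roughly the configuration described above, using vLLM's offline `LLM` API; the checkpoint name is an assumption and this is not a verified production config. As I understand it, vLLM profiles a forward pass of up to max_num_batched_tokens tokens when sizing the KV cache, which would explain why a larger value leaves fewer GPU blocks.

```python
# Sketch of the setup described above (Yi34B, AWQ 4-bit, four A10 GPUs).
# The checkpoint name is an assumption; adjust to your actual AWQ model path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="01-ai/Yi-34B-Chat-4bits",   # assumed AWQ-quantized checkpoint
    quantization="awq",                # 4-bit AWQ weights
    tensor_parallel_size=4,            # four A10 (24 GB) GPUs
    max_num_batched_tokens=20480,      # the value that reported 4485 GPU blocks
    gpu_memory_utilization=0.90,       # fraction of each GPU's VRAM vLLM may claim
)

out = llm.generate(["Hello, Yi."], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```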