
How to get GPU memory footprints when using distributed inference? #491

Answered by WoosukKwon
HermitSun asked this question in Q&A

Hi @HermitSun, thanks for trying out vLLM, and good question. When using multiple GPUs, vLLM creates one worker process per GPU. Thus, if you use 2 GPUs, there are 3 processes in total, and the process running your code does not directly use any GPU. To actually get the numbers, you will need to insert torch.cuda.memory_allocated() inside the Worker class or in the model code.
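For illustration, here is a minimal sketch of what such instrumentation could look like; the helper name log_gpu_memory and where you call it from are assumptions, not part of vLLM. You would drop something like this into the Worker class or the model's forward pass:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print the memory footprint of the GPU visible to this worker process."""
    device = torch.cuda.current_device()
    # Bytes currently occupied by live tensors on this device.
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
    # Bytes held by PyTorch's caching allocator (>= allocated).
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
    print(f"[{tag}] device {device}: {allocated:.1f} MiB allocated, "
          f"{reserved:.1f} MiB reserved")
```

Because each worker process only sees its own GPU, calling this inside the worker reports that worker's footprint; the driver process will report ~0.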

BTW, you can configure vLLM's GPU memory usage via the gpu_memory_utilization argument of the LLM class.
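For example, a usage sketch (the model name is just a placeholder) that runs on 2 GPUs while capping vLLM at roughly 80% of each GPU's memory:

```python
from vllm import LLM

# tensor_parallel_size=2 launches one worker per GPU;
# gpu_memory_utilization caps the fraction of each GPU vLLM will use.
llm = LLM(
    model="facebook/opt-13b",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)
```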

Answer selected by zhuohan123
This discussion was converted from issue #410 on July 18, 2023.