How to get GPU memory footprints when using distributed inference? #491
-
I want to observe the GPU memory footprint of models when performing inference. When I perform inference on a single GPU, should I use `torch.cuda.memory_allocated()`? And how should I measure the per-GPU footprint when using distributed inference across multiple GPUs? Any help would be appreciated.
Replies: 2 comments
-
Hi @HermitSun, thanks for trying out vLLM and good question. When using multiple GPUs, vLLM creates 1 worker process per GPU. Thus, if you use 2 GPUs, there will be 3 processes in total and the process running your code will not directly use any GPU. To actually get the number, you will need to insert `torch.cuda.memory_allocated` inside the `Worker` class or in the model code. BTW, you can configure vLLM's GPU memory usage via the `gpu_memory_utilization` argument in the `LLM` class.
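
A minimal sketch of both suggestions, assuming you paste the logging helper into the `Worker` class (or the model code) yourself; the helper name, model name, and sizes below are placeholders, not part of vLLM:

```python
import torch
from vllm import LLM

def log_gpu_memory(tag: str = "") -> None:
    """Print this process's CUDA memory; call it from inside the Worker/model code."""
    allocated = torch.cuda.memory_allocated() / 1024 ** 3  # GiB held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024 ** 3    # GiB reserved by the caching allocator
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# The `gpu_memory_utilization` argument caps how much of each GPU vLLM will use.
llm = LLM(
    model="facebook/opt-13b",      # placeholder model
    tensor_parallel_size=2,        # 2 worker processes, one per GPU
    gpu_memory_utilization=0.8,    # use at most ~80% of each GPU's memory
)
```

Note that calling the helper from the driver process will report near-zero memory, since the driver does not directly use any GPU when tensor parallelism is enabled.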
-
Thank you for your kind reply. After inserting the code inside the `Worker` class, I was able to observe the GPU memory usage. Maybe we can provide some profiling hooks or decorators, if possible.
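
A hypothetical sketch of such a hook, written as a simple decorator around a worker or model method; `profile_gpu_memory` is not an existing vLLM API, just an illustration of the idea:

```python
import functools
import torch

def profile_gpu_memory(fn):
    """Report how much CUDA memory a call allocates in the current process (illustrative only)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        torch.cuda.synchronize()
        before = torch.cuda.memory_allocated()
        result = fn(*args, **kwargs)
        torch.cuda.synchronize()
        after = torch.cuda.memory_allocated()
        print(f"{fn.__name__}: {(after - before) / 1024 ** 2:+.1f} MiB allocated")
        return result
    return wrapper
```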