Is your feature request related to a problem? Please describe.
We have some tools to monitor system metrics like CPU, I/O, and host memory usage, but they do not include GPU metrics.
We want to understand the GPU memory usage during the scale test (related to #8811).
An easy way is to launch a script that calls nvidia-smi to capture the GPU metrics every second or more frequently:
while true; do nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader,nounits >> gpu_usage.csv; sleep 1; done
Then we can further plot the data, map the usage to each query, etc.
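To make it easier to correlate samples with individual queries later, a slightly extended loop is sketched below; it assumes the same gpu_usage.csv output and uses only standard nvidia-smi query fields (timestamp, memory.used, memory.total):

# Sample once per second, including a timestamp and absolute memory counters.
while true; do
  nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv,noheader,nounits >> gpu_usage.csv
  sleep 1
done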
But the problem is that RMM allocates the GPU memory pool up front, so nvidia-smi won't capture the actual GPU memory used by our plugin; nvidia-smi will only see near-full GPU memory usage from the very beginning to the end. I know we can disable the RMM pool so that nvidia-smi works in this case, but I doubt that memory usage is identical to what it would be when the RMM ASYNC allocator is enabled.
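For reference, a rough sketch of what disabling the pool could look like when submitting the scale test, assuming the plugin's spark.rapids.memory.gpu.pool setting (the exact config name and accepted values should be double-checked against the plugin docs):

# Hypothetical submit line with RMM pooling disabled so nvidia-smi reflects actual allocations.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.gpu.pool=NONE \
  ...   # rest of the scale-test arguments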
We need a better solution here.
It looks like #6745 can satisfy our requirement. We are now trying to profile all queries in the Scale Test. Our use case is just to know the peak GPU memory while running a query. Thanks!

@winningsix According to the issue mentioned by Matt, nsys now supports a track for GPU memory. All we need to do is launch nsys to profile the Spark application and then open the output qdrep file in nsys; we should be able to see the metrics there.
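For reference, a rough sketch of the nsys invocation, assuming local mode so the driver and executors share one process (in cluster mode each executor launch command would have to be wrapped instead), and that the installed nsys version supports the --cuda-memory-usage option:

# Profile the whole application and record GPU memory allocations over time.
nsys profile \
  --trace=cuda,nvtx \
  --cuda-memory-usage=true \
  --output=scale_test_profile \
  spark-submit --master local[*] ...   # the scale-test application
# Open the generated report (qdrep / nsys-rep) in the Nsight Systems GUI to see the GPU memory track.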