CUDA: fixed peer access toggle synchronization #4602
base: master
Conversation
Peer access is only used for
If I remember correctly
What if peer access is only enabled just before the
This may be useful: https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/
I.e., you should be able to keep peer access disabled, and instead make specific memory allocations accessible to peer devices.
The issue isn't the kernel launches. Those only access memory that is on-device, and I think they are unaffected if the change takes place during their execution; that is why I didn't immediately notice the error. In most cases the change to peer access takes effect while a kernel is running and no data is being transferred. The error only occurs during the data transfer between devices. In any case, toggling peer access seems to require some degree of synchronization to make sure no data is currently being transferred, so toggling it immediately before and after memory transfers would kill performance and make it pointless.

Long term I want to make it so that the matrix multiplication kernels can write back the result as q8. That would save some time when converting the hidden state, and I assume it would also make peer access universally faster because ~4x less data would need to be transferred.
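For illustration, the failing transfer is an explicit inter-device copy of roughly the following shape (a sketch only; the buffer, device, and stream names are hypothetical, not llama.cpp's actual code):

```cpp
// Hypothetical inter-device copy of the hidden state between two GPUs.
// If the peer access toggle takes effect while this copy is in flight,
// the transfer errors out; kernels running on either device are unaffected
// because they only touch on-device memory.
cudaError_t err = cudaMemcpyPeerAsync(
    dst_ptr, dst_device,   // destination pointer and device id (hypothetical)
    src_ptr, src_device,   // source pointer and device id (hypothetical)
    nbytes, stream);       // transfer size and stream (hypothetical)
```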
What I take from this is that enabling and disabling peer access is a very expensive operation because all the memory allocations are mapped between devices, and I imagine that the CUDA driver has some synchronization issue while doing this. However, you can manage virtual memory yourself using the low-level APIs, which would allow more granular control.
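A sketch of the per-allocation approach from the linked blog post, using the driver's virtual memory management API. This is illustrative only: the function name and arguments are made up, error handling is omitted, and the calls assume a recent CUDA toolkit.

```cpp
#include <cuda.h>

// Hypothetical helper: allocate `size` bytes on `owner_dev` and make the
// mapping readable/writable from `peer_dev` as well, instead of toggling
// peer access globally with cudaDeviceEnablePeerAccess.
CUdeviceptr alloc_with_peer_access(size_t size, int owner_dev, int peer_dev) {
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = owner_dev;

    // Physical allocations must be a multiple of the granularity.
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);

    // Reserve a virtual address range and map the physical memory into it.
    CUdeviceptr ptr = 0;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    // Grant access to both the owning device and the peer device.
    CUmemAccessDesc access[2] = {};
    access[0].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[0].location.id   = owner_dev;
    access[0].flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    access[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[1].location.id   = peer_dev;
    access[1].flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, access, 2);

    return ptr;
}
```

This would confine peer mappings to, e.g., the scratch buffer, rather than remapping every allocation on every toggle.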
I'll look into it, thanks. |
I did an implementation that directly uses the CUDA driver API and only enables peer access for the scratch buffer. That did not fix the issue. I did the test with 2 P40s; it may be that the feature lacks hardware support on those GPUs.
Doesn't the issue happen when enabling or disabling peer access? If I understand correctly, this should avoid the need to enable peer access entirely. |
If you use the driver API directly you can avoid calls to

The blog post that you shared is from 2020, so I suspect that the ability to enable/disable peer access only for specific memory regions only actually exists on Ampere or newer; for older architectures it's probably being emulated in software. The underlying code for
While investigating #4594 I noticed that sometimes the data transfer between GPUs would throw an error but only for specific model sizes and combinations of input parameters. I noticed that the error can be fixed by disabling peer access.
The specific commands that are causing the error in my case:
What I think is happening: when peer access is enabled or disabled, the change is neither instant nor does the CPU code wait for it to take effect. If the change takes effect during a data transfer, this causes an error. Adding `usleep(100000)` to wait 0.1 seconds after changing peer access fixes the error; adding `cudaDeviceSynchronize` does not. In my specific case, peer access first gets enabled for the warmup eval with a batch size of 2 and then disabled again for the perplexity eval with a batch size of 512.

Feedback or ideas for better solutions would be very welcome.
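The workaround described above can be sketched as follows (a minimal illustration with a hypothetical helper name, not the actual patch; error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <unistd.h>

// Hypothetical helper: enable or disable peer access between every pair of
// devices, then sleep to let the change settle before any peer-to-peer
// copies start.
static void set_peer_access(bool enable) {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    for (int i = 0; i < device_count; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < device_count; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (!can_access) continue;
            if (enable) {
                cudaDeviceEnablePeerAccess(j, 0);   // flags must be 0
            } else {
                cudaDeviceDisablePeerAccess(j);
            }
        }
    }
    // The toggle does not appear to take effect immediately, and
    // cudaDeviceSynchronize() does not help; an arbitrary 0.1 s sleep does.
    usleep(100000);
}
```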