
CUDA: fixed peer access toggle synchronization #4602

Draft
wants to merge 1 commit into base: master
Conversation

JohannesGaessler
Collaborator

@JohannesGaessler commented Dec 22, 2023

While investigating #4594 I noticed that the data transfer between GPUs would sometimes throw an error, but only for specific model sizes and combinations of input parameters. I found that the error can be fixed by disabling peer access.

The specific commands that are causing the error in my case:

export model_name=llama_2-13b && export quantization=q4_0
./perplexity --n-gpu-layers 99 --model models/opt/${model_name}-${quantization}.gguf --file wikitext-2-raw/wiki.test.raw --mlock --chunks 10

What I think is happening: when peer access is enabled or disabled, the change is not instant, nor does the CPU code wait for the change to take effect. If the change takes effect during a data transfer, this causes an error. Adding usleep(100000) to wait 0.1 seconds after changing peer access fixes the error; adding cudaDeviceSynchronize does not. In my specific case, peer access first gets enabled for the warmup eval with a batch size of 2 and then disabled again for the perplexity eval with a batch size of 512.
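The workaround described above can be sketched roughly as follows (a minimal illustration, not the actual llama.cpp code; the helper name and error handling are assumptions):

```cpp
// Sketch: toggle peer access between two devices, then wait for the change
// to propagate. Empirically, cudaDeviceSynchronize() is not sufficient here,
// but a 0.1 s sleep after the toggle avoids errors in later inter-device copies.
#include <cuda_runtime.h>
#include <unistd.h>

static void toggle_peer_access(int device, int peer, bool enable) {
    cudaSetDevice(device);
    if (enable) {
        cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
        (void) err; // cudaErrorPeerAccessAlreadyEnabled is benign
    } else {
        cudaError_t err = cudaDeviceDisablePeerAccess(peer);
        (void) err; // cudaErrorPeerAccessNotEnabled is benign
    }
    usleep(100000); // 0.1 s: wait for the peer access change to take effect
}
```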

Feedback or ideas for better solutions would be very welcome.

@slaren
Collaborator

slaren commented Dec 22, 2023

Peer access is only used for cudaMemcpy, right? If so, have you tried using cudaMemcpyPeerAsync instead of enabling peer access for everything?
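For reference, this suggestion would look roughly like the sketch below (the wrapper name is hypothetical); cudaMemcpyPeerAsync performs a direct device-to-device copy without requiring peer access to be enabled globally, falling back to a staged copy through host memory when direct peer-to-peer transfer is unavailable:

```cpp
#include <cuda_runtime.h>

// Copy nbytes from src on src_dev to dst on dst_dev on the given stream.
void copy_between_gpus(void * dst, int dst_dev, const void * src, int src_dev,
                       size_t nbytes, cudaStream_t stream) {
    cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, nbytes, stream);
}
```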

@JohannesGaessler
Collaborator Author

If I remember correctly, cudaMemcpyPeerAsync had no effect on performance compared to peer access being always disabled. In NVIDIA Nsight Systems the data transfers were also marked as the regular kind that you get by default with cudaMemcpyAsync when peer access is disabled.

@slaren
Collaborator

slaren commented Dec 22, 2023

What if peer access is only enabled just before the cudaMemcpy between devices, and disabled immediately afterwards? I.e., no kernels would be launched with peer access enabled.

@slaren
Collaborator

slaren commented Dec 22, 2023

This may be useful: https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/

Here’s where the new CUDA virtual memory management functions can help. The cuMemSetAccess function allows you to target specific allocations to peer map to a specific set of devices. While this still scales with the number of devices that access it, the common case of just one device remains O(lg(N)). In addition, you don’t need cudaEnablePeerAccess anymore, leaving cudaMalloc calls fast and paying the cost of the additional mappings only where needed.

Ie. you should be able to keep peer access disabled, and instead make certain memory allocations enabled for peer access.
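The approach from the blog post can be sketched like this (a simplified illustration with error handling omitted; the function name is an assumption): reserve a virtual address range, back it with physical memory via cuMemCreate/cuMemMap, then grant access to specific devices with cuMemSetAccess instead of calling cudaDeviceEnablePeerAccess.

```cpp
#include <cuda.h>

// Allocate `size` bytes on owner_dev and map them readable/writable
// by both owner_dev and peer_dev, without enabling global peer access.
CUdeviceptr alloc_with_peer_access(size_t size, int owner_dev, int peer_dev) {
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = owner_dev;

    // Round the size up to the allocation granularity.
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + gran - 1) / gran) * gran;

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    // Grant access only to the owning device and the one peer.
    CUmemAccessDesc access[2] = {};
    access[0].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[0].location.id   = owner_dev;
    access[0].flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    access[1]               = access[0];
    access[1].location.id   = peer_dev;
    cuMemSetAccess(ptr, size, access, 2);

    return ptr;
}
```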

@JohannesGaessler
Collaborator Author

The issue isn't the kernel launches. Those only access on-device memory, and I think they are unaffected if the change takes place during their execution; that is why I didn't immediately notice the error. In most cases the peer access change takes effect while a kernel is running and no data is being transferred. The error only occurs during the data transfer between devices. In any case, toggling peer access seems to require some degree of synchronization to ensure no data is currently being transferred, so toggling it immediately before and after each memory transfer would kill performance and defeat the purpose.

Long term I want to make it so that the matrix multiplication kernels can write back the result as q8. That would save some time when converting the hidden state and I assume it would also make peer access universally faster because ~4x less data would need to be transferred.

@slaren
Collaborator

slaren commented Dec 22, 2023

What I take from this is that enabling and disabling peer access is a very expensive operation because all the memory allocations are mapped between devices. I imagine that the CUDA driver has some synchronization issue while doing this. However you can manage virtual memory yourself using the low level APIs, and that would allow more granular control.

@JohannesGaessler
Collaborator Author

I'll look into it, thanks.

@JohannesGaessler
Collaborator Author

I wrote an implementation that uses the CUDA driver API directly and only enables peer access for the scratch buffer. That did not fix the issue. I tested with 2 P40s; it may be that the feature lacks hardware support on those GPUs.

@slaren
Collaborator

slaren commented Dec 22, 2023

Doesn't the issue happen when enabling or disabling peer access? If I understand correctly, this should avoid the need to enable peer access entirely.

@JohannesGaessler
Collaborator Author

If you use the driver API directly you can avoid calls to cudaEnablePeerAccess and cudaDisablePeerAccess, but instead there is a call to cuMemSetAccess. And that call seems to behave the same in terms of how long it takes, whether it is issued synchronously or asynchronously, and whether an error is thrown during a later memory transfer.

The blog post that you shared is from 2020, so I suspect that the ability to enable/disable peer access only for specific memory regions only exists in hardware on Ampere or newer. On older architectures it is probably emulated in software, in which case the underlying code paths for cuMemSetAccess and cudaEnablePeerAccess would be the same.
