
CUDA: fixed peer access toggle synchronization #4602

Draft
wants to merge 1 commit into base: master
Conversation

JohannesGaessler
Collaborator

@JohannesGaessler commented Dec 22, 2023

While investigating #4594 I noticed that the data transfer between GPUs would sometimes throw an error, but only for specific model sizes and combinations of input parameters. I found that the error can be fixed by disabling peer access.

The specific commands that are causing the error in my case:

export model_name=llama_2-13b && export quantization=q4_0
./perplexity --n-gpu-layers 99 --model models/opt/${model_name}-${quantization}.gguf --file wikitext-2-raw/wiki.test.raw --mlock --chunks 10

What I think is happening: when peer access is enabled or disabled, the change is not instant, nor does the CPU code wait for the change to take effect. If the change takes effect during a data transfer, this causes an error. Adding usleep(100000) to wait 0.1 seconds after changing peer access fixes the error; adding cudaDeviceSynchronize does not. In my specific case, peer access first gets enabled for the warmup eval with a batch size of 2 and then disabled again for the perplexity eval with a batch size of 512.
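The workaround described above can be sketched roughly as follows (a minimal illustration, not the actual llama.cpp code; the helper name and error handling are assumptions):

```cpp
// Sketch: toggle peer access between two devices, then wait for the change
// to propagate. Empirically, cudaDeviceSynchronize() is not sufficient here,
// but a 0.1 s sleep after the toggle avoids errors in later inter-device copies.
#include <cuda_runtime.h>
#include <unistd.h>

static void toggle_peer_access(int device, int peer, bool enable) {
    cudaSetDevice(device);
    if (enable) {
        cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
        (void) err; // cudaErrorPeerAccessAlreadyEnabled is benign
    } else {
        cudaError_t err = cudaDeviceDisablePeerAccess(peer);
        (void) err; // cudaErrorPeerAccessNotEnabled is benign
    }
    usleep(100000); // 0.1 s: wait for the peer access change to take effect
}
```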

Feedback or ideas for better solutions would be very welcome.

@slaren
Collaborator

slaren commented Dec 22, 2023

Peer access is only used for cudaMemcpy, right? If so, have you tried using cudaMemcpyPeerAsync instead of enabling peer access for everything?
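For reference, this suggestion would look roughly like the sketch below (the wrapper name is hypothetical); cudaMemcpyPeerAsync performs a direct device-to-device copy without requiring peer access to be enabled globally, falling back to a staged copy through host memory when direct peer-to-peer transfer is unavailable:

```cpp
#include <cuda_runtime.h>

// Copy nbytes from src on src_dev to dst on dst_dev on the given stream.
void copy_between_gpus(void * dst, int dst_dev, const void * src, int src_dev,
                       size_t nbytes, cudaStream_t stream) {
    cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, nbytes, stream);
}
```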

@JohannesGaessler
Collaborator Author

If I remember correctly, cudaMemcpyPeerAsync had no effect on performance compared to peer access being always disabled. In NVIDIA Nsight Systems the data transfers were also marked as the regular kind that you get by default with cudaMemcpyAsync when peer access is disabled.

@slaren
Collaborator

slaren commented Dec 22, 2023

What if peer access is only enabled just before the cudaMemcpy between devices, and disabled immediately afterwards? I.e., no kernels would be launched with peer access enabled.

@slaren
Collaborator

slaren commented Dec 22, 2023

This may be useful: https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/

Here’s where the new CUDA virtual memory management functions can help. The cuMemSetAccess function allows you to target specific allocations to peer map to a specific set of devices. While this still scales with the number of devices that access it, the common case of just one device remains O(lg(N)). In addition, you don’t need cudaEnablePeerAccess anymore, leaving cudaMalloc calls fast and paying the cost of the additional mappings only where needed.

Ie. you should be able to keep peer access disabled, and instead make certain memory allocations enabled for peer access.
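The approach from the blog post can be sketched like this (a simplified illustration with error handling omitted; the function name is an assumption): reserve a virtual address range, back it with physical memory via cuMemCreate/cuMemMap, then grant access to specific devices with cuMemSetAccess instead of calling cudaDeviceEnablePeerAccess.

```cpp
#include <cuda.h>

// Allocate `size` bytes on owner_dev and map them readable/writable
// by both owner_dev and peer_dev, without enabling global peer access.
CUdeviceptr alloc_with_peer_access(size_t size, int owner_dev, int peer_dev) {
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = owner_dev;

    // Round the size up to the allocation granularity.
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + gran - 1) / gran) * gran;

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    // Grant access only to the owning device and the one peer.
    CUmemAccessDesc access[2] = {};
    access[0].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access[0].location.id   = owner_dev;
    access[0].flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    access[1]               = access[0];
    access[1].location.id   = peer_dev;
    cuMemSetAccess(ptr, size, access, 2);

    return ptr;
}
```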

@JohannesGaessler
Collaborator Author

The issue isn't the kernel launches. Those only access on-device memory, and I think they are unaffected if the change takes place during their execution; that is why I didn't immediately notice the error. In most cases the peer access change takes effect while a kernel is running and no data is being transferred. The error only occurs during the data transfer between devices. In any case, toggling peer access seems to require some degree of synchronization to ensure no data is currently being transferred, so toggling it immediately before and after each memory transfer would kill performance and defeat the purpose.

Long term I want to make it so that the matrix multiplication kernels can write back the result as q8. That would save some time when converting the hidden state and I assume it would also make peer access universally faster because ~4x less data would need to be transferred.

@slaren
Collaborator

slaren commented Dec 22, 2023

What I take from this is that enabling and disabling peer access is a very expensive operation because all the memory allocations are mapped between devices. I imagine that the CUDA driver has some synchronization issue while doing this. However you can manage virtual memory yourself using the low level APIs, and that would allow more granular control.

@JohannesGaessler
Collaborator Author

I'll look into it, thanks.

@JohannesGaessler
Collaborator Author

I wrote an implementation that uses the CUDA driver API directly and only enables peer access for the scratch buffer. That did not fix the issue. I tested with 2 P40s; it may be that the feature lacks hardware support on those GPUs.

@slaren
Collaborator

slaren commented Dec 22, 2023

Doesn't the issue happen when enabling or disabling peer access? If I understand correctly, this should avoid the need to enable peer access entirely.

@JohannesGaessler
Collaborator Author

If you use the driver API directly you can avoid calls to cudaEnablePeerAccess and cudaDisablePeerAccess, but instead there is a call to cuMemSetAccess. And that call seems to behave the same in terms of how long it takes, whether it is issued synchronously or asynchronously, and whether an error is thrown during a later memory transfer.

The blog post that you shared is from 2020, so I suspect that the ability to enable/disable peer access only for specific memory regions only exists in hardware on Ampere or newer. On older architectures it is probably emulated in software, in which case the underlying code paths for cuMemSetAccess and cudaEnablePeerAccess would be the same.
