Bug: Row Split Mode - Segmentation fault after model load on ROCm multi-gpu #9761
Did this work with a previous llama.cpp version? If so, with which commit did it stop working?
@JohannesGaessler This is a recent machine, so I can't say. It was working fine at b3870, but only while AMD's XGMI GPU interconnect link was in place (which allows peer-to-peer GPU communication). Now, without XGMI (I have removed the bridge), it does not work at b3870 either.
@JohannesGaessler I have a core dump, would it help?
A core dump would probably not be of much use. If it worked with the physical link, the problem likely has to do with peer access getting automatically enabled/disabled based on the HIP implementation. For debugging I would like you to try the following two edits:
In principle there could also be issues if multiple threads were to enter that function at the same time, but to my knowledge that shouldn't be happening (@slaren, correct me if I'm wrong).
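As an aside, the concurrent-entry concern mentioned above could be guarded with a once-per-pair lock. The following is only a minimal sketch of that idea, not the actual ggml-cuda code: the function names and the call counter are hypothetical, and in a real implementation the counted call would be a HIP/CUDA runtime call such as enabling peer access.

```cpp
#include <mutex>
#include <set>
#include <utility>

// Hypothetical stand-in for the runtime call (e.g. enabling peer access);
// a counter lets us observe how often it actually runs.
static int g_enable_calls = 0;

static void enable_peer_access_raw(int from, int to) {
    (void)from;
    (void)to;
    ++g_enable_calls;
}

// Enable peer access for a (from, to) device pair exactly once, even if
// several threads reach this path concurrently.
static void enable_peer_access_once(int from, int to) {
    static std::mutex m;
    static std::set<std::pair<int, int>> enabled;
    std::lock_guard<std::mutex> lock(m);
    if (enabled.insert({from, to}).second) {
        enable_peer_access_raw(from, to);
    }
}
```

The mutex plus the set of already-enabled pairs makes duplicate calls no-ops, so racing threads cannot enable (or re-enable) peer access twice for the same pair.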
@JohannesGaessler The proposed fix did not work; I forked and tried it on a fresh repo. I tried
I don't know of anything else to try. You can upload a core dump, but realistically I don't think this will be fixed in the foreseeable future. AMD hardware is, truth be told, quite poorly supported.
@JohannesGaessler I got a backtrace; not sure if you can interpret it. This is on a build at dca1d4b.
Which brings us to this line: llama.cpp/ggml/src/ggml-cuda.cu, line 1354 at dca1d4b.
🤔
I've encountered this issue before with multiple Radeon GPUs on Debian. It was caused by the lack of CONFIG_PCI_P2PDMA and CONFIG_HSA_AMD_P2P in the kernel config. It's worth checking with something like rocm-bandwidth-test to see whether PCIe P2P DMA is working properly.
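For reference, one way to check whether the running kernel was built with these options — a sketch only, since the config location varies by distro (/proc/config.gz, /boot/config-$(uname -r), or elsewhere):

```shell
# Look for the two P2P-related options in the kernel config;
# try /proc/config.gz first, then the /boot copy for the running kernel.
{ zcat /proc/config.gz 2>/dev/null; cat "/boot/config-$(uname -r)" 2>/dev/null; } \
    | grep -E 'CONFIG_PCI_P2PDMA|CONFIG_HSA_AMD_P2P' \
    || echo "P2P options not found in kernel config"
```

If both options come back `=y`, running rocm-bandwidth-test afterwards can confirm that device-to-device transfers actually work over PCIe P2P.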
@hjc4869 Thanks for the hint - it resolved the issue, at least for me. By the way, after updating from ROCm 6.0 to 6.2.1 I no longer had the crash in llama.cpp, but the models would just produce garbage with row split mode. In any case, a custom kernel build with CONFIG_HSA_AMD_P2P enabled resolved the issue completely.
I can repro that locally with my 2x W7900DS setup. It sounds like newer ROCm is having problems here; I guess it's either not checking P2P DMA availability or not handling the lack of P2P correctly, which causes data corruption. Maybe we can document these hiccups for row splitting on ROCm somewhere to save others some headaches.
@hjc4869 thanks as well, I am now building a new kernel with the configs you described. I will get back; I hope this will work! 🤞
Thanks @hjc4869, I have verified that it is working now. @hjc4869 Do you know how a PR to document this should be done? Is a new
Glad to hear it's working. I haven't contributed to this project before, so I'm not sure; maybe we can find some examples among the already merged PRs.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
I am running on ROCm with 4x Instinct MI100.
Only when using --split-mode row do I get an address boundary error. llama.cpp was working when I had an XGMI GPU bridge connecting the 4 cards, but the bridge is now broken and I am trying to run over PCIe only.
My setup currently passes the ROCm validation suite.
Name and Version
llama-cli --version
version: 3889 (b6d6c52)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
llama-server --version
version: 3889 (b6d6c52)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output