Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support Moore Threads GPU #8383

Merged
merged 6 commits into from
Jul 27, 2024
Merged

feat: Support Moore Threads GPU #8383

merged 6 commits into from
Jul 27, 2024

Conversation

yeahdongcn
Copy link
Contributor

@yeahdongcn yeahdongcn commented Jul 9, 2024

Moore Threads, a cutting-edge GPU startup, introduces MUSA (Moore Threads Unified System Architecture) as its foundational technology. This pull request marks the initial integration of MTGPU support into llama.cpp, leveraging MUSA's capabilities to enhance LLM inference performance.

Similar to #1087, CUDA APIs are replaced by MUSA APIs using macros, and a new build option is added to Makefile and CMake.

# make
make GGML_MUSA=1

# CMake
cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release

I also sent a PR to Ollama to integrate MTGPU to it and all the tests were performed through Ollama. Tested models are:

  • tinyllama:latest (1b)
  • llama3:latest (8b)
  • qwen2:72b

@github-actions github-actions bot added documentation Improvements or additions to documentation Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jul 9, 2024
@JohannesGaessler
Copy link
Collaborator

I am one of the primary llama.cpp CUDA developers. I would in principle be willing to buy a Moore Threads GPU and to test any code changes I do in order to assert that they don't break MUSA. On the Moore Threads website I only see a "Buy Now" button for the MTT S80. Would testing and performance optimization on that GPU be representative of an MTT S4000?

@yeahdongcn
Copy link
Contributor Author

yeahdongcn commented Jul 9, 2024

I am one of the primary llama.cpp CUDA developers. I would in principle be willing to buy a Moore Threads GPU and to test any code changes I do in order to assert that they don't break MUSA. On the Moore Threads website I only see a "Buy Now" button for the MTT S80. Would testing and performance optimization on that GPU be representative of an MTT S4000?

Thank you for checking out this PR! Yes, the current code changes were tested on the MTT S4000 (--cuda-gpu-arch=mp_22) and this model of GPU only ships with our data center solution. I will test the code changes on the MTT S80 (--cuda-gpu-arch=mp_21) and let you know the results.

Copy link
Collaborator

@JohannesGaessler JohannesGaessler Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if the MUSA changes were completely separate from the CUDA logic in the Makefile/cmakelists.txt. But I don't feel particularly strongly about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should refine this. But my initial thought is to mostly reuse the CUDA compiler settings and only change the different parts. This is because the CUDA compiler is already well-configured and I don't want to mess up with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After conducting some local testing, I found that if we separate out the MUSA changes from the CUDA logic in the Makefile/CMakeLists.txt, we would end up duplicating a significant amount of code. This would lead to maintenance issues down the line.
I'd love to explore alternative solutions but I'm not sure what the best way to proceed is. I'm open to suggestions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, I don't feel particularly strongly about this. But what I would like is a consistent implementation between HIP and MUSA. So if we take the original approach of this PR then we should also adjust the HIP logic (in a separate PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I understand.

Makefile Show resolved Hide resolved
ggml/src/ggml-cuda.cu Outdated Show resolved Hide resolved
ggml/src/ggml-cuda.cu Outdated Show resolved Hide resolved
ggml/src/ggml-cuda/common.cuh Outdated Show resolved Hide resolved
src/llama.cpp Outdated Show resolved Hide resolved
@yeahdongcn yeahdongcn force-pushed the musa branch 10 times, most recently from dd64710 to bba0b94 Compare July 11, 2024 05:01
Makefile Outdated
CC := clang
CXX := clang++
GGML_CUDA := 1
GGML_NO_OPENMP := 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that the issue with OpenMP is that the MUSA compiler does not work well it. Could the MUSA compiler be used only for the ggml-cuda files instead of completely disabling OpenMP?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out! The MUSA compiler is shipped with our Clang variant, which is based on Clang 14.

  1. Ubuntu 20.04 Requirement: MUSA libraries require Ubuntu 20.04.
  2. OpenMP Version: The latest OpenMP version available on Ubuntu 20.04 (Focal) is libomp-12-dev.

I'm not sure if setting GGML_NO_OPENMP to 0 will cause runtime issues. Therefore, I have set GML_NO_OPENMP to 1 to disable it. Once we upgrade to Ubuntu 22.04, we can re-enable it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, just saw this in my previous work log:

root@b7a6e65eec34:/ws# ldd llama-server 
        linux-vdso.so.1 (0x00007ffdff2de000)
        libmusa.so.1.0 => /usr/local/musa/lib/libmusa.so.1.0 (0x00007f044742e000)
        libmublas.so => /usr/local/musa/lib/libmublas.so (0x00007f043c1a0000)
        libmusart.so.1.0 => /usr/local/musa/lib/libmusart.so.1.0 (0x00007f043c16e000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f043c144000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f043c13e000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f043c134000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f043bf50000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f043be01000)
        libomp.so.5 => /lib/x86_64-linux-gnu/libomp.so.5 (0x00007f043bcff000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f043bce4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f043baf2000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f04477a9000)
        libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007f043bad6000)
        libsrv_um_MUSA.so => /ddk/usr/lib/x86_64-linux-gnu/musa/libsrv_um_MUSA.so (0x00007f043b1bf000)
        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f043b1b2000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f043b196000)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will work without OpenMP, but the performance will be significantly worse in some scenarios such as -nkvo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand. I can perform some tests in the runtime env and check if we can enable it in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to confirm: will llama.cpp accept a hardcoded path like this?

diff --git a/Makefile b/Makefile
index d2d9ce0f..4d10bd74 100644
--- a/Makefile
+++ b/Makefile
@@ -530,7 +530,6 @@ ifdef GGML_MUSA
        CC := clang
        CXX := clang++
        GGML_CUDA := 1
-       GGML_NO_OPENMP := 1
        MK_CPPFLAGS += -DGGML_USE_MUSA
 endif
 
@@ -589,8 +588,8 @@ ifdef GGML_CUDA
                        CUDA_PATH ?= /usr/local/musa
                endif
 
-               MK_CPPFLAGS  += -DGGML_USE_CUDA -I$(CUDA_PATH)/include
-               MK_LDFLAGS   += -lmusa -lmublas -lmusart -lpthread -ldl -lrt -L$(CUDA_PATH)/lib -L/usr/lib64
+               MK_CPPFLAGS  += -DGGML_USE_CUDA -I$(CUDA_PATH)/include -I/usr/lib/llvm-12/lib/clang/12.0.0/include
+               MK_LDFLAGS   += -lmusa -lmublas -lmusart -lpthread -ldl -lrt -L$(CUDA_PATH)/lib -L/usr/lib64 -L/usr/lib/llvm-12/lib
                MK_NVCCFLAGS += -x musa -mtgpu --cuda-gpu-arch=mp_22
        else
                ifneq ('', '$(wildcard /opt/cuda)')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's always the path for the MUSA compiler that may be ok, it depends entirely on the compiler, not on llama.cpp. If it is not, it may cause issues when other people try to build with MUSA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to the default libomp-dev in the builder image and libomp5-10 in the runtime image.
Now GGML_NO_OPENMP := 1 is removed from Makefile (as well as set(GGML_OPENMP OFF) in CMake) and everything looks good in my local tests.
Please see the latest commits. Thanks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the .cu files are being compiled with the mcc compiler, so why change the C/C++ compiler to clang for the rest of the project?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcc is actually a link to clang.

root@9627f6e94585:/ws# ll /usr/local/musa/bin/mcc
lrwxrwxrwx 1 root root 5 Jul 11 13:16 /usr/local/musa/bin/mcc -> clang*

The version of clang ships with MUSA, contains some extra modifications.

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
ggml/src/CMakeLists.txt Outdated Show resolved Hide resolved
ggml/src/CMakeLists.txt Outdated Show resolved Hide resolved
ggml/src/ggml-cuda.cu Outdated Show resolved Hide resolved
@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Jul 13, 2024
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@yeahdongcn yeahdongcn force-pushed the musa branch 3 times, most recently from bf41f17 to 0fc2a35 Compare July 15, 2024 11:16
@yeahdongcn
Copy link
Contributor Author

Eventually we should move all the HIP and MUSA-specific code to its own headers.

No problem. I can start working on this.

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@JohannesGaessler
Copy link
Collaborator

In an earlier post you said:

Thank you for checking out this PR! Yes, the current code changes were tested on the MTT S4000 (--cuda-gpu-arch=mp_22) and this model of GPU only ships with our data center solution. I will test the code changes on the MTT S80 (--cuda-gpu-arch=mp_21) and let you know the results.

Have there been any updates on this?

@yeahdongcn
Copy link
Contributor Author

yeahdongcn commented Jul 25, 2024

In an earlier post you said:

Thank you for checking out this PR! Yes, the current code changes were tested on the MTT S4000 (--cuda-gpu-arch=mp_22) and this model of GPU only ships with our data center solution. I will test the code changes on the MTT S80 (--cuda-gpu-arch=mp_21) and let you know the results.

Have there been any updates on this?

I've encountered some compilation issues on S80 toolchain and have opened several internal tickets to the compiler team. I'll monitor the progress and keep you updated.

The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to link

@slaren slaren merged commit e54c35e into ggerganov:master Jul 27, 2024
53 checks passed
@1823616178
Copy link

make error

ubuntu 20.04.06 LTS
musa driver 2.7.0
SDK: MUSA+SDK-rc2.1.0_Intel_CPU_Ubuntu_chunxiao

make GGML_MUSA=1
image
Please help me.

@1823616178
Copy link

My video card is s80,cpu is amd 2600

@yeahdongcn
Copy link
Contributor Author

make error

ubuntu 20.04.06 LTS musa driver 2.7.0 SDK: MUSA+SDK-rc2.1.0_Intel_CPU_Ubuntu_chunxiao

make GGML_MUSA=1 image Please help me.

Please see the above comments:

I've encountered some compilation issues on S80 toolchain and have opened several internal tickets to the compiler team. I'll monitor the progress and keep you updated.

The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to link

We are still investigating this issue internally. Please expect a new release of MUSA SDK and llama.cpp PR.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Aug 2, 2024
* Update doc for MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* CUDA => MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix CI build failure

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@XenoAmess
Copy link

make error
ubuntu 20.04.06 LTS musa driver 2.7.0 SDK: MUSA+SDK-rc2.1.0_Intel_CPU_Ubuntu_chunxiao
make GGML_MUSA=1 image Please help me.

Please see the above comments:

I've encountered some compilation issues on S80 toolchain and have opened several internal tickets to the compiler team. I'll monitor the progress and keep you updated.
The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to link

We are still investigating this issue internally. Please expect a new release of MUSA SDK and llama.cpp PR.

any progress in s80?

@1823616178
Copy link

make error
ubuntu 20.04.06 LTS musa driver 2.7.0 SDK: MUSA+SDK-rc2.1.0_Intel_CPU_Ubuntu_chunxiao
make GGML_MUSA=1 image Please help me.

Please see the above comments:

I've encountered some compilation issues on S80 toolchain and have opened several internal tickets to the compiler team. I'll monitor the progress and keep you updated.
The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to link

We are still investigating this issue internally. Please expect a new release of MUSA SDK and llama.cpp PR.

any progress in s80?

I guess we'll have to wait for the next version of the SDK

@yeahdongcn
Copy link
Contributor Author

I guess we'll have to wait for the next version of the SDK

Yes, please give us more time.

@Ivening
Copy link

Ivening commented Sep 19, 2024

@yeahdongcn please help - have got the problem with compiling llama.cpp using MUSA SDK rc2.0.0 with Ubuntu 20.04.6 LTS:
running make GGML_MUSA=1 shows the following error:
Screenshot 2024-09-19 141040
Is there something I am doing wrong for the compilation?

@yeahdongcn
Copy link
Contributor Author

@Ivening We are still working on MTT S80 support, please see: #9526

If you are interested in running llama.cpp on MTT S80, please add me through WeChat: yeahdongcn.

@Ivening
Copy link

Ivening commented Sep 19, 2024

@yeahdongcn thank you for your reply! Will this code work with MTT S3000?

@yeahdongcn
Copy link
Contributor Author

@yeahdongcn thank you for your reply! Will this code work with MTT S3000?

Haha, it seems that you're one of our business customers! MTT S3000 shares the same architecture as MTT S80, I can test on MTT S3000 as well.

@yeahdongcn yeahdongcn mentioned this pull request Sep 23, 2024
6 tasks
@arch-btw
Copy link
Contributor

@yeahdongcn what speeds can we expect for ~8B models for the MTT S80?

@yeahdongcn
Copy link
Contributor Author

yeahdongcn commented Oct 22, 2024

@yeahdongcn what speeds can we expect for ~8B models for the MTT S80?

~15 tokens/s (llama3.1:8b)

Please also see the recording on llama3.2:1b: asciicast

@arch-btw
Copy link
Contributor

@yeahdongcn very good, thank you

@yeahdongcn yeahdongcn deleted the musa branch October 22, 2024 22:59
@yeahdongcn yeahdongcn restored the musa branch October 22, 2024 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants