
Make updates to fix issues with clang-cl builds while using AVX512 flags #10314

Merged
2 commits merged on Nov 15, 2024

Conversation

@Srihari-mcw (Contributor) commented Nov 15, 2024

  • The PR contains fixes for building llama.cpp with the extended AVX512 flags (AVX512_BF16, AVX512_VNNI, AVX512_VBMI) using clang-cl and the Visual Studio Generator
  • Issue: when building with clang-cl and the Visual Studio Generator with the AVX512_VBMI, AVX512_VNNI and AVX512_BF16 flags enabled, the build failed with the errors shown in the attached screenshots

Fix: pass -mavx512vnni, -mavx512vbmi and -mavx512bf16 as part of the compile options when the compiler is clang-cl (a sketch of the kind of change is shown below).
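A minimal sketch of the kind of CMake change involved (not the exact PR diff; it assumes the ARCH_FLAGS list that ggml's CPU CMakeLists already uses to collect architecture flags):

```cmake
# Sketch only: clang-cl accepts GCC/Clang-style -m flags (unlike MSVC's cl.exe),
# so the extended AVX512 flags are appended when the front end is clang-cl.
# CMake reports clang-cl as compiler ID "Clang" with MSVC set to TRUE.
if (MSVC AND CMAKE_C_COMPILER_ID MATCHES "Clang")
    if (GGML_AVX512_VNNI)
        list(APPEND ARCH_FLAGS -mavx512vnni)
    endif()
    if (GGML_AVX512_VBMI)
        list(APPEND ARCH_FLAGS -mavx512vbmi)
    endif()
    if (GGML_AVX512_BF16)
        list(APPEND ARCH_FLAGS -mavx512bf16)
    endif()
endif()
```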

Also,

  • Experiments were conducted to build llama.cpp with MSVC (Visual Studio Generator), Clang (Ninja Generator) and clang-cl (Visual Studio Generator) and to compare their performance
  • Improvements were seen with clang-cl (VS Generator) and Clang (Ninja Generator) over the MSVC compiler
  • Both the Ninja and Clang dependencies were part of the Visual Studio installation

Visual Studio Generator, Clang-Cl (17.0.3) - the configuration for which the fixes are provided

  • For building with the Visual Studio Generator, clang-cl is obtained from the Visual Studio installation path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\Llvm\x64\bin
  • CMake command: cmake -G "Visual Studio 17 2022" -T ClangCL -A x64 -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_AVX512_VBMI=ON -DGGML_AVX512_VNNI=ON -DGGML_AVX512_BF16=ON
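After configuring, the project can be compiled with the standard CMake build driver; a minimal sketch, assuming the configure step above placed the build tree in build/:

```
cmake --build build --config Release
```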

Ninja Generator, Clang.exe/Clang++.exe (17.0.3)

  • clang.exe and clang++.exe are obtained from the same path as clang-cl - C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\Llvm\x64\bin
  • CMake command: cmake -G "Ninja" -S . -B Windows-build/ -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang.exe -DCMAKE_CXX_COMPILER=clang++.exe
  • Ninja path: C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja\ninja.exe, version 1.11.0
  • Additional library dependencies are resolved with set LIB=C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\um\x64;C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\lib\x64\uwp;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\x64 before configuring and building (the full sequence is sketched below)
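Putting the steps above together, a minimal batch sketch of the Ninja/clang workflow (the LIB paths are those quoted above; SDK and MSVC version numbers will differ between installations):

```
set LIB=C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\um\x64;C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\lib\x64\uwp;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\x64
cmake -G "Ninja" -S . -B Windows-build/ -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang.exe -DCMAKE_CXX_COMPILER=clang++.exe
cmake --build Windows-build/
```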

Both clang builds were done without OpenMP dependencies.

The tests were conducted on an AMD Raphael 7600X, which supports the following flags (the same were kept enabled in our Windows tests):
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1|

Prompt Processing

| model | Compiler/Generator | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | MSVC Compiler / VS Generator | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 48.14 ± 0.67 | | 231f936 |
| llama 7B Q4_0 | clang-cl.exe / VS Generator | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 50.23 ± 0.91 | 4.33% | 231f936 |
| llama 7B Q4_0 | clang.exe / Ninja Generator | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 55.69 ± 0.71 | 15.68% | 231f936 |

Text Generation

| model | Compiler/Generator | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | MSVC Compiler / VS Generator | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.8 ± 0.01 | | 231f936 |
| llama 7B Q4_0 | clang-cl.exe / VS Generator | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.95 ± 0.02 | 1.00% | 231f936 |
| llama 7B Q4_0 | clang.exe / Ninja Generator | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.95 ± 0.02 | 1.00% | 231f936 |

Tests were done with the Meta Llama 2 7B model.
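For context, pp 512 and tg 128 are the default tests reported by llama.cpp's llama-bench tool; a typical invocation matching the settings above might look like this (the model path is illustrative):

```
llama-bench -m models/llama-2-7b.Q4_0.gguf -p 512 -n 128 -t 6
```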

Having a clang Windows build (especially with the Ninja Generator) as part of the releases could help Windows users. Authors/maintainers of llama.cpp, could you please share your thoughts on this? Thanks

@slaren slaren merged commit 74d73dc into ggerganov:master Nov 15, 2024
54 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
@Srihari-mcw (Contributor Author) commented:

Having a clang Windows build for x86/x64 (especially with the Ninja Generator) as part of the releases could help Windows users, based on the testing above. @slaren, could you please share your thoughts on this? Thanks

@slaren (Collaborator) commented Nov 20, 2024

Sure, but I wonder if performance would be better with gcc/mingw64.

@Srihari-mcw (Contributor Author) commented:

Tested llama.cpp built with w64devkit GCC using make (please see the tables below for commit details) and observed that text generation is at a disadvantage with this build compared to the results shared above.

GCC 12.2

Prompt Processing

| model | Compiler | size | params | backend | threads | test | t/s | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | GCC Compiler w64devkit | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 57.88 ± 0.89 | a5e4759 |

Text Generation

| model | Compiler | size | params | backend | threads | test | t/s | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | GCC Compiler w64devkit | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 10.31 ± 0.01 | a5e4759 |

@slaren (Collaborator) commented Dec 5, 2024

I think this is because the OpenMP implementation in MinGW does not perform very well. In my tests, MinGW with -DGGML_OPENMP=OFF performs the best. Additionally, both clang and msvc have issues building all the CPU backend variants due to missing intrinsics, e.g. AVX-VNNI can only be used with gcc. I will be updating the Windows CI in the next few days.
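For reference, one way such a MinGW/GCC build with OpenMP disabled might be configured (a sketch only; it assumes gcc/g++ from w64devkit or MSYS2 are on the PATH):

```
cmake -G "Ninja" -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DGGML_OPENMP=OFF
cmake --build build
```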

@Srihari-mcw (Contributor Author) commented:

Hi @slaren,

Following the latest changes, the Q4_0 model was tested with both clang and GCC (both with OpenMP disabled), and we observed that, for Q4_0, clang gives much better performance after these changes (GCC also shows gains).

Prompt Processing

| model | Compiler | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | GCC | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 63.15 ± 1.07 | | 64ae065 |
| llama 7B Q4_0 | Clang | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 72.66 ± 0.81 | 15.06% | 64ae065 |

Text Generation

| model | Compiler | size | params | backend | threads | test | t/s | speedup | Commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | GCC | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.66 ± 0.06 | | 64ae065 |
| llama 7B Q4_0 | Clang | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.62 ± 0.08 | -0.27% | 64ae065 |

Currently the build.yml script has provisions for Windows builds with MSVC (link). Can we also add x64-win-llvm as part of the same? Can you please share your insights on future plans for the CI/CD pipeline? Thanks

@slaren (Collaborator) commented Dec 13, 2024

My goal is to build with GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS so that the CPU backend can be distributed as a single package for all architectures. Unfortunately, clang has issues with some instruction sets and cannot complete this build. Are you able to build with clang and -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON?

@Srihari-mcw (Contributor Author) commented:

Hi @slaren,

#11027

Please check out the above PR - with the changes there we were able to build with clang without issues using the options -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON. Additionally, we kept OpenMP off while testing. Thanks
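A sketch of the configuration described above (clang building all CPU backend variants as a dynamically loadable backend with OpenMP off; compiler paths and generator as in the earlier Ninja setup):

```
cmake -G "Ninja" -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang.exe -DCMAKE_CXX_COMPILER=clang++.exe -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF
cmake --build build
```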
