
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats #10721

Merged
merged 3 commits into master from 0cc4m/vulkan-subgroup-size-control
Dec 12, 2024

Conversation

0cc4m
Collaborator

@0cc4m 0cc4m commented Dec 8, 2024

At least on the latest Mesa ANV (24.3.1) this improves pp speed on Intel, so the XMX engines seem to be doing something. Performance is still lower than expected.

Maybe this fixes the coopmat issue with AMD on Windows? I'll leave it as a draft while we figure out how all the hardware/driver/OS combinations react to it.
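
For reference, this is roughly how the extension is used with the raw Vulkan C API: chain a required-subgroup-size struct into the compute stage and request full subgroups. This is only a hedged sketch, not the code in this PR (ggml-vulkan.cpp goes through vulkan.hpp), and it assumes the `subgroupSizeControl` and `computeFullSubgroups` features were enabled at device creation:

```cpp
#include <vulkan/vulkan.h>

// Sketch: build a compute pipeline that runs with full subgroups of a fixed size
// (e.g. the size the coopmat shaders were written for).
static VkPipeline create_pipeline_full_subgroups(VkDevice device, VkPipelineLayout layout,
                                                 VkShaderModule shader, uint32_t subgroup_size) {
    VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT subgroup_info = {};
    subgroup_info.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
    subgroup_info.requiredSubgroupSize = subgroup_size;  // pin the subgroup size

    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.pNext  = &subgroup_info;
    // The coopmat shaders assume the workgroup consists of full subgroups only.
    stage.flags  = VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT;
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = shader;
    stage.pName  = "main";

    VkComputePipelineCreateInfo pipeline_info = {};
    pipeline_info.sType  = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    pipeline_info.stage  = stage;
    pipeline_info.layout = layout;

    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipeline_info, nullptr, &pipeline);
    return pipeline;
}
```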

@0cc4m
Collaborator Author

0cc4m commented Dec 8, 2024

A quick check showed that the latest Mesa fixed the low performance I saw on my Intel A770, regardless of this subgroup change.

Edit: However, newer Mesa also reduces tg performance by 25%.

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 8, 2024
@easyfab

easyfab commented Dec 8, 2024

Intel's Mesa drivers are really a complete mess!

With A770:

Master + mesa (24.0.9-0ubuntu0.2):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 142.42 ± 0.15 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 54.51 ± 0.04 |

PR + mesa (24.0.9-0ubuntu0.2):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 70.36 ± 0.05 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 54.57 ± 0.08 |

PR + mesa kisak/kisak-mesa (24.3.1):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 373.20 ± 1.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 37.26 ± 0.02 |

PR + mesa oibaf/graphics-drivers (git 25.0):

ggml_vulkan: Compiling shaders..........................Done!
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost

And for comparison, the results with SYCL (F16 on):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | pp512 | 926.04 ± 0.27 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | tg128 | 40.28 ± 0.03 |

@0cc4m
Collaborator Author

0cc4m commented Dec 8, 2024

Yeah, that's close to what I saw as well. Thank you for testing it.

@qnixsynapse
Contributor

qnixsynapse commented Dec 9, 2024

Hi @0cc4m, is it possible to set the warp size to 16 for Intel GPUs?

Not sure why it hangs the GPU and crashes here:

ggml_vulkan: 0 = Intel(R) Arc(tm) A750 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: KHR_coopmat
[....]
MUL_MAT(type_a=f16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
Aborted (core dumped)

@0cc4m
Collaborator Author

0cc4m commented Dec 9, 2024

Yeah, I think that might be possible, but the crash is some other problem. Even m=1 works; the coopmat simply gets zero-padded in that case.
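
For illustration, a hedged sketch of what picking a smaller subgroup size would involve; the names below are not the actual ggml-vulkan.cpp code. The VK_EXT_subgroup_size_control properties report the supported range, and 16 could only be requested if it falls inside it (Mesa ANV typically advertises 8-32 for compute):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: query the subgroup size range reported via VK_EXT_subgroup_size_control
// and return the preferred size (e.g. 16) only if the driver supports it.
static uint32_t pick_subgroup_size(VkPhysicalDevice dev, uint32_t preferred) {
    VkPhysicalDeviceSubgroupSizeControlPropertiesEXT subgroup_props = {};
    subgroup_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_PROPERTIES_EXT;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup_props;
    vkGetPhysicalDeviceProperties2(dev, &props2);

    if (preferred >= subgroup_props.minSubgroupSize &&
        preferred <= subgroup_props.maxSubgroupSize) {
        return preferred;                    // e.g. 16 on Intel ANV
    }
    return subgroup_props.maxSubgroupSize;   // fall back to the largest supported size
}
```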

@qnixsynapse
Contributor

qnixsynapse commented Dec 9, 2024

Hmm. It's failing on n = 2 and passing on n = 1.
Also, I did not see any activity on the compute engine; the 3D shader engine fills up to 100% before the crash.

Edit: It is also crashing at model warmup!

llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_vulkan: Compiling shaders..........................Done!
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:      Vulkan0 model buffer size =  4986.92 MiB
llm_load_tensors:  Vulkan_Host model buffer size =  1125.00 MiB
.................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =   504.50 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
Aborted (core dumped)

@0cc4m
Collaborator Author

0cc4m commented Dec 9, 2024

> Hmm. It's failing on n = 2 and passing on n = 1. Also, I did not see any activity on the compute engine; the 3D shader engine fills up to 100% before the crash.
>
> Edit: It is also crashing at model warmup!

n = 2 uses coopmats, n = 1 does not. Whether intel_gpu_top shows the load under the compute bar or the 3D shader bar doesn't matter; don't trust it, it's the same hardware.

Model warmup uses n = 2, so it's the same kind of crash.
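
To make that split concrete, a toy sketch (not the backend's actual dispatch code) of the distinction described above:

```cpp
#include <cstdint>

enum class mul_mat_path { mat_vec, mat_mat };

// n_cols corresponds to src1->ne[1] in ggml terms, the column count of the
// second matmul operand. The real backend has more conditions; this only
// captures the n == 1 vs n >= 2 split discussed here.
static mul_mat_path choose_mul_mat_path(int64_t n_cols) {
    return n_cols == 1 ? mul_mat_path::mat_vec   // matrix-vector shader, no coopmat
                       : mul_mat_path::mat_mat;  // matrix-matrix shader, uses KHR_coopmat
}
```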

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-subgroup-size-control branch from 595c1a7 to 239927a on December 12, 2024 07:41
@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-subgroup-size-control branch from 239927a to 9131c59 on December 12, 2024 07:46
@0cc4m
Collaborator Author

0cc4m commented Dec 12, 2024

I added a few more fixes for coopmat support and disabled it again for now on Intel and non-Mesa AMD. I've made some progress on supporting them, but it's not yet ready and I don't want to delay #10665 further.
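
For reference, a hedged sketch of what gating coopmat support by driver can look like, using VkPhysicalDeviceDriverProperties; the exact conditions in this PR may differ:

```cpp
#include <vulkan/vulkan.h>

// Sketch: decide whether to enable the KHR_coopmat path based on which driver
// is running. Mirrors the idea of "disable on Intel and non-Mesa AMD"; the
// actual backend logic may use different or additional checks.
static bool coopmat_allowed_for_device(VkPhysicalDevice dev) {
    VkPhysicalDeviceDriverProperties driver_props = {};
    driver_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &driver_props;
    vkGetPhysicalDeviceProperties2(dev, &props2);

    switch (driver_props.driverID) {
        case VK_DRIVER_ID_MESA_RADV:                 // AMD via Mesa RADV stays enabled
            return true;
        case VK_DRIVER_ID_AMD_PROPRIETARY:           // Windows proprietary driver
        case VK_DRIVER_ID_AMD_OPEN_SOURCE:           // amdvlk
        case VK_DRIVER_ID_INTEL_OPEN_SOURCE_MESA:    // ANV
        case VK_DRIVER_ID_INTEL_PROPRIETARY_WINDOWS:
            return false;                            // disabled for now
        default:
            return true;                             // other vendors keep coopmat
    }
}
```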

@0cc4m 0cc4m marked this pull request as ready for review December 12, 2024 07:57
@0cc4m 0cc4m requested a review from jeffbolznv December 12, 2024 07:57
Collaborator

@jeffbolznv jeffbolznv left a comment

LGTM. I won't have a chance to test it today, but please go ahead.

@0cc4m 0cc4m merged commit dc5301d into master Dec 12, 2024
47 checks passed
@0cc4m 0cc4m deleted the 0cc4m/vulkan-subgroup-size-control branch December 12, 2024 17:35
0cc4m pushed a commit that referenced this pull request Dec 13, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires #10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
ggerganov pushed a commit to ggerganov/ggml that referenced this pull request Dec 17, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires ggerganov/llama.cpp#10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
ggerganov pushed a commit to ggerganov/whisper.cpp that referenced this pull request Dec 17, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires ggerganov/llama.cpp#10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
ggerganov pushed a commit to ggerganov/whisper.cpp that referenced this pull request Dec 18, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires ggerganov/llama.cpp#10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats (ggerganov#10721)

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* Fix subgroup size control extension support check

Add accf32 and accf16 checks for coopmats

* Also disable coopmats on amdvlk
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires ggerganov#10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
github-actions bot pushed a commit to martin-steinegger/ProstT5-llama that referenced this pull request Dec 30, 2024
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires ggerganov/llama.cpp#10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs