vulkan: select only one device for single gpu with multiple drivers #7582

Adriankhl · 2024-05-28T02:53:47Z

An issue which an user encountered using my library: Adriankhl/godot-llm#14

Sometimes, you have multiple drivers installed for the same GPU, e.g.

And in vulkan, the GPU will be listed as separated physical devices. It causes an issue when llama.cpp attempts to split the layer.

One workaround is to set split mode to "none", which didn't work before but I fixed it here: #7552

This PR fixes the issue by checking the deviceID, and if there are two devices with the same ID, pick the one with a larger heap size.

I don't have a discrete GPU myself so this needs to be tested by someone else with a discrete GPU and multiple drivers for the same GPU.

The device selection criteria is also a bit naive, if anyone has a better suggestion please let me know.

0cc4m · 2024-05-28T17:43:54Z

I don't think picking by heap size will do anything, since that's a hardware feature any driver should report identical numbers there.

Currently you can handle this case manually by setting the devices to be used with GGML_VK_VISIBLE_DEVICES or by keeping only a single driver installed (which I would recommend, generally)

I know only of Linux where AMD has 3 available drivers (RADV, AMDVLK and whatever the pro Vulkan driver is called), and Nvidia has their proprietary driver and the new NVK one that is being developed. And now there's Windows (which is the one reported on your repo), with their Vulkan to DX12 thing. I don't know if that can be disabled.

I would write the logic like this: In the AMD case, you should pick RADV. In the Nvidia case, you should pick the proprietary driver. In any other case, you could pick the first available one.

What do you think?

Adriankhl · 2024-05-29T02:13:08Z

@0cc4m Make sense. I get a list of driver names here: https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.2&name=driverName&platform=all

And this is my current priorities, smaller numbers are prioritized

                    std::map<std::string, int> driver_priorities {};
                    driver_priorities["NVIDIA"] = 1;
                    driver_priorities["nvk"] = 2;
                    driver_priorities["radv"] = 3;
                    driver_priorities["AMD open-source driver"] = 4;
                    driver_priorities["AMD proprietary driver"] = 5;
                    driver_priorities["Intel open-source Mesa driver"] = 6;
                    driver_priorities["Intel Corporation"] = 7;

0cc4m · 2024-06-02T06:48:44Z

That's better, but I would prefer if we avoided the string comparison here. I propose to use the data in VkPhysicalDeviceDriverProperties. We can use the device vendor ID similar to here and afterwards select by the driver ID.

That should simplify the code.

Also, please don't write to stdout. stderr is sometimes okay, but I think it's not necessary for this feature to output anything.

Adriankhl · 2024-06-06T04:24:01Z

@0cc4m Ok, changed it.

I still keep the std::cerr output when GGML_VULKAN_DEBUG is enabled, I think it is good to let the user know which driver is selected.

0cc4m · 2024-06-08T08:38:32Z

@0cc4m Ok, changed it.

I still keep the std::cerr output when GGML_VULKAN_DEBUG is enabled, I think it is good to let the user know which driver is selected.

I think informing the user which driver is in use is helpful. But GGML_VULKAN_DEBUG is a very verbose debugging mode for developers, so that's not gonna help with that. Your debug outputs should stay like that I think, but if you want to inform the user you could add this:

diff --git a/ggml-vulkan.cpp b/ggml-vulkan.cpp
index 7a82f9f9..0d8c1cd7 100644
--- a/ggml-vulkan.cpp
+++ b/ggml-vulkan.cpp
@@ -1550,8 +1550,10 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     vk::PhysicalDeviceProperties2 props2;
     vk::PhysicalDeviceMaintenance3Properties props3;
     vk::PhysicalDeviceSubgroupProperties subgroup_props;
+    vk::PhysicalDeviceDriverProperties driver_props;
     props2.pNext = &props3;
     props3.pNext = &subgroup_props;
+    subgroup_props.pNext = &driver_props;
     physical_device.getProperties2(&props2);

     const size_t subgroup_size = subgroup_props.subgroupSize;
@@ -1595,7 +1597,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     fp16 = fp16 && vk12_features.shaderFloat16;

     std::string device_name = props2.properties.deviceName.data();
-    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;
+    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " (" << driver_props.driverName << ") | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;

     if (props2.properties.deviceType == vk::PhysicalDeviceType::eCpu) {
         std::cerr << "ggml_vulkan: Warning: Device type is CPU. This is probably not the device you want." << std::endl;

That way the driver name is added to the device info.

I was able to test the code using AMD's drivers RADV and AMDVLK on Linux and it works fine. Thank you for your contribution.

Adriankhl · 2024-06-08T13:11:15Z

Now also added the driver information to ggml_vk_print_gpu_info, ready to merge

Adriankhl · 2024-06-08T14:53:04Z

Wait, NVK doesn't exist in old vulkan sdk, let me see how to fix the CI

0cc4m · 2024-06-08T15:38:20Z

I can't track your changes if you force push every time. It's okay to have multiple commits, they'll get squashed on merge. Are you still working on it?

Adriankhl · 2024-06-08T23:18:44Z

@0cc4m Sorry, I didn't know it will get squashed. Now the problem is fixed by checking the VK_HEADER_VERSION macro, also I found vk::DriverId enum is there for cpp binding, here is the diff

 static void ggml_vk_print_gpu_info(size_t idx) {
@@ -1550,8 +1567,10 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     vk::PhysicalDeviceProperties2 props2;
     vk::PhysicalDeviceMaintenance3Properties props3;
     vk::PhysicalDeviceSubgroupProperties subgroup_props;
+    vk::PhysicalDeviceDriverProperties driver_props;
     props2.pNext = &props3;
     props3.pNext = &subgroup_props;
+    subgroup_props.pNext = &driver_props;
     physical_device.getProperties2(&props2);

     const size_t subgroup_size = subgroup_props.subgroupSize;
@@ -1595,7 +1614,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     fp16 = fp16 && vk12_features.shaderFloat16;

     std::string device_name = props2.properties.deviceName.data();
-    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;
+    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " (" << driver_props.driverName << ") | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;

@@ -1724,17 +1743,19 @@ void ggml_vk_instance_init() {
                     // Smaller number -> higher priority
                     switch (old_prop.properties.vendorID) {
                         case VK_VENDOR_ID_AMD:
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_MESA_RADV)] = 1;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_AMD_OPEN_SOURCE)] = 2;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_AMD_PROPRIETARY)] = 3;
+                            driver_priorities[vk::DriverId::eMesaRadv] = 1;
+                            driver_priorities[vk::DriverId::eAmdOpenSource] = 2;
+                            driver_priorities[vk::DriverId::eAmdProprietary] = 3;
                             break;
                         case VK_VENDOR_ID_INTEL:
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_INTEL_OPEN_SOURCE_MESA)] = 1;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_INTEL_PROPRIETARY_WINDOWS)] = 2;
+                            driver_priorities[vk::DriverId::eIntelOpenSourceMESA] = 1;
+                            driver_priorities[vk::DriverId::eIntelProprietaryWindows] = 2;
                             break;
                         case VK_VENDOR_ID_NVIDIA:
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_NVIDIA_PROPRIETARY)] = 1;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_MESA_NVK)] = 2;
+                            driver_priorities[vk::DriverId::eNvidiaProprietary] = 1;
+#if defined(VK_API_VERSION_1_3) && VK_HEADER_VERSION >= 235
+                            driver_priorities[vk::DriverId::eMesaNvk] = 2;
+#endif
                             break;
                     }

richardanaya · 2024-06-19T01:16:15Z

I believe this PR broke multiple graphics cards for users of vulkan. I have two 7900xtx (exact same card), and now it will not detect two cards. Bug: #7997

github-actions bot added the Vulkan Issues specific to the Vulkan backend label May 28, 2024

mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label May 28, 2024

0cc4m self-requested a review May 28, 2024 05:01

Adriankhl force-pushed the fix-multiple-vullkan-driver branch 2 times, most recently from 13b83f6 to 243b5ef Compare May 29, 2024 02:09

Adriankhl force-pushed the fix-multiple-vullkan-driver branch from 5d9d14d to cc612eb Compare June 6, 2024 04:25

0cc4m approved these changes Jun 8, 2024

View reviewed changes

Adriankhl force-pushed the fix-multiple-vullkan-driver branch from cc612eb to 4bc3a79 Compare June 8, 2024 13:10

Adriankhl force-pushed the fix-multiple-vullkan-driver branch from 4bc3a79 to 64cf939 Compare June 8, 2024 15:09

Adriankhl force-pushed the fix-multiple-vullkan-driver branch from 64cf939 to b791f7f Compare June 8, 2024 22:47

vulkan: select only one device for single gpu with multiple drivers

e8a6515

Adriankhl force-pushed the fix-multiple-vullkan-driver branch from b791f7f to e8a6515 Compare June 8, 2024 23:15

0cc4m merged commit 73bac2b into ggerganov:master Jun 11, 2024
53 of 66 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vulkan: select only one device for single gpu with multiple drivers #7582

vulkan: select only one device for single gpu with multiple drivers #7582

Adriankhl commented May 28, 2024 •

edited

Loading

0cc4m commented May 28, 2024

Adriankhl commented May 29, 2024 •

edited

Loading

0cc4m commented Jun 2, 2024

Adriankhl commented Jun 6, 2024

0cc4m commented Jun 8, 2024

Adriankhl commented Jun 8, 2024

Adriankhl commented Jun 8, 2024

0cc4m commented Jun 8, 2024

Adriankhl commented Jun 8, 2024 •

edited

Loading

richardanaya commented Jun 19, 2024

vulkan: select only one device for single gpu with multiple drivers #7582

vulkan: select only one device for single gpu with multiple drivers #7582

Conversation

Adriankhl commented May 28, 2024 • edited Loading

0cc4m commented May 28, 2024

Adriankhl commented May 29, 2024 • edited Loading

0cc4m commented Jun 2, 2024

Adriankhl commented Jun 6, 2024

0cc4m commented Jun 8, 2024

Adriankhl commented Jun 8, 2024

Adriankhl commented Jun 8, 2024

0cc4m commented Jun 8, 2024

Adriankhl commented Jun 8, 2024 • edited Loading

richardanaya commented Jun 19, 2024

Adriankhl commented May 28, 2024 •

edited

Loading

Adriankhl commented May 29, 2024 •

edited

Loading

Adriankhl commented Jun 8, 2024 •

edited

Loading