Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vulkan: select only one device for single gpu with multiple drivers #7582

Merged
merged 1 commit into from
Jun 11, 2024

Conversation

Adriankhl
Copy link
Contributor

@Adriankhl Adriankhl commented May 28, 2024

An issue which an user encountered using my library: Adriankhl/godot-llm#14

Sometimes, you have multiple drivers installed for the same GPU, e.g.

And in vulkan, the GPU will be listed as separated physical devices. It causes an issue when llama.cpp attempts to split the layer.

One workaround is to set split mode to "none", which didn't work before but I fixed it here: #7552

This PR fixes the issue by checking the deviceID, and if there are two devices with the same ID, pick the one with a larger heap size.

I don't have a discrete GPU myself so this needs to be tested by someone else with a discrete GPU and multiple drivers for the same GPU.

The device selection criteria is also a bit naive, if anyone has a better suggestion please let me know.

@github-actions github-actions bot added the Vulkan Issues specific to the Vulkan backend label May 28, 2024
@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label May 28, 2024
@0cc4m 0cc4m self-requested a review May 28, 2024 05:01
@0cc4m
Copy link
Collaborator

0cc4m commented May 28, 2024

I don't think picking by heap size will do anything, since that's a hardware feature any driver should report identical numbers there.

Currently you can handle this case manually by setting the devices to be used with GGML_VK_VISIBLE_DEVICES or by keeping only a single driver installed (which I would recommend, generally)

I know only of Linux where AMD has 3 available drivers (RADV, AMDVLK and whatever the pro Vulkan driver is called), and Nvidia has their proprietary driver and the new NVK one that is being developed. And now there's Windows (which is the one reported on your repo), with their Vulkan to DX12 thing. I don't know if that can be disabled.

I would write the logic like this: In the AMD case, you should pick RADV. In the Nvidia case, you should pick the proprietary driver. In any other case, you could pick the first available one.

What do you think?

@Adriankhl Adriankhl force-pushed the fix-multiple-vullkan-driver branch 2 times, most recently from 13b83f6 to 243b5ef Compare May 29, 2024 02:09
@Adriankhl
Copy link
Contributor Author

Adriankhl commented May 29, 2024

@0cc4m Make sense. I get a list of driver names here: https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.2&name=driverName&platform=all

And this is my current priorities, smaller numbers are prioritized

                    std::map<std::string, int> driver_priorities {};
                    driver_priorities["NVIDIA"] = 1;
                    driver_priorities["nvk"] = 2;
                    driver_priorities["radv"] = 3;
                    driver_priorities["AMD open-source driver"] = 4;
                    driver_priorities["AMD proprietary driver"] = 5;
                    driver_priorities["Intel open-source Mesa driver"] = 6;
                    driver_priorities["Intel Corporation"] = 7;

@0cc4m
Copy link
Collaborator

0cc4m commented Jun 2, 2024

That's better, but I would prefer if we avoided the string comparison here. I propose to use the data in VkPhysicalDeviceDriverProperties. We can use the device vendor ID similar to here and afterwards select by the driver ID.

That should simplify the code.

Also, please don't write to stdout. stderr is sometimes okay, but I think it's not necessary for this feature to output anything.

@Adriankhl
Copy link
Contributor Author

@0cc4m Ok, changed it.

I still keep the std::cerr output when GGML_VULKAN_DEBUG is enabled, I think it is good to let the user know which driver is selected.

@Adriankhl Adriankhl force-pushed the fix-multiple-vullkan-driver branch from 5d9d14d to cc612eb Compare June 6, 2024 04:25
@0cc4m
Copy link
Collaborator

0cc4m commented Jun 8, 2024

@0cc4m Ok, changed it.

I still keep the std::cerr output when GGML_VULKAN_DEBUG is enabled, I think it is good to let the user know which driver is selected.

I think informing the user which driver is in use is helpful. But GGML_VULKAN_DEBUG is a very verbose debugging mode for developers, so that's not gonna help with that. Your debug outputs should stay like that I think, but if you want to inform the user you could add this:

diff --git a/ggml-vulkan.cpp b/ggml-vulkan.cpp
index 7a82f9f9..0d8c1cd7 100644
--- a/ggml-vulkan.cpp
+++ b/ggml-vulkan.cpp
@@ -1550,8 +1550,10 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     vk::PhysicalDeviceProperties2 props2;
     vk::PhysicalDeviceMaintenance3Properties props3;
     vk::PhysicalDeviceSubgroupProperties subgroup_props;
+    vk::PhysicalDeviceDriverProperties driver_props;
     props2.pNext = &props3;
     props3.pNext = &subgroup_props;
+    subgroup_props.pNext = &driver_props;
     physical_device.getProperties2(&props2);

     const size_t subgroup_size = subgroup_props.subgroupSize;
@@ -1595,7 +1597,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     fp16 = fp16 && vk12_features.shaderFloat16;

     std::string device_name = props2.properties.deviceName.data();
-    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;
+    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " (" << driver_props.driverName << ") | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;

     if (props2.properties.deviceType == vk::PhysicalDeviceType::eCpu) {
         std::cerr << "ggml_vulkan: Warning: Device type is CPU. This is probably not the device you want." << std::endl;

That way the driver name is added to the device info.

I was able to test the code using AMD's drivers RADV and AMDVLK on Linux and it works fine. Thank you for your contribution.

@Adriankhl Adriankhl force-pushed the fix-multiple-vullkan-driver branch from cc612eb to 4bc3a79 Compare June 8, 2024 13:10
@Adriankhl
Copy link
Contributor Author

Now also added the driver information to ggml_vk_print_gpu_info, ready to merge

@Adriankhl
Copy link
Contributor Author

Wait, NVK doesn't exist in old vulkan sdk, let me see how to fix the CI

@Adriankhl Adriankhl force-pushed the fix-multiple-vullkan-driver branch from 4bc3a79 to 64cf939 Compare June 8, 2024 15:09
@0cc4m
Copy link
Collaborator

0cc4m commented Jun 8, 2024

I can't track your changes if you force push every time. It's okay to have multiple commits, they'll get squashed on merge. Are you still working on it?

@Adriankhl Adriankhl force-pushed the fix-multiple-vullkan-driver branch from 64cf939 to b791f7f Compare June 8, 2024 22:47
@Adriankhl Adriankhl force-pushed the fix-multiple-vullkan-driver branch from b791f7f to e8a6515 Compare June 8, 2024 23:15
@Adriankhl
Copy link
Contributor Author

Adriankhl commented Jun 8, 2024

@0cc4m Sorry, I didn't know it will get squashed. Now the problem is fixed by checking the VK_HEADER_VERSION macro, also I found vk::DriverId enum is there for cpp binding, here is the diff

 static void ggml_vk_print_gpu_info(size_t idx) {
@@ -1550,8 +1567,10 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     vk::PhysicalDeviceProperties2 props2;
     vk::PhysicalDeviceMaintenance3Properties props3;
     vk::PhysicalDeviceSubgroupProperties subgroup_props;
+    vk::PhysicalDeviceDriverProperties driver_props;
     props2.pNext = &props3;
     props3.pNext = &subgroup_props;
+    subgroup_props.pNext = &driver_props;
     physical_device.getProperties2(&props2);

     const size_t subgroup_size = subgroup_props.subgroupSize;
@@ -1595,7 +1614,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     fp16 = fp16 && vk12_features.shaderFloat16;

     std::string device_name = props2.properties.deviceName.data();
-    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;
+    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " (" << driver_props.driverName << ") | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;

@@ -1724,17 +1743,19 @@ void ggml_vk_instance_init() {
                     // Smaller number -> higher priority
                     switch (old_prop.properties.vendorID) {
                         case VK_VENDOR_ID_AMD:
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_MESA_RADV)] = 1;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_AMD_OPEN_SOURCE)] = 2;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_AMD_PROPRIETARY)] = 3;
+                            driver_priorities[vk::DriverId::eMesaRadv] = 1;
+                            driver_priorities[vk::DriverId::eAmdOpenSource] = 2;
+                            driver_priorities[vk::DriverId::eAmdProprietary] = 3;
                             break;
                         case VK_VENDOR_ID_INTEL:
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_INTEL_OPEN_SOURCE_MESA)] = 1;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_INTEL_PROPRIETARY_WINDOWS)] = 2;
+                            driver_priorities[vk::DriverId::eIntelOpenSourceMESA] = 1;
+                            driver_priorities[vk::DriverId::eIntelProprietaryWindows] = 2;
                             break;
                         case VK_VENDOR_ID_NVIDIA:
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_NVIDIA_PROPRIETARY)] = 1;
-                            driver_priorities[static_cast<vk::DriverId>(VkDriverId::VK_DRIVER_ID_MESA_NVK)] = 2;
+                            driver_priorities[vk::DriverId::eNvidiaProprietary] = 1;
+#if defined(VK_API_VERSION_1_3) && VK_HEADER_VERSION >= 235
+                            driver_priorities[vk::DriverId::eMesaNvk] = 2;
+#endif
                             break;
                     }

@0cc4m 0cc4m merged commit 73bac2b into ggerganov:master Jun 11, 2024
53 of 66 checks passed
@richardanaya
Copy link

I believe this PR broke multiple graphics cards for users of vulkan. I have two 7900xtx (exact same card), and now it will not detect two cards. Bug: #7997

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level Vulkan Issues specific to the Vulkan backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants