
Validation Errors compute shaders - vkResetCommandPool, vkDestroyBuffer #2473

Closed
peters-david opened this issue Feb 10, 2022 · 15 comments
Labels
help required We need community help to make this happen. type: bug Something isn't working

Comments

@peters-david (Author) opened this issue Feb 10, 2022

I get a

0 took 5.152398114s
1 took 921.852867ms
MESA-INTEL: error: ../src/intel/vulkan/anv_device.c:3713: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)
thread 'main' panicked at 'Error in Queue::submit: parent device is lost', /home/david/.cargo/git/checkouts/wgpu-53e70f8674b08dd4/6931e57/wgpu/src/backend/direct.rs:231:9

when running this code with "cargo run --release".

https://gist.github.com/peters-david/70a7a7ee6526cb35fe7f7b028cb820f5

Looks like the first two calls to use_gpu() work fine; after that it panics.
Is this intended? If so, why does it work the first two times?
If I remove the lazy_static and request a new wgpu::Instance inside use_gpu(), it works fine.

Sorry if this is a stupid question, just got started with Rust and wgpu.

I run it on an i7-1065G7 with the ICL GT2 on Linux Ubuntu 21.10.

@kvark (Member) commented Feb 11, 2022

Please check that you are using the latest Intel drivers. We've been filing driver issues lately, which are at different stages of being fixed on the Mesa side.

@kvark (Member) commented Feb 11, 2022

Also, please install Vulkan validation layers if you haven't already. I wonder if it spews out any useful info.

@kvark kvark added type: bug Something isn't working help required We need community help to make this happen. labels Feb 11, 2022
@peters-david (Author) commented Feb 12, 2022

The latest drivers are installed.

Vulkan validation shows this:

[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkResetCommandPool-commandPool-00040 (0xb53e2331)]
        Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x55b6783678c0, name = _Transit, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xb53e2331 | Attempt to reset command pool with VkCommandBuffer 0x55b6783678c0[_Transit] which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.2.198.0/linux/1.2-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance]         objects: (type: COMMAND_BUFFER, hndl: 0x55b6783678c0, name: _Transit)
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkResetCommandPool-commandPool-00040 (0xb53e2331)]
        Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x55b6783691c0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xb53e2331 | Attempt to reset command pool with VkCommandBuffer 0x55b6783691c0[] which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.2.198.0/linux/1.2-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance]         command buffers: compute pass
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance]         objects: (type: COMMAND_BUFFER, hndl: 0x55b6783691c0, name: ?)
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkDestroyBuffer-buffer-00922 (0xe4549c11)]
        Validation Error: [ VUID-vkDestroyBuffer-buffer-00922 ] Object 0: handle = 0xe7e6d0000000000f, name = <init_buffer>, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xe4549c11 | Cannot free VkBuffer 0xe7e6d0000000000f[<init_buffer>] that is in use by a command buffer. The Vulkan spec states: All submitted commands that refer to buffer, either directly or via a VkBufferView, must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.198.0/linux/1.2-extensions/vkspec.html#VUID-vkDestroyBuffer-buffer-00922)
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance]         objects: (type: BUFFER, hndl: 0xe7e6d0000000000f, name: <init_buffer>)
[2022-02-12T11:25:41Z INFO  wgpu_core::device] Buffer (1, 1, Vulkan) is dropped
[2022-02-12T11:25:41Z INFO  wgpu_core::device] Buffer (0, 1, Vulkan) is dropped
0 took 5.292182785s
[2022-02-12T11:25:41Z INFO  wgpu_core::device] Created buffer Valid((2, 1, Vulkan)) with BufferDescriptor { label: Some("cpu buffer"), size: 4194304, usage: MAP_READ | COPY_DST, mapped_at_creation: false }
[2022-02-12T11:25:41Z INFO  wgpu_core::device] Created buffer Valid((3, 1, Vulkan)) with BufferDescriptor { label: Some("NN Buffer"), size: 4194304, usage: COPY_SRC | COPY_DST | STORAGE, mapped_at_creation: true }
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkDestroyBuffer-buffer-00922 (0xe4549c11)]
        Validation Error: [ VUID-vkDestroyBuffer-buffer-00922 ] Object 0: handle = 0xcad092000000000d, name = NN Buffer, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xe4549c11 | Cannot free VkBuffer 0xcad092000000000d[NN Buffer] that is in use by a command buffer. The Vulkan spec states: All submitted commands that refer to buffer, either directly or via a VkBufferView, must have completed execution (https://vulkan.lunarg.com/doc/view/1.2.198.0/linux/1.2-extensions/vkspec.html#VUID-vkDestroyBuffer-buffer-00922)
[2022-02-12T11:25:41Z ERROR wgpu_hal::vulkan::instance]         objects: (type: BUFFER, hndl: 0xcad092000000000d, name: NN Buffer)
[2022-02-12T11:25:43Z INFO  wgpu_core::device] Buffer (3, 1, Vulkan) is dropped
[2022-02-12T11:25:43Z INFO  wgpu_core::device] Buffer (2, 1, Vulkan) is dropped
1 took 1.237099087s
[2022-02-12T11:25:43Z INFO  wgpu_core::device] Created buffer Valid((0, 2, Vulkan)) with BufferDescriptor { label: Some("cpu buffer"), size: 4194304, usage: MAP_READ | COPY_DST, mapped_at_creation: false }
[2022-02-12T11:25:43Z INFO  wgpu_core::device] Created buffer Valid((1, 2, Vulkan)) with BufferDescriptor { label: Some("NN Buffer"), size: 4194304, usage: COPY_SRC | COPY_DST | STORAGE, mapped_at_creation: true }
MESA-INTEL: error: ../src/intel/vulkan/anv_device.c:3713: GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST)
thread 'main' panicked at 'Error in Queue::submit: parent device is lost', /home/david/.cargo/git/checkouts/wgpu-53e70f8674b08dd4/6931e57/wgpu/src/backend/direct.rs:231:9

Note: in order to run my example with cargo run, line 44 has to be changed to let load = vec![0; 4096*256].into_boxed_slice();

I tested on another system running Ubuntu 20.04 with an Nvidia M4000, and it works there.
HOWEVER, in another project (can't share the code) I am getting the same errors on the M4000 system, even when re-requesting the wgpu instance.
Looks to me like it is related to compute shaders that take a long time to finish.
I am using the M4000 machine in text mode and only connect via ssh, so it shouldn't be related to some OS-induced timeout.

edit: spelling

@peters-david (Author) commented Feb 12, 2022

In the private project I am getting the same errors as above, but the device isn't lost.
Instead I'm getting an output of all 0s; the same might be happening in #1881.

@peters-david peters-david changed the title GPU hung on one of our command buffers (VK_ERROR_DEVICE_LOST) when using lazy_static Validation Errors compute shaders - vkResetCommandPool, vkDestroyBuffer Feb 12, 2022
@kvark (Member) commented Feb 13, 2022

Not exactly sure what is happening here, but it may very well be related to the fact that our tests also spew errors when run on multiple threads (from just cargo test on Linux). This is worrying, and we should fix it ASAP.

@kvark kvark pinned this issue Feb 13, 2022
@peters-david (Author)

OK, let me know if you need anything more to reproduce this or any additional information.

@kvark (Member) commented Feb 13, 2022

Actually, the errors I was seeing are fixed by #2476; they are unrelated to your case. Moreover, we run the compute test concurrently in cargo test, so it should be exercising similar paths. Does cargo test run well for you?

@peters-david (Author) commented Feb 13, 2022

On the ICL GT2 the tests seem to pass, although I'm getting validation errors.
icl_gt2.txt

On the M4000, the conservative_raster test fails.
m4000.txt

@peters-david (Author)

On your vk-astc branch all tests pass with no validation errors on the ICL GT2.
On the M4000, conservative_raster still fails on vk-astc.

@kvark (Member) commented Feb 14, 2022

OK, thank you for confirming! Would you be able to push your test case to a branch of wgpu somewhere, so that we can test it?

@peters-david (Author)

You can find it at https://github.com/peters-david/wgpu
The test is in wgpu/tests.

@kvark (Member) commented Feb 15, 2022

Thanks @peters-david! I ran your test on Intel Xe graphics (integrated).
First thing:

---- long_running_shader stdout ----
0
thread 'long_running_shader' panicked at 'assertion failed: `(left == right)`
  left: `100000`,
 right: `0`', wgpu/tests/long_running_shader.rs:120:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This is due to a problem with your shader logic, which doesn't actually write all the indices:

@builtin(global_invocation_id) id: vec3<u32>
@builtin(local_invocation_index) index: u32
var i: u32 = 256u * id.x + index;

For thread (1,0,0) within the only workgroup, we'll have id.x == 1 and index == 1, so i == 257.

Once I fix this, the test just runs indefinitely (as you designed). No validation errors or warnings or asserts.
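For anyone following along, kvark's arithmetic can be checked with a small CPU-side sketch. This is an illustration, not the actual test: the single workgroup of 256 threads and the array length are assumptions taken from the discussion above.

```rust
// CPU-side simulation of the buggy index formula `256u * id.x + index`,
// assuming one workgroup of 256 threads writing into an output array.
fn covered_slots(workgroup_size: u32, len: usize) -> usize {
    let mut written = vec![false; len];
    // With a single workgroup, global_invocation_id.x equals
    // local_invocation_index for every thread.
    for x in 0..workgroup_size {
        let (id_x, index) = (x, x);
        let i = (workgroup_size * id_x + index) as usize; // buggy formula
        if i < len {
            written[i] = true;
        }
    }
    written.iter().filter(|&&w| w).count()
}

fn main() {
    // Thread 0 writes slot 0; thread 1 already computes i == 257, which is
    // out of bounds for a 256-element array, so only one slot is written.
    println!("slots written: {}", covered_slots(256, 256));
}
```

The assertion in the test output above (left: 100000, right: 0) follows directly: any element past index 0 keeps its initial value.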

@peters-david (Author) commented Feb 15, 2022

The indices don't start at 0??

@kvark (Member) commented Feb 15, 2022

Indices start at 0, and your array[0] will be initialized. But array[1] will never be, which triggers the assertion I posted.

@peters-david (Author)

Thank you @kvark!
I misunderstood the relationship between workgroups and invocations, and the difference between local & global indices.
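For later readers, the relationship the thread converges on is that global_invocation_id.x already combines the workgroup id and the local index: global_invocation_id.x == workgroup_id.x * workgroup_size_x + local_invocation_id.x. A small CPU-side sketch (the 4-workgroup dispatch and workgroup size of 256 are assumed purely for the example) shows that indexing by the global id alone covers every element exactly once:

```rust
// Enumerate the global invocation ids for a hypothetical dispatch of
// `workgroups` workgroups of `workgroup_size` threads along x.
fn global_ids(workgroups: u32, workgroup_size: u32) -> Vec<u32> {
    let mut ids = Vec::new();
    for wg in 0..workgroups {
        for local in 0..workgroup_size {
            // global_invocation_id.x = workgroup_id.x * size + local id
            ids.push(wg * workgroup_size + local);
        }
    }
    ids
}

fn main() {
    let ids = global_ids(4, 256);
    // Every index in 0..1024 appears exactly once: no gaps, no duplicates,
    // which is why `var i = id.x;` is enough in the shader.
    assert_eq!(ids, (0..1024).collect::<Vec<u32>>());
    println!("4 workgroups x 256 threads cover indices 0..{}", ids.len());
}
```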


2 participants