Accurate event for when a swapchain image is visible on screen #370
Comments
Hi @haasn, what you are asking for is very reasonable. Unfortunately, we don't have a solution for you at this time. Khronos is working on it, but I'm sorry to say that there's no estimated time for when we'll be shipping a solution. I'd like to understand what you want, to compare it with other requests we've received. Is your goal to call vkQueuePresentKHR() and then be able to find out when the image(s) are actually presented? Something different? Something additional? To clarify, your description makes it sound like you are using the same semaphore for multiple purposes, which is not correct. Was that just to make it easier for you to describe? Thanks for your input/feedback! |
My ultimate goal is to keep audio and video playback synchronized while minimizing glitches due to repeated or dropped frames, which requires measuring 1. display refresh rate, and 2. frame skips. In the case of 1., I do not want to rely on the EDID information or “reported” display refresh rate alone, but I want to measure it in realtime, since these can be both inconsistent and subtly different. In the case of 2., I need to know when I've dropped a vsync due to rendering too slowly. For example, imagine a program which uses a swapchain of size 4, acquires 4 images, submits 4 draw calls, and 4 present calls. It's worth noting that the approaches I have already outlined (waiting on an event which the end of my rendering command emits, and waiting on a fence signalling the next image was acquired) both cover my needs already, in the limit. The only complaint I have about them is that the timing gets thrown off near the beginning of playback, which I'm trying to minimize since it can throw off stuff like averaging filters for the duration of the averaging window. If I had to design an API for this myself, I would loosely suggest the following:
Of the two, 2. might be the more powerful approach, since it solves a number of problems:
It also requires no changes to existing API calls. So all things considered, that's the approach I'd be happiest with, I think.
I wrote this post before having a solid understanding of the rules for semaphore use and ordering. You're right in that you'd usually use a pair of semaphores for each image in a swapchain. That said, I think you could re-use the same semaphore for both directions as long as it's done on the same VkQueue. Either way, I don't think the distinction is meaningful for this problem. |
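To make the fence-on-acquire workaround above concrete, here is a minimal sketch (the helper name and the timing logic are illustrative, not from this thread; error handling is omitted): timestamp the moment each acquired image is handed back to the application and use the deltas as a refresh-rate estimate.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>
#include <time.h>

/* Illustrative sketch of the "fence on vkAcquireNextImageKHR" workaround:
 * timestamp the moment each swapchain image becomes available again and
 * derive an estimate of the effective refresh rate from the deltas. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

double measure_acquire_interval(VkDevice dev, VkSwapchainKHR swapchain,
                                VkSemaphore acquire_sem, VkFence acquire_fence,
                                double *last_acquire_time)
{
    uint32_t index;
    vkResetFences(dev, 1, &acquire_fence);
    vkAcquireNextImageKHR(dev, swapchain, UINT64_MAX, acquire_sem,
                          acquire_fence, &index);

    /* Block until the image is actually handed back to the application. */
    vkWaitForFences(dev, 1, &acquire_fence, VK_TRUE, UINT64_MAX);

    double t = now_sec();
    double delta = t - *last_acquire_time; /* ~vsync period in steady state */
    *last_acquire_time = t;
    return delta;
}
```

As noted above, with a FIFO swapchain these deltas only settle at the true vsync period once the presentation queue is full, so the first few measurements are skewed.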
@haasn, thanks for your input! It makes sense and will help us design a good solution.
Yes, it was orthogonal to the main topic (just an FYI). |
Without knowing much about Vulkan, I think such an API should provide the following mechanisms:
Here are some links to other display APIs, which try to deal with this, for better or worse, in no specific order: |
VK_GOOGLE_display_timing covers nearly all of that already... |
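For reference, a rough sketch of what VK_GOOGLE_display_timing exposes, assuming the extension is enabled on the device (function-pointer loading shown inline; names and buffer sizes are illustrative):

```c
#include <vulkan/vulkan.h>
#include <stdint.h>
#include <stdio.h>

/* Query the display's refresh cycle duration and the actual presentation
 * times of previously presented frames via VK_GOOGLE_display_timing. */
void query_google_display_timing(VkDevice dev, VkSwapchainKHR swapchain)
{
    PFN_vkGetRefreshCycleDurationGOOGLE pGetRefresh =
        (PFN_vkGetRefreshCycleDurationGOOGLE)
            vkGetDeviceProcAddr(dev, "vkGetRefreshCycleDurationGOOGLE");
    PFN_vkGetPastPresentationTimingGOOGLE pGetPastTiming =
        (PFN_vkGetPastPresentationTimingGOOGLE)
            vkGetDeviceProcAddr(dev, "vkGetPastPresentationTimingGOOGLE");

    /* Nominal refresh period of the display, in nanoseconds. */
    VkRefreshCycleDurationGOOGLE refresh;
    pGetRefresh(dev, swapchain, &refresh);
    printf("refresh period: %llu ns\n",
           (unsigned long long)refresh.refreshDuration);

    /* When previously presented images actually reached the screen. */
    uint32_t count = 0;
    pGetPastTiming(dev, swapchain, &count, NULL);
    VkPastPresentationTimingGOOGLE timings[16];
    if (count > 16)
        count = 16;
    pGetPastTiming(dev, swapchain, &count, timings);
    for (uint32_t i = 0; i < count; i++)
        printf("presentID %u actually presented at %llu ns\n",
               timings[i].presentID,
               (unsigned long long)timings[i].actualPresentTime);
}
```

The same extension also lets an application request a desired presentation time per frame by chaining VkPresentTimesInfoGOOGLE into VkPresentInfoKHR.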
It's worth pointing out that More interestingly, nvidia has added support for |
Upon re-approaching this problem, I noticed that this is not just a requirement for “accurate” vsync timing the way mpv does it - this is in fact a basic requirement for simply metering rendering to the display rate at all. (i.e. implementing vsync). The vulkan samples I'm looking at (e.g. cube.c from LunarG/VulkanSamples) seem to essentially do this:
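For concreteness, a condensed sketch of that loop (not a verbatim excerpt from cube.c; the Vulkan objects are assumed to have been created during setup, and error handling is omitted):

```c
/* Condensed sketch of the cube.c-style frame loop being described. Note that
 * draw_fence is signaled when the *rendering* commands complete; nothing here
 * says when the presented image actually becomes visible on screen. */
uint32_t index;
vkWaitForFences(dev, 1, &draw_fence, VK_TRUE, UINT64_MAX);
vkResetFences(dev, 1, &draw_fence);

/* Blocks only if every swapchain image is still queued for presentation. */
vkAcquireNextImageKHR(dev, swapchain, UINT64_MAX, image_acquired_sem,
                      VK_NULL_HANDLE, &index);

VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &image_acquired_sem,
    .pWaitDstStageMask = &wait_stage,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmd_buffers[index],
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &render_done_sem,
};
vkQueueSubmit(queue, 1, &submit, draw_fence); /* fence != presentation time */

VkPresentInfoKHR present = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &render_done_sem,
    .swapchainCount = 1,
    .pSwapchains = &swapchain,
    .pImageIndices = &index,
};
vkQueuePresentKHR(queue, &present); /* no fence or timestamp available here */
```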
But this appears to have a rather serious bug: it only waits on the vkQueueSubmit to complete, not on the actual vkQueuePresentKHR. So if you imagine a GPU that renders a cube at 1000 fps, the fences would all fire 1ms after the corresponding vkQueueSubmit, and thus the only factor metering rendering speed here is the implicit assumption that vkAcquireNextImageKHR will block if the available swapchain images are all stuck in the presentation queue (due to the If even demo applications like cube.c seem to get this wrong, then I'm at a complete loss for what Khronos expects the correct behavior to look like. It seems like changing this at the source requires vkQueuePresentKHR to signal a fence either once the frame leaves the frame queue (and becomes the active front-buffer), or once it's done being (and the contents have effectively been fully sent to the GPU). Alternatively, vkQueuePresentKHR could be redesigned to be part of a command buffer - so you could |
It also seems like VK_EXT_display_control may not be as good a solution for this problem as I had originally anticipated: It requires a VkDisplayKHR, which I can't necessarily easily figure out. (Shouldn't the VkSurface have this information?) It also has a very, very awkward design. (For some reason it seems to violate vulkan API conventions by requiring that the pAllocator be non-NULL. I don't have a custom allocator, though; can't it just use malloc like everything else? Or is that because it expects me to allocate a new fence for every vsync? Why can't it just re-use the same fence like literally every other command?) |
@haasn, I agree the first pixel event in VK_EXT_display_control is not a good solution for this problem. It simply generates a signal when the next vblank occurs. That doesn't necessarily correspond to when any prior-submitted presentation command completes. I agree the Google display timing spec is a closer match to your needs, but it is unlikely we will ever implement it outside of Android. Its semantics don't align with the capabilities available to us across other operating systems. We'll continue to work on a general solution for this problem within the Khronos working groups. As @ianelliottus mentioned, we're aware it's a sorely needed bit of functionality missing from the current specs. There's nothing special about the allocator requirements of the functions in VK_EXT_display_control. They will fall back to the system allocator if pAllocator is NULL. If you're seeing issues with that, let me know, and ideally provide some code snippets illustrating the problem. This would be a bug. Yes, the notifications in VK_EXT_display_control require using VK_KHR_display. Note that doesn't mean you need to be using a swapchain that presents to a VK_KHR_display. VK_KHR_display just allows enumerating displays, and VK_EXT_display_control lets you wait for events on those displays. You'd need some way to figure out which display your window system is using for the swapchains presenting to it, though, in order to correlate the events back to your presentation commands. You could do this on X11 with the RANDR correlation function provided in VK_EXT_acquire_xlib_display. I'm not aware of definitive solutions available for other platforms at the moment, but you could compare display names with some native API to make an educated guess. Yes, a new fence does need to be allocated for each vblank. This design choice of creating a fence when requesting the events was made because these fences were different enough from regular fences that we would essentially have to do the equivalent of re-creating the fence anyway within the driver to convert an existing fence into a vblank event, and I needed to ensure they were not shareable using the new fence export extensions. The need to create a new fence every time was a side effect of that. In retrospect, I wish I'd created a new object type entirely to handle these notifications, and allowed them to be reusable. If there's ever a KHR version of this functionality, that's likely the direction I'll recommend. |
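A minimal sketch of the mechanism described here, assuming VK_EXT_display_control is enabled and a VkDisplayKHR for the relevant output has already been obtained (e.g. via VK_KHR_display plus the RANDR correlation mentioned above):

```c
/* Register a "first pixel out" display event. A fresh fence is created for
 * every event, as discussed; it signals on the next vblank after
 * registration, which does not necessarily correspond to any particular
 * queued present becoming visible. */
PFN_vkRegisterDisplayEventEXT pRegisterDisplayEvent =
    (PFN_vkRegisterDisplayEventEXT)
        vkGetDeviceProcAddr(dev, "vkRegisterDisplayEventEXT");

VkDisplayEventInfoEXT event_info = {
    .sType = VK_STRUCTURE_TYPE_DISPLAY_EVENT_INFO_EXT,
    .displayEvent = VK_DISPLAY_EVENT_TYPE_FIRST_PIXEL_OUT_EXT,
};
VkFence vblank_fence;
pRegisterDisplayEvent(dev, display, &event_info, NULL, &vblank_fence);

/* Wait for the vblank, then discard the one-shot fence. */
vkWaitForFences(dev, 1, &vblank_fence, VK_TRUE, UINT64_MAX);
vkDestroyFence(dev, vblank_fence, NULL);
```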
From the spec:
This goes against the convention of most other pAllocator functions, which all state:
So it's actually a spec-documented deviation, not an implementation bug. The validation layers also confirm this:
(But perhaps it's a bug in the specification) |
That is indeed a bug in the spec. Thanks for pointing it out. I'll get it fixed. |
This time based on RA. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The entire thing depends on VK_NV_glsl_shader, which is a god-awful nvidia-exclusive hack that barely works and is held together with duct tape and prayers. Long-term, we really, REALLY need to figure out a way to use a GLSL->SPIR-V middleware like glslang. The problem with glslang in particular is that it's a gigantic pile of awful, but maybe time will help here.. 2. We don't use async transfer at all. This is very difficult, but doable in theory with the newer design. Would require refactoring vk_cmdpool slightly, and also expanding ra_vk.active_cmd to include commands on the async queue as well. Also, async compute is pretty much impossible to benefit from because we need to pingpong with serial dependencies anyway. (Sorry AMD users, you fell for the async compute meme) 3. Lots of resource deallocation callbacks are thread-safe (because the vulkan device itself is, and once we've added a free callback we're pretty much guaranteed to never use that resource again from within mpv). As such, we could call those cleanup callbacks from a different thread. This would make stuff slightly more responsive when deallocating lots of resources at once. (e.g. resizing swapchain) 4. The custom memory allocator is pretty naive. It's prone to under-allocating memory, allocation thrashing, freeing slabs too aggressively, and general slowness due to allocating from the same thread. In addition to making it smarter, we should also make it multi-threaded: ideally it would free slabs from a different thread, and also pre-allocate slabs from a different thread if it reaches some critical "low" threshold on the amount of available bytes. (Perhaps relative to the current heap size). These limitations manifest themselves as occasional choppy performance when changing the window size. 5. The swapchain code and ANGLE's swapchain code could share common options somehow. Left away for now because I don't want to deal with that headache for the time being. 6. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. KhronosGroup/Vulkan-Docs#370)
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. KhronosGroup/Vulkan-Docs#370) 2. The memory allocator could be improved. (This is a universal constant) 3. Use async compute on supported devices. 4. Could/should use sub-command buffers instead of semaphores/switching for stuff involving multiple queue families. 5. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows.
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. KhronosGroup/Vulkan-Docs#370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows.
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. KhronosGroup/Vulkan-Docs#370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. vo_gpu: vulkan: implement ra_vk_ctx.depth Also moved the depth querying for vo_gpu from preinit to resize, since it was a tiny bit more convenient. (And in theory, it could change during runtime anyway) This only affects a calculation in the dither code path anyway.
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. KhronosGroup/Vulkan-Docs#370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential.
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. KhronosGroup/Vulkan-Docs#370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential. NOTE: For safety, all use of asynchronous commands / multiple command pools is currently disabled completely. There are some left-over relics of this in the code (e.g. the distinction between dev_poll and pool_poll), but that is kept in place mostly because this will be re-extended in the future (vulkan rev 2). The queue count is also currently capped to 1, because of the lack of cross-frame semaphores means we need the implicit synchronization from the same-queue semantics to guarantee a correct result.
It's been over a year. Has there been any progress on this issue? (Also, sorry for the commit spam) One thing I tried doing to solve this bug in practice (if not in theory) is to use the time that I had a closer look at the (now renamed)
That said, in theory it might be possible to combine the vblank event with the swapchain counters, in the following manner:
But this does not seem like a clean solution, nor do I know how well it extends to e.g. mailbox-style swapchains. Things that could help me include:
|
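For illustration, the swapchain-counter building block that such a combination would lean on might look roughly like this. This is my own sketch of VK_EXT_display_control's counter query, not necessarily the exact scheme meant above, and it assumes the counter was enabled when the swapchain was created:

```c
/* The swapchain must have been created with the vblank counter enabled by
 * chaining this into VkSwapchainCreateInfoKHR:
 *
 *   VkSwapchainCounterCreateInfoEXT counter_info = {
 *       .sType = VK_STRUCTURE_TYPE_SWAPCHAIN_COUNTER_CREATE_INFO_EXT,
 *       .surfaceCounters = VK_SURFACE_COUNTER_VBLANK_EXT,
 *   };
 *
 * Sampling the counter before presenting and again after a vblank-event
 * fence fires would, in principle, tell you how many refreshes elapsed. */
PFN_vkGetSwapchainCounterEXT pGetSwapchainCounter =
    (PFN_vkGetSwapchainCounterEXT)
        vkGetDeviceProcAddr(dev, "vkGetSwapchainCounterEXT");

uint64_t vblanks = 0;
pGetSwapchainCounter(dev, swapchain, VK_SURFACE_COUNTER_VBLANK_EXT, &vblanks);
```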
We are working on this in Khronos, and hope to have a new extension that solves this in the early part of next year. |
Any progress to report on this? |
It looks like the newly announced 0.9 provisional OpenXR spec has the required features for proper frame timing. It seems a little odd that vulkan applications without AR/VR have to integrate with OpenXR, but whatever.
Since OpenXR seems to have endorsements from AMD, NVIDIA, Intel, and Microsoft, it seems to be the most likely way forward that will actually be implemented unless Khronos is planning to announce a Vulkan specific extension to duplicate this functionality. |
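For comparison, the OpenXR frame-timing handshake referred to here centers on xrWaitFrame, which reports a predicted display time and period for the upcoming frame (a sketch, assuming a running XrSession):

```c
#include <openxr/openxr.h>

/* xrWaitFrame blocks until the runtime wants the next frame and reports when
 * that frame is predicted to be displayed, plus the display period. */
XrFrameWaitInfo wait_info = { .type = XR_TYPE_FRAME_WAIT_INFO };
XrFrameState frame_state = { .type = XR_TYPE_FRAME_STATE };
xrWaitFrame(session, &wait_info, &frame_state);

/* Timestamps/durations in nanoseconds on the runtime's clock. */
XrTime predicted_display_time = frame_state.predictedDisplayTime;
XrDuration display_period = frame_state.predictedDisplayPeriod;
```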
3-4 years later, Vulkan present types are still a joke, VSync timing isn't a thing, and the only way to mitigate this is by abusing mailbox plus overpowered hardware resources to bruteforce the timing. What the hell is going on? |
Has there been any more progress on this @ianelliottus @cubanismo? Or is there a different recommended way to get the functionality of I've seen the need for this pop up and be mentioned for quite a while now but doesn't seem like the needle has moved sadly 😢 |
Yes, please refer to #1364 for a solution; this thread is dead and can be closed. |
please refer to #1364 for solution |
Would it help with safely destroying the semaphore awaited by vkQueuePresentKHR (in a situation when a full vkQueueWaitIdle or vkDeviceWaitIdle is overkill), or is it limited to just querying time intervals? |
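For context, and assuming the referenced issue is about the VK_KHR_present_id / VK_KHR_present_wait pair (my assumption, not confirmed here), the usage pattern those extensions provide looks roughly like this:

```c
/* Tag each present with an ID, then block until that specific present is
 * visible on screen. Assumes both extensions are enabled; object names are
 * illustrative. */
PFN_vkWaitForPresentKHR pWaitForPresent =
    (PFN_vkWaitForPresentKHR)vkGetDeviceProcAddr(dev, "vkWaitForPresentKHR");

uint64_t present_id = 1; /* incremented per frame */
VkPresentIdKHR id_info = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_ID_KHR,
    .swapchainCount = 1,
    .pPresentIds = &present_id,
};
VkPresentInfoKHR present = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .pNext = &id_info,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &render_done_sem,
    .swapchainCount = 1,
    .pSwapchains = &swapchain,
    .pImageIndices = &index,
};
vkQueuePresentKHR(queue, &present);

/* Returns once the present tagged with present_id is visible (or on timeout). */
pWaitForPresent(dev, swapchain, present_id, UINT64_MAX);
```

This addresses the "when was it actually presented" question; whether it also helps with safely destroying the wait semaphore is what the replies below discuss.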
@Triang3l It is broken. As per #152 not even The extension does add another way to infer the semaphore state. Nevertheless, it is something that should be fixed in core 1.0 and not by usage of extensions. Besides, busywaiting on It would also be getting onto thin ice a bit. |
There is an update, long in coming, to address these scenarios. Specifically semaphore states for one. Its priority |
Oh, nice, thank you for helping in resolving this confusing part! What is the current "industry standard" solution to this issue, by the way? Acquiring all images (not sure if that's possible for the mailbox mode) and awaiting all fences? Full WaitIdle? Or would just destroying the swapchain before the semaphores be enough (or is vkDestroySwapchainKHR also affected by this lack of a fence, and doesn't have implicit lifetime tracking)? |
I see no way currently to figure out when a swapchain image is actually visible on the screen.
Imagine an application which needs 4ms to execute a draw call and is running on a 16ms vsync display. Here's what a timeline could look like (correct me if I'm wrong), supposing that we start the application immediately after a vsync has already happened.
After this batch of setup, the following things happen:
3. t=0ms: the semaphore is signaled right away, and (optionally) the fence is triggered indicating that the image acquired in step 2 is available for use. The semaphore being signalled allows the draw command to start
4. t=4ms: The draw command finishes, and signals the semaphore again. This allows the GPU to start using the image for presentation (removing the signal). But it is not visible yet, because the next page flip has not yet occurred
5. t=16ms: The GPU flips pages and actually starts displaying the screen.
... at this point it is assumed that the application also does whatever is necessary for drawing the next frame
6. t=32ms: The GPU flips pages again and stops using the surface (signalling the semaphore). Assume it takes 1ms for the image to get freed up and be reusable again
7. t=33ms: The application would be able to acquire the image again (i.e. triggering the fence)
To summarize it, on the CPU side of things I can get accurate information about the following points in time:
But I can't seem to get any reliable information about t=16ms, i.e. when the frame I just submitted is actually visible. This is important to me because I need to measure display latency and effective refresh rate accurately.
The problem gets worse if I use a large swapchain. For example, suppose my swapchain is size 4.
In the first world, i.e. where I wait on the fence indicating that the image is ready for use again, I would measure differences in frame times something like this:
...
In the second world, i.e. where I trigger an event once I've finished rendering and wait on that to complete, I would measure frame times like this:
...
Basically, they all converge to the true vsync timing (16ms) in the limit, but the measurements at the start will always be off since the GPU can already acquire the next image and/or render to it well in advance of when it will actually be used.
How do you advise accomplishing what I want? (Measuring the real delay between submitting a frame and it being visible on screen)