Developer guidelines
This page describes the behaviour of DXVK in various D3D11 workloads, and may be useful for developers targeting the Steam Deck. In general, most IHV recommendations and good practices apply to our implementation as well, but performance characteristics may differ in practice.
Note that this page is written for DXVK 2.6 and later. Older versions may behave differently.
- During loading screens or in the game menu, compile all shaders that will be used during gameplay (that is, call `ID3D11Device::Create*Shader` for them). DXVK will start creating Vulkan pipeline libraries or complete Vulkan pipelines in the background. Compiling shaders "on demand", i.e. just before they are first used in a draw, will cause stutters that are more severe than on native D3D11 drivers.
- Prefer `StructuredBuffer<...>` and `RWStructuredBuffer<...>` over typed buffer views when format conversion is not strictly required. Structured buffers are more efficient in our implementation, and allow Vulkan drivers to use more efficient memory access instructions.
- Calling `ID3D11Device::Create*Shader` from a single thread is usually sufficient. DXVK will perform the time-consuming parts of shader compilation on dedicated worker threads.
- Prefer compute shaders over full-screen triangles for post-processing whenever possible. This allows commands that have no data dependencies to overlap execution on the GPU.
- Avoid stream output. DXVK supports this feature, but it is significantly less efficient than compute shaders on our implementation and will lead to under-utilization of the GPU.
- Do not use class linkage. DXVK does not support this feature.
- In shader code, avoid unrolling large loops or loops with high iteration counts. Excessive unrolling increases compile times and makes it harder for Vulkan drivers to optimize.
- Prefer initializing read-only resources during creation via `D3D11_SUBRESOURCE_DATA` rather than using `UpdateSubresource` after creation. This may avoid a redundant clear and allows us to execute the upload on a dedicated transfer queue.
- Respect the amount of video memory available on the device, within reason. Performance will suffer when the VRAM budget is exhausted.
  - Ideally, use `IDXGIAdapter3::QueryVideoMemoryInfo` to query the available memory budget, as this will also account for driver-internal allocations and system software.
  - DXVK will defragment its internal memory allocations under memory pressure, but currently does not migrate resources in or out of VRAM. Vulkan drivers may page out allocations that contain render targets, which can lead to catastrophic performance issues on dedicated GPUs.
- Prefer strongly typed image formats for render targets and other high-bandwidth images. Using `_TYPELESS` formats may negatively affect GPU performance on some hardware, especially if `D3D11_BIND_UNORDERED_ACCESS` is set.
- When updating multiple different regions of the same image subresource via `UpdateSubresource` or `CopySubresourceRegion`, e.g. for virtual texturing, it is optimal for the region size to be a power of two, and the offset to be a multiple of the given size.
  - DXVK uses Morton codes to track hazards in image copies, which may cause over-synchronization in some cases.
  - As an example, when updating three 64x64 blocks of a larger texture atlas at offsets (64,128), (64,192) and (192,320), all updates can be processed in parallel, but when updating two 64x64 blocks at e.g. (8,1) and (80,5), DXVK will insert a barrier even though the two regions do not overlap in memory.
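The alignment rule for update regions can be checked on the application side before issuing copies. The following is a minimal sketch in plain C++; `UpdateRegion`, `isAlignedUpdateRegion` and `alignDown` are hypothetical helper names and not part of D3D11 or DXVK. It only encodes the rule stated above (power-of-two size, offset a multiple of the size) and makes no claim about how DXVK tracks hazards internally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a 2D update region, in texels.
struct UpdateRegion {
  uint32_t x, y;          // offset of the region within the subresource
  uint32_t width, height; // size of the region
};

constexpr bool isPowerOfTwo(uint32_t v) {
  return v != 0 && (v & (v - 1)) == 0;
}

// Returns true if the region follows the guideline: the size is a power
// of two and the offset is a multiple of the size in each dimension.
constexpr bool isAlignedUpdateRegion(const UpdateRegion& r) {
  return isPowerOfTwo(r.width) && isPowerOfTwo(r.height)
      && r.x % r.width == 0 && r.y % r.height == 0;
}

// Snaps an offset down to the previous multiple of a power-of-two size,
// e.g. to pick the tile that contains a given texel.
inline uint32_t alignDown(uint32_t offset, uint32_t size) {
  assert(isPowerOfTwo(size));
  return offset & ~(size - 1);
}
```

With the offsets from the example above, the blocks at (64,128), (64,192) and (192,320) pass this check, while the blocks at (8,1) and (80,5) do not.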
- Use `*SetConstantBuffers1` to bind sub-ranges of a larger constant buffer. This is by far the fastest path on our implementation, and works especially well on Deferred Contexts. Ideally, only map the buffer with `MAP_WRITE_DISCARD` a few times per frame and write as much data as possible at once, but if that is not viable, using `MAP_WRITE_NO_OVERWRITE` between draws is still good.
- When updating constant buffers, `MAP_WRITE_DISCARD` and `UpdateSubresource` have similar performance characteristics if the entire buffer is written in both cases.
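Binding sub-ranges with `*SetConstantBuffers1` requires the application to hand out ranges of one large buffer. D3D11.1 expresses these ranges in 16-byte constants and requires both the offset and the count to be multiples of 16 constants (256 bytes). A minimal sketch of such a sub-allocator follows; `ConstantBufferAllocator` and `ConstantRange` are hypothetical names, and the sketch only computes the values that would be passed as `pFirstConstant` and `pNumConstants`, without touching any D3D11 object.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kBytesPerConstant  = 16; // one constant is a 4x32-bit vector
constexpr uint32_t kConstantAlignment = 16; // offsets and counts: multiples of 16 constants

// Values for the pFirstConstant / pNumConstants arrays of *SetConstantBuffers1.
struct ConstantRange {
  uint32_t firstConstant;
  uint32_t numConstants;
};

// Hypothetical linear allocator over one large constant buffer.
class ConstantBufferAllocator {
public:
  explicit ConstantBufferAllocator(uint32_t capacityInBytes)
  : m_capacity(capacityInBytes / kBytesPerConstant) { }

  // Call after mapping the buffer with MAP_WRITE_DISCARD.
  void reset() { m_next = 0; }

  // Reserves a range large enough for sizeInBytes. Returns false if the
  // buffer is exhausted; the caller would then discard the buffer and reset.
  bool allocate(uint32_t sizeInBytes, ConstantRange& range) {
    assert(sizeInBytes > 0);
    uint32_t constants = (sizeInBytes + kBytesPerConstant - 1) / kBytesPerConstant;
    uint32_t aligned = (constants + kConstantAlignment - 1)
                     / kConstantAlignment * kConstantAlignment;
    if (aligned > m_capacity - m_next)
      return false;
    range = { m_next, aligned };
    m_next += aligned;
    return true;
  }

  // Byte offset into the mapped pointer at which to write the data.
  static uint32_t byteOffset(const ConstantRange& range) {
    return range.firstConstant * kBytesPerConstant;
  }

private:
  uint32_t m_capacity;   // capacity in constants
  uint32_t m_next = 0;   // next free constant
};
```

Ranges handed out between two calls to `reset` never overlap, so data for later draws can be written with `MAP_WRITE_NO_OVERWRITE` while earlier draws are still in flight.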
- Suballocate from large index and vertex buffers and use the `StartIndexLocation` and `BaseVertexLocation` draw parameters to avoid overhead from frequent re-binding. Re-binding a large number of vertex buffers on every draw call can quickly become the primary CPU bottleneck.
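One way to organize such suballocation is to append every mesh to a shared vertex array and a shared index array and remember where it landed. The sketch below shows the bookkeeping only; `GeometryPool`, `MeshRange` and the `Vertex` layout are hypothetical, and creating the actual D3D11 buffers from the two arrays is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed vertex layout, for illustration only.
struct Vertex {
  float position[3];
  float texcoord[2];
};

// Parameters for DrawIndexed(IndexCount, StartIndexLocation, BaseVertexLocation).
struct MeshRange {
  uint32_t indexCount;
  uint32_t startIndexLocation;
  int32_t  baseVertexLocation;
};

// Hypothetical CPU-side pool that packs many meshes into one vertex array
// and one index array, to be uploaded into one buffer each.
class GeometryPool {
public:
  MeshRange add(const std::vector<Vertex>& vertices,
                const std::vector<uint32_t>& indices) {
    // Indices stay relative to the mesh's own first vertex, since
    // BaseVertexLocation is added to each index at draw time.
    for (uint32_t index : indices)
      assert(index < vertices.size());

    MeshRange range;
    range.indexCount         = static_cast<uint32_t>(indices.size());
    range.startIndexLocation = static_cast<uint32_t>(m_indices.size());
    range.baseVertexLocation = static_cast<int32_t>(m_vertices.size());

    m_vertices.insert(m_vertices.end(), vertices.begin(), vertices.end());
    m_indices.insert(m_indices.end(), indices.begin(), indices.end());
    return range;
  }

  const std::vector<Vertex>&   vertices() const { return m_vertices; }
  const std::vector<uint32_t>& indices()  const { return m_indices; }

private:
  std::vector<Vertex>   m_vertices;
  std::vector<uint32_t> m_indices;
};
```

At draw time, the vertex and index buffers are bound once and each mesh is drawn with the three values stored in its `MeshRange`.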
- For mapped resources created with `D3D11_USAGE_DYNAMIC`, always write full cache lines for optimal CPU performance. DXVK will generally allocate these resources in host-visible video memory, especially on systems with Resizeable BAR enabled.
- Never perform a CPU read from a resource that was created without the `D3D11_CPU_ACCESS_READ` flag. In the worst case, these resources are allocated in VRAM and have to be read back over PCI-E, which can very quickly become an extreme bottleneck. This also applies to inefficient write patterns such as keeping multiple partially written cache lines in flight.
- Do not use large textures with `D3D11_USAGE_DYNAMIC` and non-zero bind flags. Both GPU and CPU access to these resources will be inefficient.
  - Prefer `D3D11_USAGE_DEFAULT` textures and use `UpdateSubresource` for updates.
  - Textures with no bind flags as well as `D3D11_USAGE_STAGING` resources are unaffected by this.
- Avoid mapping textures with `D3D11_USAGE_DEFAULT`.
  - We either use staging buffers or linear images to implement this, both of which have drawbacks.
  - Prefer `UpdateSubresource` for texture updates.
  - Prefer a separate `D3D11_USAGE_STAGING` resource for CPU read-back.
- Do not create textures with `DXGI_FORMAT_R32G32B32_*`. Vulkan driver support for these formats is limited. Note: This does not apply to using these formats for vertex buffers or buffer views.
- Avoid `MAP_WRITE` on staging resources that are still in use by the GPU. DXVK will try to avoid stalls by creating a local copy of the resource, but this comes at the cost of both CPU performance and memory usage.
- Consider calling `Flush` after writing one or more staging resources that will be read back on the CPU, or after issuing a sequence of queries. This is not strictly required, and DXVK has heuristics to submit shortly after such an operation anyway. However, explicitly calling `Flush` may lead to more consistent read-back latency and more efficient submission patterns, especially on tiling GPUs.
  - Avoid calling `Flush` more than 2-3 times per frame.
- Use the `StartIndexLocation` and `BaseVertexLocation` parameters for draw calls instead of re-binding the same vertex or index buffers with different offsets.
- Clear or discard (using `DiscardView`) render targets when binding them for rendering for the first time within a frame, if the previous contents are no longer needed. This saves CPU work and may enable some driver optimizations compared to clearing at a different time within the frame. Prefer the following pattern:

      context->OMSetRenderTargets(n, rtvs, dsv);
      context->ClearDepthStencilView(dsv, ...);
      context->ClearRenderTargetView(rtvs[0], ...);
      context->ClearRenderTargetView(rtvs[1], ...);
      ...
      context->Draw(...);

- Resolve multisampled render targets immediately after the render pass, not later.
  - This opens the door for more efficient multisampling on tiling GPUs.
  - Also consider calling `DiscardView` on any render target or depth buffer whose contents are no longer needed immediately after rendering. If done correctly, tiling GPUs can skip writing out image contents to memory.
  - Custom resolves are unaffected by this; only `ResolveSubresource` can be optimized. Desktop GPUs are unaffected.
- Use `GenerateMips` rather than a custom render pass to generate mip maps if linear filtering is sufficient for your application.
- When using indirect draws with no state changes in between, keep the stride between multiple draw arguments consistent. This way, we can merge consecutive indirect draws into a single `vkCmdDrawIndexedIndirect` or `vkCmdDrawIndirect` call.
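A consistent stride follows naturally when the argument structures are packed back to back in the argument buffer. The sketch below illustrates this from the application's point of view; `packedArgumentOffsets` and `hasConsistentStride` are hypothetical helpers, and the check is only a simplified model of what "consistent" means here, not DXVK's actual merging logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Argument layout read by DrawIndexedInstancedIndirect.
struct DrawIndexedArgs {
  uint32_t indexCountPerInstance;
  uint32_t instanceCount;
  uint32_t startIndexLocation;
  int32_t  baseVertexLocation;
  uint32_t startInstanceLocation;
};

static_assert(sizeof(DrawIndexedArgs) == 20, "arguments must be tightly packed");

// Byte offsets (AlignedByteOffsetForArgs) for drawCount draws whose
// arguments are stored back to back, starting at baseOffset.
inline std::vector<uint32_t> packedArgumentOffsets(uint32_t baseOffset, uint32_t drawCount) {
  assert(baseOffset % 4 == 0); // argument offsets must be 4-byte aligned
  std::vector<uint32_t> offsets(drawCount);
  for (uint32_t i = 0; i < drawCount; i++)
    offsets[i] = baseOffset + i * static_cast<uint32_t>(sizeof(DrawIndexedArgs));
  return offsets;
}

// Returns true if every consecutive pair of offsets is the same distance apart.
inline bool hasConsistentStride(const std::vector<uint32_t>& offsets) {
  for (size_t i = 2; i < offsets.size(); i++) {
    if (offsets[i] - offsets[i - 1] != offsets[1] - offsets[0])
      return false;
  }
  return true;
}
```

A sequence of draws issued at these offsets, with no state changes in between, is the pattern the guideline above describes.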
- Batch draw calls that use the same set of shaders and render state. Switching Vulkan pipelines is expensive and may affect both CPU and GPU performance. The following methods can trigger a pipeline swap:
  - `Set*Shader`.
  - `IASetInputLayout`.
  - `IASetPrimitiveTopology`.
  - `OMSetBlendState` if the blend state object or the sample mask changes. Changing the blend factor is cheap.
  - `OMSetDepthStencilState` if the depth-stencil state object changes. Changing the stencil reference is cheap.
  - `RSSetState` if depth bias enablement changes.
- For rasterizer states, avoid setting `FillMode` to `D3D11_FILL_WIREFRAME`. Doing so forces us to compile Vulkan pipelines at draw time, which may cause stutter.
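Batching by shaders and render state is commonly done by sorting the draw list on a key built from the state that affects the pipeline. A minimal sketch follows, assuming order-independent (e.g. opaque) geometry; `DrawItem` and its integer IDs are hypothetical stand-ins for the D3D11 objects an engine would track, and the key covers only part of the state listed above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record. Each ID identifies one D3D11 state object
// or shader combination; meshId is not part of the pipeline key.
struct DrawItem {
  uint32_t shaderSet;
  uint32_t inputLayout;
  uint32_t topology;
  uint32_t blendState;
  uint32_t depthStencilState;
  uint32_t meshId;
};

// State that can trigger a pipeline swap when it changes between draws.
inline auto pipelineKey(const DrawItem& d) {
  return std::tie(d.shaderSet, d.inputLayout, d.topology,
                  d.blendState, d.depthStencilState);
}

// Groups draws with identical pipeline state; stable so that the relative
// order of draws within a group is preserved.
inline void sortByPipelineState(std::vector<DrawItem>& draws) {
  std::stable_sort(draws.begin(), draws.end(),
    [] (const DrawItem& a, const DrawItem& b) {
      return pipelineKey(a) < pipelineKey(b);
    });
}

// Counts how often the pipeline key changes while walking the draw list,
// including the initial bind for the first draw.
inline size_t countPipelineChanges(const std::vector<DrawItem>& draws) {
  size_t changes = 0;
  for (size_t i = 0; i < draws.size(); i++) {
    if (i == 0 || pipelineKey(draws[i]) != pipelineKey(draws[i - 1]))
      changes++;
  }
  return changes;
}
```

For a draw list that alternates between two shader sets, sorting reduces the number of key changes from one per draw to two in total. Blended geometry that depends on draw order cannot be reordered this way.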
- Use Deferred Contexts for multithreaded rendering if your application is otherwise bound by its own rendering thread.
- Keep the number of `ID3D11CommandList` objects used in a frame reasonably small (~a few dozen), and record at least 50-100 draws into each command list.
  - DXVK internally buffers pre-processed commands in 16 kiB chunks of memory; this way, at least one such chunk should be well utilized.
  - Calling `ExecuteCommandList` itself is very cheap since our implementation will just forward those chunks to a background worker.
- Start executing command lists early within a frame in order to keep overall render latency low.
- Release `ID3D11CommandList` objects immediately after `ExecuteCommandList` so that resources held by those command lists can be released early. Note: This is especially important when uploading large amounts of data via `UpdateSubresource`.
- Set the `RestoreContextState` parameter of `ExecuteCommandList` and the `RestoreDeferredContextState` parameter of `FinishCommandList` to `FALSE`. Fully restoring context state is expensive on the CPU and may pessimize resource lifetime tracking.
- DXVK will expose native command list support by default, regardless of the GPU vendor. Use the appropriate `D3D11_FEATURE_THREADING` query to determine support rather than making assumptions based on vendor IDs.
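The command list sizing guideline above can be turned into a simple planning step that decides how many command lists to record for a frame. The following sketch is one possible heuristic, not a DXVK requirement; `planCommandLists` is a hypothetical helper, and the two constants are assumptions picked from within the ranges given above (50-100 draws per list, a few dozen lists per frame).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kMinDrawsPerList  = 64; // assumption, within the 50-100 guideline
constexpr uint32_t kMaxListsPerFrame = 32; // assumption for "a few dozen"

// Range of draws to record into one command list.
struct DrawRange {
  uint32_t first;
  uint32_t count;
};

// Splits drawCount draws into command lists so that no list is smaller than
// kMinDrawsPerList (unless the frame has fewer draws in total), using at
// most one list per worker thread and at most kMaxListsPerFrame lists.
inline std::vector<DrawRange> planCommandLists(uint32_t drawCount, uint32_t workerThreads) {
  assert(workerThreads > 0);
  std::vector<DrawRange> ranges;
  if (drawCount == 0)
    return ranges;

  uint32_t listCount = std::max<uint32_t>(1, drawCount / kMinDrawsPerList);
  listCount = std::min({ listCount, workerThreads, kMaxListsPerFrame });

  // Distribute draws evenly; the first few lists take the remainder.
  uint32_t base = drawCount / listCount;
  uint32_t remainder = drawCount % listCount;
  uint32_t first = 0;
  for (uint32_t i = 0; i < listCount; i++) {
    uint32_t count = base + (i < remainder ? 1 : 0);
    ranges.push_back({ first, count });
    first += count;
  }
  return ranges;
}
```

With 1000 draws and 8 worker threads this yields 8 lists of 125 draws each, while a frame with only 100 draws is recorded into a single list.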
- Use waitable swap chains in order to control frame pacing. As of DXVK 2.3, this is expected to yield the best results on platforms that properly implement `VK_KHR_present_wait`.
- Present statistics are implemented, but may behave differently from native DXGI since DXVK has no information about precise VBlank timings.
Vendor extensions are partially supported in Proton, but it is recommended to avoid relying on them for the following reasons:
- Only NVAPI works reliably in Proton. AMD AGS is partially supported, but the application must not statically link against the AGS library.
- Some features, such as shader intrinsics, are unsupported.
- Vendor extensions do not work at all when using DXVK on Windows, or on non-AMD/Nvidia GPUs, e.g. on ARM systems.
Important: Be robust against failures when loading and initializing vendor libraries, as well as error returns when using any given feature from these libraries. The Proton implementations of these libraries are not feature-complete.
DXVK tracks resource access at a subresource granularity for images, and at a byte range granularity for buffers. If two dispatches have the same set of UAVs bound, or otherwise use UAVs that access overlapping memory regions, D3D11 requires us to assume a data dependency and insert a barrier.
In some cases, this can be avoided:
- For buffers, if the exact buffer region written by any given draw or dispatch is known up front, bind only that region as a UAV rather than using an offset into a larger view. Otherwise, DXVK cannot prove that writes will not overlap.
- If all overlapping accesses are done via the same atomic operation, and the shader does not use the result of said operation, then the effect of executing commands out-of-order cannot be observed, and DXVK will not insert a barrier. In HLSL terms, this works with `InterlockedAdd`, `And`, `Max`, `Min`, `Or`, and `Xor`.
- If all overlapping accesses write the same constant. For typed UAVs, this only works if the constant is zero.
If your algorithm requires UAV overlap to be efficient, consider using DXVK's own extension interface, `ID3D11VkExtContext`.
- This can be queried from any `ID3D11DeviceContext`.
- `BeginUAVOverlap` translates to `SetBarrierControl(D3D11_VK_BARRIER_CONTROL_IGNORE_WRITE_AFTER_WRITE)`
- `EndUAVOverlap` translates to `SetBarrierControl(0)`
Inside a UAV overlap region, DXVK will only insert a barrier between draws or dispatches if:
- Any SRV or read-only UAV in the current draw or dispatch has been written by a previous draw or dispatch, or
- Any writable UAV in the current draw has been used as an SRV or read-only UAV in a previous draw or dispatch.
Calls to `ClearUnorderedAccessView` and any other copy or clear command are always synchronized with overlapping draws and dispatches, regardless of UAV overlap.
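Since a UAV overlap region has a begin and an end, it can be convenient to wrap the two `SetBarrierControl` calls in a scope guard so that the region is always closed. The sketch below is written as a template so that it can be shown without DXVK's headers; `BarrierControlScope` and `MockExtContext` are hypothetical, and in real code the context type would be `ID3D11VkExtContext` and the flag would be the `D3D11_VK_BARRIER_CONTROL_IGNORE_WRITE_AFTER_WRITE` value from DXVK's interface header.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical RAII helper: enables the given barrier control flags on
// construction and resets them to 0 on destruction. A null context is
// accepted so that the same code path works when the extension interface
// could not be queried, e.g. on a native D3D11 driver.
template<typename Context>
class BarrierControlScope {
public:
  BarrierControlScope(Context* context, uint32_t flags)
  : m_context(context) {
    if (m_context)
      m_context->SetBarrierControl(flags);
  }

  ~BarrierControlScope() {
    if (m_context)
      m_context->SetBarrierControl(0);
  }

  BarrierControlScope(const BarrierControlScope&) = delete;
  BarrierControlScope& operator = (const BarrierControlScope&) = delete;

private:
  Context* m_context;
};

// Mock used only to make this sketch self-contained; it records the flags
// passed to SetBarrierControl instead of talking to a real device context.
struct MockExtContext {
  std::vector<uint32_t> calls;
  void SetBarrierControl(uint32_t flags) { calls.push_back(flags); }
};
```

Draws and dispatches issued while the guard is alive fall inside the overlap region; the rules above for SRVs, read-only UAVs, copies and clears still apply within it.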
DXVK will by default not expose debug interfaces such as `ID3DUserDefinedAnnotation`. To enable this, run your application through RenderDoc and set the environment variable `DXVK_DEBUG=markers`.