multi_draw_indirect_* on Metal #2148
Hm, so I tried just adding the feature flag to the accepted list of feature flags for Metal, and it seemed to just work (I'm running that on a MacBook Air 2020, Big Sur): FredrikNoren@7da68cb
This one does indeed work, but the count variants are still unimplemented stubs:

```rust
unsafe fn draw_indirect_count(
    &mut self,
    _buffer: &super::Buffer,
    _offset: wgt::BufferAddress,
    _count_buffer: &super::Buffer,
    _count_offset: wgt::BufferAddress,
    _max_count: u32,
) {
    //TODO
}

unsafe fn draw_indexed_indirect_count(
    &mut self,
    _buffer: &super::Buffer,
    _offset: wgt::BufferAddress,
    _count_buffer: &super::Buffer,
    _count_offset: wgt::BufferAddress,
    _max_count: u32,
) {
    //TODO
}
```

We could try to implement it. Or we could try to split the feature in two (count and non-count).
Ah, I didn't realize we emulated multi-draw-indirect on Metal with a for-loop of single indirect draws. Using this idea, the way we could implement MDIC without a CPU readback is by dispatching a compute shader which copies the draws below the count and zeroes out the ones above it. This is getting dangerously close to "emulation", and I would rather just split the feature. Edit: it's already split; there are two features, MDI and MDIC.
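For concreteness, here is a rough sketch of what such a clamp pass could look like. This is not wgpu's code: the WGSL struct just mirrors wgpu's `DrawIndexedIndirect` layout, and the bindings are made up for illustration. Each invocation copies its command through if it is below the GPU-written count and zeroes it otherwise, so the backend can then blindly issue `max_count` indirect draws:

```rust
// Hypothetical WGSL for the clamp pass described above (a sketch, not
// wgpu's implementation). `src` holds max_count commands, `count` is the
// GPU-written draw count, and `dst` is what the backend actually draws from.
const CLAMP_DRAWS_WGSL: &str = r#"
struct DrawIndexedIndirect {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

@group(0) @binding(0) var<storage, read> src: array<DrawIndexedIndirect>;
@group(0) @binding(1) var<storage, read> count: u32;
@group(0) @binding(2) var<storage, read_write> dst: array<DrawIndexedIndirect>;

@compute @workgroup_size(64)
fn clamp_draws(@builtin(global_invocation_id) id: vec3<u32>) {
    let i = id.x;
    if (i >= arrayLength(&dst)) {
        return;
    }
    if (i < count) {
        dst[i] = src[i];
    } else {
        // A zeroed command (index_count == 0) draws nothing.
        dst[i] = DrawIndexedIndirect(0u, 0u, 0u, 0, 0u);
    }
}
"#;
```

The cost is an extra dispatch plus a full copy of the command buffer per multi-draw, which is part of why splitting the feature looks attractive.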
To add some more to this issue: this is now my #1 CPU blocker on Metal. On Windows, the CPU is basically idle for the same scene (~15 ms; "drop render pass" doesn't even show up). This is my rendering code:

```rust
#[cfg(not(target_os = "macos"))]
{
    render_pass.multi_draw_indexed_indirect(
        &cull_state.commands.buffer(),
        offset * std::mem::size_of::<DrawIndexedIndirect>() as u64,
        mat.entities.len() as u32,
    );
}
#[cfg(target_os = "macos")]
{
    for i in 0..mat.entities.len() {
        render_pass.draw_indexed_indirect(
            &cull_state.commands.buffer(),
            (offset + i as u64) * std::mem::size_of::<DrawIndexedIndirect>() as u64,
        );
    }
}
```

I was looking into whether I could try to add Indirect Command Buffers to the Metal HAL myself, but I'm honestly quite lost. Any pointers or suggestions would be appreciated! Or if there's some other way to reduce the CPU overhead (I was looking at the `instance_count` field, but I'm really not sure how I could use that).
@FredrikNoren thank you for sharing! As for Indirect Command Buffers (ICBs): we wanted to use them as the implementation for GPU-side RenderBundles. So the road to this would be: first, adding a feature flag (used internally by wgpu-hal) for native render bundle support; then adding the relevant API in wgpu-hal; implementing it on Metal; and finally hooking the real WebGPU RenderBundle logic up to this wgpu-hal feature (where supported). It's a relatively big chunk of work. Perhaps you could run this workload through Metal's System Trace to see what the sampling profiler reports. Maybe indirect calls on Metal are just that slow? Or maybe we are doing something silly in wgpu-core.
@kvark That sounds great! I'd love to see whether switching to those RenderBundles would help in the future, then. Is there any timeline for when that might be available? I ran it through the Metal System Trace, but to be honest I'm not 100% sure how to read the results. Here's the trace: https://drive.google.com/file/d/1pqWsBappybinzbDF7PTPD51VddDi7tit/view?usp=sharing
So, a bit more context here: there are two different ways that we will interact with Metal indirect command buffers in wgpu.

The first is through render bundles. These allow the user to record a series of commands on the CPU, then replay them multiple times over the course of a program. This maps fairly directly to indirect command buffers and just needs the work done on it; the plan of attack has been set. If you don't need to change the volume of, or the arguments to, the calls, this can be a good solution.

The second is through multi-draw indirect. This allows a shader to write out a series of draw calls that will be executed as a single "blast" of draw calls. It is most useful for GPU culling and other places where the GPU wants to generate a possibly arbitrary number of draw calls. Right now, the design of wgpu's multi-draw indirect requires that a compute shader be added in the Metal backend which translates the Vulkan-style multi-draw-indirect buffer into the Metal indirect command buffer. This is far from the most efficient implementation, and I would love to see a world where we implement something that lets shaders write directly to ICBs.
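For reference, the record-once/replay-many shape already exists at the wgpu API level. Here's a minimal sketch, assuming `device`, `pipeline`, `vertex_buf`, and `vertex_count` already exist, and that the bundle's formats match the pass it's executed in (descriptor fields vary slightly across wgpu versions):

```rust
// Record a bundle once, outside the per-frame loop.
let mut bundle_encoder =
    device.create_render_bundle_encoder(&wgpu::RenderBundleEncoderDescriptor {
        label: Some("static geometry"),
        color_formats: &[Some(wgpu::TextureFormat::Bgra8UnormSrgb)],
        depth_stencil: None,
        sample_count: 1,
        multiview: None,
    });
bundle_encoder.set_pipeline(&pipeline);
bundle_encoder.set_vertex_buffer(0, vertex_buf.slice(..));
bundle_encoder.draw(0..vertex_count, 0..1);
let bundle = bundle_encoder.finish(&wgpu::RenderBundleDescriptor {
    label: Some("static geometry bundle"),
});

// Replay it every frame inside a render pass. Today this re-encodes the
// commands internally; the plan discussed above is to back it with a
// native Metal indirect command buffer instead.
render_pass.execute_bundles(std::iter::once(&bundle));
```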
@cwfitzgerald side note: so that's where your proposal comes in? With an indirect command encoder being an API primitive, we'll be able to implement it purely on the host side, without writing from compute shaders. That seems like a solid argument, although I'm not sure it's enough to warrant the API complication.
Writing to ICBs from shaders sounds great! In the meantime, is there any way to circumvent webgpu to get access to Metal directly, or would a fork be my best option?
@kvark Re: the many encoders; I think it's actually just two encoders (I just open two in the code), but what's showing is each render pass. Also not sure why symbols don't show up; it's built with
@kvark we already have ICBs on the host side: render bundles (we haven't implemented them yet, but we can). My proposal was to add a multi-draw-indirect-like ICB as a WGSL buffer type, so that a user's compute shader can freely write to it. This would let Vulkan, DX12, and Metal resolve to different ICB code (Vulkan: a "normal" MDI buffer; DX12: an MDI buffer with 3 push constants; Metal: an actual ICB). This also lets us do the bounds checking during the write, without needing a separate compute shader to validate.
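To make the "Vulkan-style MDI buffer" side of this concrete, here is an illustrative sketch of a GPU culling pass that writes such a buffer today, which is the kind of shader the proposal would let target an ICB directly. Every name here is hypothetical, the visibility check is a stand-in for a real frustum test, and `draw_count` is assumed to be zeroed before the dispatch:

```rust
// Illustrative WGSL (all names hypothetical): a culling pass appends plain
// DrawIndexedIndirect records into a storage buffer plus an atomic count.
const CULL_WGSL: &str = r#"
struct DrawIndexedIndirect {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

struct Instance {
    bounds: vec4<f32>, // bounding sphere: xyz = center, w = radius
    index_count: u32,
    first_index: u32,
}

@group(0) @binding(0) var<storage, read> instances: array<Instance>;
@group(0) @binding(1) var<storage, read_write> draws: array<DrawIndexedIndirect>;
@group(0) @binding(2) var<storage, read_write> draw_count: atomic<u32>;

@compute @workgroup_size(64)
fn cull(@builtin(global_invocation_id) id: vec3<u32>) {
    let i = id.x;
    if (i >= arrayLength(&instances)) {
        return;
    }
    // Stand-in visibility test; a real pass would test against frustum planes.
    if (instances[i].bounds.w > 0.0) {
        let slot = atomicAdd(&draw_count, 1u);
        // first_instance = i lets the vertex shader look up per-entity data
        // (in wgpu that needs the INDIRECT_FIRST_INSTANCE feature).
        draws[slot] = DrawIndexedIndirect(
            instances[i].index_count, 1u,
            instances[i].first_index, 0, i,
        );
    }
}
"#;
```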
Is your feature request related to a problem? Please describe.
I'd like to implement GPU-side culling, and from what I understand `multi_draw_indirect_*` is a key component of this. It would be amazing if this could be supported on OSX as well. (The `_count` versions would be even better, but as far as I understand they're not available in Metal? I could be wrong.)

Describe the solution you'd like
`multi_draw_indirect_*` available on OSX/Metal.

Describe alternatives you've considered
I guess for now I can do a manual for-loop and issue one `draw_*_indirect` call per command (which I would still build on the GPU, for the GPU-side culling) to simulate `multi_draw*`.

Additional context
@cwfitzgerald, @jasperrlz and I talked about this yesterday in the Matrix chat; figured I'd also post an issue to track this. Also, from the discussion in #742 it sounded like there could be a way to do it?