-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[naga msl-out] Unbounded loop workaround is not adequate #6528
Comments
Tint's latest iteration is here or here. Suppose the user presents you with a loop like this, where loop {
if (i < 10) {
a[i] = b[i];
break;
}
} The Metal Shading Language compiler is allowed to assume that all loops terminate. (This is just another way of saying, "infinite loops are UB".) For that to be the case, we must reach the a[i] = b[i]; Since Again, the reason the existing |
The fix is to generate code like this: while (true) {
// body of loop
{volatile bool VOLATILE_NAME = false; if (VOLATILE_NAME) break;}
} This forces the MSL compiler to assume that the loop does terminate, thus preventing it from assuming that other conditions leading to other breaks must be true. This will probably have even more of a performance impact than our current |
This is interesting, so the fact that "The Metal Shading Language compiler is allowed to assume that all loops terminate." + the fact that "indexing out of bounds is UB" (presumably allowing the compiler to also assume it doesn't happen) allows for this sort of optimizations where the bounds checks we inject are thrown away. I finally understand the situation. I didn't think that UB could mean "assume it never happens" but I guess that's valid. This is unfortunate as we have to cope with this sort of ramifications. |
Thinking about this some more, I always thought of UB as strictly happening at runtime (ex: if you index out of bounds at runtime, here be dragons). If that were the case, the optimization happening here would have been a bug. That's clearly not the case https://en.cppreference.com/w/cpp/language/ub. |
@teoxoy: It is with a certain dubious delight that I present this old classic for your reading: Undefined behavior can result in time travel |
I wanted to see the effect of this for myself: https://shader-playground.timjones.io/b13aed3846180491656e076460d45bce It looks like the bound check is still in the output of the Metal compiler, the loop is gone and the if became: %5 = icmp ult i32 %1, 10
call void @llvm.assume(i1 %5)
And I'm guessing that the optimizer will see that Adding {volatile bool VOLATILE_NAME = false; if (VOLATILE_NAME) break;} to the end of the loop or using a volatile bool as the loop condition fixes the issue; the current |
I was curious to see what DXC does: https://shader-playground.timjones.io/9eb4af3bb2082c18cbd216659849eea1 It seems like it removes the loop and the if altogether but makes no assumptions about the value of It seems to me that all of these headaches would be solved if we'd add a generous upper bound ( |
@magcius also mentioned in one of the chats that SPIR-V considers infinite loops UB but I couldn't find info on that. Jasper, do you have some links/references? |
In MSL output, avoid undefined behavior due to unbounded loops by adding an unpredictable, never-actually-taken `break` to the bottom of each loop body, rather than adding an unpredictable, never-actually-taken branch over each loop. This will probably have more of a performance impact, because it affects each iteration of the loop, but unlike branching over the loop, which leaves infinite loops (and thus undefined behavior) in the output, this actually ensures that no loop presented to Metal is unbounded, so that there is no undefined behavior present that the optimizer could use to make unwelcome inferences. Fixes gfx-rs#6528.
OMG THIS IS BRILLIANT This may well be the Right Fix for this problem. My theory is that the biggest perf impact is not actually the dumb never-taken branch at runtime, but the fact that the volatile makes it impossible for the compiler to tell how many times the loop will execute, so it can't unroll loops, or things like that. But notice that, before the compiler unrolls a loop, it must first know how many iterations it will take. That implies that if we transform every loop into: var limit = 0xffffffffu;
while (limit-- > 0) {
// loop body
} then the compiler may well still be able to detect that the loop will only run (say) four times anyway, and thus it can still unroll it. In any case where the optimizer can determine that the loop will exit for other reasons, it will boil away var i = 0;
var limit = 0xffffffffu;
while (limit -- > 0) {
if (i >= 4) {
break;
}
f(a[i]);
} into f(a[0]);
f(a[1]);
f(a[2]);
f(a[3]); and I bet these simple cases are the kind of thing that are actually hurting performance. With the |
The spec doesn't give us permission to put bounds on the loops like that - but any driver is going to kill us before we execute for too long anyway. And we can put a ridiculous limit like |
I don't remember @magcius saying this, I thought it was @DemiMarie. She tried to chase down the sources but we didn't get a reply. |
Note that Tint doesn't have any similar workarounds in their SPIR-V backend. I would expect them to have mentioned this if it were needed. And I can't find anything in the SPIR-V or Vulkan specs that suggests that infinite loops are UB. |
According to godbolt.org, GCC compiles this C++: void f(int *a) {
int limit = 0xffff;
for (int i = 0; i < 4; i++) {
a[i]++;
if (--limit == 0)
break;
}
} to this x86_64 assembly:
It's unrolling the entire four-iteration loop into a non-looped series of SIMD instructions. Clang does something similar. |
It was a discussion in Mesa, but I think the conclusion was that this is allowed in SPIR-V and so such optimizations would be Mesa bugs. |
I think we should either do u32::MAX, or we should do two counters of 32bit, as I'm worried about the performance implications of 64bit arithmetic, especially on gpus where it's 1/64th rate |
Also the limit check needs to be at the very beginning of the loop, before any user code, or else a |
Which GPUs are these? |
It's easy enough to chain them together: var lower: u32 = 0xffff_ffff;
var higher: u32 = 0xffff_ffff;
...
higher -= select(1, 0, lower == 0);
lower -= 1;
if ((lower | higher) == 0) {
break;
} But are there really ever loops that run 2^32 iterations in a shader? I'm skeptical. |
This transformation does make me uncomfortable as well. I'm unhappy about these dumb checks creeping in everywhere, but given that the compiler is already doing surprising things, I just don't think we can anticipate what the consequences would be for other, more creatively constructed shaders. So I think we may need to do this dumb loop bounds injection kludge for HLSL as well. |
Would it be reasonable to add this to SPIR-V? |
I've been looking at conversation like this, and it does not leave me feeling confident that we can assume unbounded loops will be handled correctly in SPIR-V. I'd like to hear from someone more authoritative, and ideally learn where the spec actually weighs in on this. But we should be prepared. |
I found the internal Khronos issue (https://gitlab.khronos.org/cross-api/memory-model/-/issues/145), something that came out of it is https://github.com/KhronosGroup/Vulkan-Docs/blame/a61256a946d2b48b9efdd3a18c3fd4c4c354b949/chapters/commonvalidity/draw_dispatch_common.adoc#L519-L520. So it's not spelled out that infinite loops are UB but it can be inferred since all invocations must terminate. |
Other than putting an upper limit on the number of loop iterations I don't know how else we would be able to avoid the issue for SPIR-V. |
Could Naga use this to support preempting a shader? There are algorithms to transform a program in such a way that it is guaranteed to reach a safepoint in a bounded amount of time. Those are usually used by language runtimes to support garbage collection or lightweight thread preemption, but they could also be used to preempt a shader before the driver gives up and resets the GPU. Resetting the GPU can often have unwanted knock-on effects, such as crashing the host compositor, so it is definitely something to avoid if possible. For native API users, this could also allow supporting long-running compute. Instead of terminating the shader, the code would save all shader state to a buffer, from which it could be restored later. This would allow long running compute jobs on hardware and drivers that do not natively support them. |
I guess we could do that but it sounds non-trivial, it would be worth opening a new issue for it. |
I'll provide more details later, but the
LOOP_IS_REACHABLE
kludge has been found to be insufficient at avoiding UB and security problems. Basically, even with the kludge applied, the loop is still infinite, and still reached, so it's just a matter of being clever.The text was updated successfully, but these errors were encountered: