Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[naga msl-out] Unbounded loop workaround is not adequate #6528

Closed
2 tasks
jimblandy opened this issue Nov 12, 2024 · 25 comments · Fixed by #6545
Closed
2 tasks

[naga msl-out] Unbounded loop workaround is not adequate #6528

jimblandy opened this issue Nov 12, 2024 · 25 comments · Fixed by #6545
Labels
area: naga back-end Outputs of naga shader conversion backend: metal Issues with Metal naga Shader Translator

Comments

@jimblandy
Copy link
Member

jimblandy commented Nov 12, 2024

I'll provide more details later, but the LOOP_IS_REACHABLE kludge has been found to be insufficient at avoiding UB and security problems. Basically, even with the kludge applied, the loop is still infinite, and still reached, so it's just a matter of being clever.

  • msl
  • hlsl
@jimblandy jimblandy added backend: metal Issues with Metal area: naga back-end Outputs of naga shader conversion naga Shader Translator labels Nov 12, 2024
@jimblandy
Copy link
Member Author

Tint's latest iteration is here or here.

Suppose the user presents you with a loop like this, where i is a u32:

loop {
    if (i < 10) {
        a[i] = b[i];
        break;
    }
}

The Metal Shading Language compiler is allowed to assume that all loops terminate. (This is just another way of saying, "infinite loops are UB".)

For that to be the case, we must reach the break, so we know that i < 10 when we reach the if. Thus, the entire loop can be rewritten as:

a[i] = b[i];

Since i is a u32, the compiler infers that 0 <= i < 10 when the assignment is reached. This means that if a and b are arrays of length 10 or greater, the compiler knows that i is in bounds, and thus can discard bounds checks injected by Naga.

Again, the reason the existing LOOP_IS_REACHABLE kludge doesn't suffice is that it merely persuades MSL that, since there is a branch over the infinite loop, conditions sufficient to lead to that branch aren't necessarily true. But the infinite loop is still there, so MSL is still free to infer other things based on that.

@jimblandy
Copy link
Member Author

jimblandy commented Nov 13, 2024

The fix is to generate code like this:

while (true) {
    // body of loop
    {volatile bool VOLATILE_NAME = false; if (VOLATILE_NAME) break;}
}

This forces the MSL compiler to assume that the loop does terminate, thus preventing it from assuming that other conditions leading to other breaks must be true.

This will probably have even more of a performance impact than our current LOOP_IS_REACHABLE kludge, since it's inside the loop, not outside it.

@teoxoy
Copy link
Member

teoxoy commented Nov 13, 2024

This is interesting, so the fact that "The Metal Shading Language compiler is allowed to assume that all loops terminate." + the fact that "indexing out of bounds is UB" (presumably allowing the compiler to also assume it doesn't happen) allows for this sort of optimizations where the bounds checks we inject are thrown away.

I finally understand the situation. I didn't think that UB could mean "assume it never happens" but I guess that's valid. This is unfortunate as we have to cope with this sort of ramifications.

@teoxoy
Copy link
Member

teoxoy commented Nov 13, 2024

Thinking about this some more, I always thought of UB as strictly happening at runtime (ex: if you index out of bounds at runtime, here be dragons). If that were the case, the optimization happening here would have been a bug.

That's clearly not the case https://en.cppreference.com/w/cpp/language/ub.
I'm not sure how C++ developers deal with this.

@ErichDonGubler
Copy link
Member

@teoxoy: It is with a certain dubious delight that I present this old classic for your reading: Undefined behavior can result in time travel

@teoxoy
Copy link
Member

teoxoy commented Nov 14, 2024

I wanted to see the effect of this for myself: https://shader-playground.timjones.io/b13aed3846180491656e076460d45bce

It looks like the bound check is still in the output of the Metal compiler, the loop is gone and the if became:

  %5 = icmp ult i32 %1, 10
  call void @llvm.assume(i1 %5)

@llvm.assume is documented as:

Allows the optimizer to assume that the provided condition is true. This information can then be used in simplifying other parts of the code.

And I'm guessing that the optimizer will see that i is at most 9 and will replace the bound check (metal::min(i, 15u - 1)) with i.

Adding

{volatile bool VOLATILE_NAME = false; if (VOLATILE_NAME) break;}

to the end of the loop or using a volatile bool as the loop condition fixes the issue; the current LOOP_IS_REACHABLE macro doesn't (the @llvm.assume is still in the output).

@teoxoy
Copy link
Member

teoxoy commented Nov 14, 2024

I was curious to see what DXC does: https://shader-playground.timjones.io/9eb4af3bb2082c18cbd216659849eea1

It seems like it removes the loop and the if altogether but makes no assumptions about the value of i. This will preserve the bounds checks but could be viewed as a bug from the perspective of a WGSL developer since code will get executed regardless of the value of i. The WGSL spec doesn't disallow infinite loops unless they are trivial.

It seems to me that all of these headaches would be solved if we'd add a generous upper bound (u32::MAX?u64::MAX?) on the number of loop iterations in the WGSL spec. It could be that by the time code reaches one of these upper bounds, the driver would lose the device anyway.

@teoxoy
Copy link
Member

teoxoy commented Nov 14, 2024

@magcius also mentioned in one of the chats that SPIR-V considers infinite loops UB but I couldn't find info on that. Jasper, do you have some links/references?

jimblandy added a commit to jimblandy/wgpu that referenced this issue Nov 14, 2024
In MSL output, avoid undefined behavior due to unbounded loops by
adding an unpredictable, never-actually-taken `break` to the bottom of
each loop body, rather than adding an unpredictable,
never-actually-taken branch over each loop.

This will probably have more of a performance impact, because it
affects each iteration of the loop, but unlike branching over the
loop, which leaves infinite loops (and thus undefined behavior) in the
output, this actually ensures that no loop presented to Metal is
unbounded, so that there is no undefined behavior present that the
optimizer could use to make unwelcome inferences.

Fixes gfx-rs#6528.
@jimblandy
Copy link
Member Author

jimblandy commented Nov 14, 2024

It seems to me that all of these headaches would be solved if we'd add a generous upper bound (u32::MAX?u64::MAX?) on the number of loop iterations in the WGSL spec. It could be that by the time code reaches one of these upper bounds, the driver would lose the device anyway.

OMG THIS IS BRILLIANT

This may well be the Right Fix for this problem. My theory is that the biggest perf impact is not actually the dumb never-taken branch at runtime, but the fact that the volatile makes it impossible for the compiler to tell how many times the loop will execute, so it can't unroll loops, or things like that.

But notice that, before the compiler unrolls a loop, it must first know how many iterations it will take. That implies that if we transform every loop into:

var limit = 0xffffffffu;
while (limit-- > 0) {
    // loop body
}

then the compiler may well still be able to detect that the loop will only run (say) four times anyway, and thus it can still unroll it. In any case where the optimizer can determine that the loop will exit for other reasons, it will boil away limit entirely. For example, I would absolutely expect LLVM to be able to turn this:

var i = 0;
var limit = 0xffffffffu;
while (limit -- > 0) {
    if (i >= 4) {
        break;
    }
    f(a[i]);
}

into

f(a[0]);
f(a[1]);
f(a[2]);
f(a[3]);

and I bet these simple cases are the kind of thing that are actually hurting performance. With the volatile kludge, the compiler can't do a darned thing.

@jimblandy
Copy link
Member Author

The spec doesn't give us permission to put bounds on the loops like that - but any driver is going to kill us before we execute for too long anyway. And we can put a ridiculous limit like 0xffff_ffff_ffff_ffff on there, that's fine: the plan is we're never going to actually reach it anyway. It simply ensures that we're not writing code with UB. WGSL doesn't have u64, but MSL does have uint64_t.

@jimblandy
Copy link
Member Author

@magcius also mentioned in one of the chats that SPIR-V considers infinite loops UB but I couldn't find info on that. Jasper, do you have some links/references?

I don't remember @magcius saying this, I thought it was @DemiMarie. She tried to chase down the sources but we didn't get a reply.

@jimblandy
Copy link
Member Author

Note that Tint doesn't have any similar workarounds in their SPIR-V backend. I would expect them to have mentioned this if it were needed. And I can't find anything in the SPIR-V or Vulkan specs that suggests that infinite loops are UB.

@jimblandy
Copy link
Member Author

jimblandy commented Nov 14, 2024

According to godbolt.org, GCC compiles this C++:

void f(int *a) {
    int limit = 0xffff;
    for (int i = 0; i < 4; i++) {
        a[i]++;
        if (--limit == 0)
            break;
    }
}

to this x86_64 assembly:

        movdqu  xmm1, XMMWORD PTR [rdi]
        mov     eax, 1
        movd    xmm0, eax
        pshufd  xmm0, xmm0, 0
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi], xmm0
        ret

It's unrolling the entire four-iteration loop into a non-looped series of SIMD instructions. Clang does something similar.

@DemiMarie
Copy link

@magcius also mentioned in one of the chats that SPIR-V considers infinite loops UB but I couldn't find info on that. Jasper, do you have some links/references?

I don't remember @magcius saying this, I thought it was @DemiMarie. She tried to chase down the sources but we didn't get a reply.

It was a discussion in Mesa, but I think the conclusion was that this is allowed in SPIR-V and so such optimizations would be Mesa bugs.

@cwfitzgerald
Copy link
Member

I think we should either do u32::MAX, or we should do two counters of 32bit, as I'm worried about the performance implications of 64bit arithmetic, especially on gpus where it's 1/64th rate

@cwfitzgerald
Copy link
Member

Also the limit check needs to be at the very beginning of the loop, before any user code, or else a while (true) { continue; } would still be infinite

@DemiMarie
Copy link

I think we should either do u32::MAX, or we should do two counters of 32bit, as I'm worried about the performance implications of 64bit arithmetic, especially on gpus where it's 1/64th rate

Which GPUs are these?

@jimblandy
Copy link
Member Author

It's easy enough to chain them together:

var lower: u32 = 0xffff_ffff;
var higher: u32 = 0xffff_ffff;
...
higher -= select(1, 0, lower == 0);
lower -= 1;
if ((lower | higher) == 0) {
    break;
}

But are there really ever loops that run 2^32 iterations in a shader? I'm skeptical.

@jimblandy
Copy link
Member Author

It seems like it removes the loop and the if altogether but makes no assumptions about the value of i. This will preserve the bounds checks but could be viewed as a bug from the perspective of a WGSL developer since code will get executed regardless of the value of i. The WGSL spec doesn't disallow infinite loops unless they are trivial.

This transformation does make me uncomfortable as well. I'm unhappy about these dumb checks creeping in everywhere, but given that the compiler is already doing surprising things, I just don't think we can anticipate what the consequences would be for other, more creatively constructed shaders.

So I think we may need to do this dumb loop bounds injection kludge for HLSL as well.

@DemiMarie
Copy link

Would it be reasonable to add this to SPIR-V?

@jimblandy
Copy link
Member Author

I've been looking at conversation like this, and it does not leave me feeling confident that we can assume unbounded loops will be handled correctly in SPIR-V. I'd like to hear from someone more authoritative, and ideally learn where the spec actually weighs in on this. But we should be prepared.

@github-project-automation github-project-automation bot moved this from Todo to Done in WebGPU for Firefox Nov 18, 2024
@teoxoy
Copy link
Member

teoxoy commented Nov 20, 2024

I found the internal Khronos issue (https://gitlab.khronos.org/cross-api/memory-model/-/issues/145), something that came out of it is https://github.com/KhronosGroup/Vulkan-Docs/blame/a61256a946d2b48b9efdd3a18c3fd4c4c354b949/chapters/commonvalidity/draw_dispatch_common.adoc#L519-L520.

So it's not spelled out that infinite loops are UB but it can be inferred since all invocations must terminate.

@teoxoy
Copy link
Member

teoxoy commented Nov 20, 2024

Other than putting an upper limit on the number of loop iterations I don't know how else we would be able to avoid the issue for SPIR-V.

@DemiMarie
Copy link

Other than putting an upper limit on the number of loop iterations I don't know how else we would be able to avoid the issue for SPIR-V.

Could Naga use this to support preempting a shader? There are algorithms to transform a program in such a way that it is guaranteed to reach a safepoint in a bounded amount of time. Those are usually used by language runtimes to support garbage collection or lightweight thread preemption, but they could also be used to preempt a shader before the driver gives up and resets the GPU. Resetting the GPU can often have unwanted knock-on effects, such as crashing the host compositor, so it is definitely something to avoid if possible.

For native API users, this could also allow supporting long-running compute. Instead of terminating the shader, the code would save all shader state to a buffer, from which it could be restored later. This would allow long running compute jobs on hardware and drivers that do not natively support them.

@teoxoy
Copy link
Member

teoxoy commented Nov 20, 2024

I guess we could do that but it sounds non-trivial, it would be worth opening a new issue for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: naga back-end Outputs of naga shader conversion backend: metal Issues with Metal naga Shader Translator
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants