Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile bug: new mma fattn uses constructs that can not be unrolled by llvm casueing a tonne of warnings #11602

Closed
IMbackK opened this issue Feb 2, 2025 · 5 comments · Fixed by #11604
Assignees

Comments

@IMbackK
Copy link
Collaborator

IMbackK commented Feb 2, 2025

Git commit

90f9b88

Operating systems

Linux

GGML backends

CUDA, HIP

Problem description & steps to reproduce

this construct https://github.com/ggerganov/llama.cpp/blob/90f9b88afb6447d3929843a2aa98c0f11074762d/ggml/src/ggml-cuda/fattn-common.cuh#L553 can not be unrolled by llvm for gpu targets (ie amdgcn) when ne01 is unkown at compile time, at the moment this causes several hundred warnings (one set for each arch) when compiling for rocm, please silence this like done for https://github.com/ggerganov/llama.cpp/blob/90f9b88afb6447d3929843a2aa98c0f11074762d/ggml/src/ggml-cuda/softmax.cu#L18

First Bad Commit

No response

Compile command

any

Relevant log output

llama.cpp/ggml/src/ggml-cuda/template-instances/../fattn-common.cuh:523:24: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
@JohannesGaessler
Copy link
Collaborator

JohannesGaessler commented Feb 2, 2025

What do you mean, it cannot be unrolled? The loop itself has a fixed length, the break should just be a conditional GOTO to the end of the loop and I don't see why this condition couldn't be dynamic.

@IMbackK
Copy link
Collaborator Author

IMbackK commented Feb 2, 2025

it could be, but the compiler is too dumb

this construct with break is not supported.

@IMbackK
Copy link
Collaborator Author

IMbackK commented Feb 2, 2025

How sure are you that the cuda compiler can unroll this loop btw, since no compiler can reasonably unroll the for (int col0 = 0; col0 < ncols; col0 += block_size) when ncols is not constexpr as is the case in softmax, which apparently dident trigger a warning on cuda either.

@JohannesGaessler
Copy link
Collaborator

With the softmax kernel it makes sense that the compiler would issue a warning since for the case with ncols == 0 since in that case the size of the loop is actually unknown. I interpreted your comments in #11471 to mean that it is only the template specialization with ncols == 0 where warnings are issued.

In the new stream-k fixup I don't understand the warning at all. If you look at the stream-k fixup in mmq.cuh it is extremely similar and should also cause warning spam if those loops cannot actually be unrolled; in fact this is a very common pattern in the code that I've written. I tried compiling the new fixup kernel with and without #pragma unroll and got the same PTX code. But unfortunately that is an inconclusive result because the compiler can decide to unroll loops on its own. For now I think we should just suppress the warning, if the loop cannot be successfully unrolled that is at most a performance problem.

@IMbackK
Copy link
Collaborator Author

IMbackK commented Feb 2, 2025

With the softmax kernel it makes sense that the compiler would issue a warning since for the case with ncols == 0 since in that case the size of the loop is actually unknown. I interpreted your comments in #11471 to mean that it is only the template specialization with ncols == 0 where warnings are issued.

Yes that is correct.

So the problem is, that in a previous pass llvm has folded the loop-with-break into a loop with adjusted bounds based on the location of the break. When in a later pass the loop is then supposed to be unrolled the bounds of the loop are no longer be constexpr when the condition for branch with the break is not constexpr. Yes this is dumb, but thats how llvm works at the moment.

the case in softmax.cu works fine beacuse the condition of the branch with the break is known at compile time (ie never) when ncols_template is not zero.

In mmq.cuh i cant find a location in the code where you break out of a loop in a branch whos evaluation is not known at compile time so is fine there too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants