Erase all address spaces and get inlined ukernels (#19646)
The `LLVMGPUCastAddressSpaceFunction` pass was selectively erasing the shared memory address space from pointers around Call ops to achieve inlining. This PR generalizes that to erasing all address spaces, after checking with its original author that there wasn't anything intentional here: [discord](https://discord.com/channels/689900678990135345/1282818085153407038/1326577591557296272)

This has the intended effect of allowing AMDGPU ukernels to get inlined into their callers.

There is a side benefit of not having to duplicate ukernels for the various combinations of address spaces of their pointer parameters. This benefit will be partly rolled back if and when we do assembly ukernels, as these will need to know the address spaces to write different instructions, but at least for C ukernels it is nice.

It was counter-intuitive to me that erasing address spaces was possible at all. The key is that these ukernels only get compiled to LLVM IR, not to ISA, and the resulting IR gets inlined into a caller where the addrspacecast was done and where the actual address space is known. After inlining, the compiler is still able to propagate the actual address spaces all the way into the inlined ukernel code.

For the current `multi_mma` ukernel there was no immediate problem. The changes to it in this PR are reaping the benefits of inlining: now the `unroll_*` parameters become compile-time constants after inlining, so we get to simply declare our accumulator tile as a VLA and let it get specialized to a normal fixed-size array. There is no longer any need to use an arbitrary fixed-size array and try to guard that with assertions.

For the existing `argmax` ukernels, the inlining revealed a preexisting issue: these ukernels are reductions to a single scalar and, instead of returning it by value, they write their result value to an output buffer (which happens to be LDS memory, but the address space doesn't matter). The problem was that there was no synchronization between the thread writing the value in the ukernel and the threads reading the value in the caller. Solved by adding a `__threadfence_block()`, which compiles to almost nothing in ISA (`s_waitcnt`, which we have anyway around memory accesses) but prevents IR rewrites from removing the loads from the output buffer.

I added `__threadfence_block()` to common.h, copied from AMD device library headers, along with a few other synchronization functions which we anticipate will be useful in other ukernels. `__syncthreads` is not used in this PR.

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
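The VLA point made for `multi_mma` can be illustrated with a small host-side sketch in plain C. This is not the ukernel from this PR: the function name `tile_accumulate`, its parameters, the data layout and the arithmetic are all hypothetical stand-ins. It only shows the shape of the idea: the accumulator tile is declared as a VLA sized by the unroll parameters, and once the function is inlined into a caller that passes constants, the compiler is free to treat it as an ordinary fixed-size array.

```c
#include <assert.h>

// Hypothetical stand-in for a ukernel; not the actual multi_mma code.
// a holds k rows of unroll_m values, b holds k rows of unroll_n values,
// and c is a row-major unroll_m x unroll_n tile that is accumulated into.
static inline void tile_accumulate(const float *a, const float *b, float *c,
                                   int unroll_m, int unroll_n, int k) {
  // A VLA with a non-positive size is undefined behavior, so check here.
  assert(unroll_m > 0 && unroll_n > 0);

  // Accumulator tile declared as a VLA. When this function is inlined
  // into a caller that passes constant unroll_* values, the sizes are
  // known at compile time and no arbitrary upper bound is needed.
  float acc[unroll_m][unroll_n];

  // VLAs cannot take an initializer, so load the tile explicitly.
  for (int i = 0; i < unroll_m; ++i)
    for (int j = 0; j < unroll_n; ++j)
      acc[i][j] = c[i * unroll_n + j];

  // Rank-1 update per step of the reduction dimension.
  for (int kk = 0; kk < k; ++kk)
    for (int i = 0; i < unroll_m; ++i)
      for (int j = 0; j < unroll_n; ++j)
        acc[i][j] += a[kk * unroll_m + i] * b[kk * unroll_n + j];

  // Write the accumulators back to the output tile.
  for (int i = 0; i < unroll_m; ++i)
    for (int j = 0; j < unroll_n; ++j)
      c[i * unroll_n + j] = acc[i][j];
}
```

A caller would pass literal constants, for example `tile_accumulate(a, b, c, 2, 2, 2)`. Whether the VLA is actually lowered to a fixed-size array depends on the call being inlined, which is the effect this PR is after.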
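The `argmax` fix follows a "store the result, then fence" pattern, which can be sketched as a host-side analogy in standard C11. The function `argmax_to_buffer` is hypothetical, and `atomic_thread_fence` from `<stdatomic.h>` is used here only as a portable stand-in for the device-side `__threadfence_block()`. The sketch is single-threaded, so it does not reproduce the GPU issue; it only shows where the fence sits relative to the store into the output buffer.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical host-side analogy; not the actual argmax ukernel.
// Reduces n >= 1 inputs to the index of the largest value and writes
// that index to out[0] instead of returning it by value.
static inline void argmax_to_buffer(const float *in, size_t n, int64_t *out) {
  size_t best = 0;
  for (size_t i = 1; i < n; ++i) {
    if (in[i] > in[best]) best = i;  // first maximum wins on ties
  }

  // Result goes to the output buffer, as in the ukernels described above.
  out[0] = (int64_t)best;

  // Stand-in for __threadfence_block(): a fence placed after the store
  // so that later reads of out[0] are ordered after this write.
  atomic_thread_fence(memory_order_seq_cst);
}
```

On the host, the caller simply reads `out[0]` after the call returns. On the device, per the description above, the fence is what keeps the caller's loads from the output buffer from being rewritten away after inlining.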
Showing 7 changed files with 79 additions and 53 deletions.