Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move GPU ukernel selection to KernelConfig #19440

Merged
merged 3 commits into from
Dec 15, 2024
Merged

Conversation

bjacob
Copy link
Contributor

@bjacob bjacob commented Dec 10, 2024

This moves the logic deciding whether an op should be a ukernel out of the GPULowerToUKernels pass, into KernelConfig.

So KernelConfig decides whether the op should be a ukernel, and encodes that into the resulting lowering_config, in a new parameter, that is a new attribute, UKernelSpecAttr. That attribute is directly modeled after the equivalent C++ data structure that we have had in LowerToUKernels passes, FnNameAndDefAttrs, which it replaces. If the attribute is present, it means that the op was selected for ukernel lowering, with the fields telling the ukernel name and some function definition attributes (to import any dependencies, such as the rocm module for runtime support symbols).

All the details about supplying the ukernel bitcode in a hal.executable.object are also moved there, becoming a side effect of KernelConfig.

The GPULowerToUKernels becomes much simpler, since all the decision-making was already done for it. It just looks at the LoweringConfigAttr and if it's there, it performs the requested lowering.

The motivation for this split is that we need to know in KernelConfig whether it's going to be a ukernel, because ops that will get lowered to a ukernel require a different configuration. The important example for us is multi_mma, which in the ukernel case needs to avoid reduction-dimension tiling to 1 so that the ukernel gets to see the reduction loop.

A few simplifications arise already in the current argmax ukernel logic, confirming that this was the right design choice: the old ukernel's matching logic was checking that the distribution tile sizes matched what the ukernel could handle; now that is turned upside down: the ukernel matching happens as a helper within KernelConfig where we know we are setting the appropriate tile sizes on purpose.

Another nice improvement is that this puts just enough distance between ukernel selection (which creates the hal.executable.object) and ukernel lowering, that we are able to insert HoistExecutableObjectsPass in between, simplifying the ukernel lowering as it doesn't need to worry anymore about preserving the hal.executable.object.

@bjacob bjacob changed the title GPUSelectUKernelsPass Move GPU ukernel selection to KernelConfig Dec 12, 2024
@bjacob bjacob marked this pull request as ready for review December 12, 2024 21:55
@hanhanW hanhanW self-requested a review December 13, 2024 03:07
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
Copy link
Member

@kuhar kuhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialization/lookup logic looks good to me.

Why do we use iree_codegen.lowering_config instead of iree_gpu.lowering_config?

@bjacob
Copy link
Contributor Author

bjacob commented Dec 14, 2024

The serialization/lookup logic looks good to me.

Why do we use iree_codegen.lowering_config instead of iree_gpu.lowering_config?

good idea, done.

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
@bjacob bjacob requested a review from kuhar December 14, 2024 03:22
Copy link
Member

@kuhar kuhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few suggestions

MLIRContext *context = op->getContext();
Builder b(context);
SmallVector<NamedAttribute, 2> attrs;
attrs.emplace_back(StringAttr::get(context, "workgroup"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, we have a helper that map tiling level enums to the exact string. This should be make future refactoring easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be nice, but I'm tempted to leave this out (I couldn't immediately find the helper, and the VPN is disconnecting every minute...)


namespace {

static StringLiteral executableObjectsAttrName = "hal.executable.objects";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static StringLiteral executableObjectsAttrName = "hal.executable.objects";
constexpr StringLiteral executableObjectsAttrName = "hal.executable.objects";

constexpr in global scope already implies static

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Note, I realized that it is nontrivial whether StringLiteral is a good thing to use here. As this constructor takes the address of a string literal, it could in principle cause a relocation, or whatever is responsible, at DSO load time, of computing the actual pointer value, as in https://glandium.org/blog/?p=2361 . I suppose that either static or constexpr allows the compiler to reason that the string doesn't escape, making this a non-issue when all the uses are evaluated at compile-time. But here, it's not clear that they are, this depends on the functions that we call, all getting inlined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to design a StringLiteral class that doesn't have this problem. It's just that it needs to store the const char[N] by value, and be templated in N, instead of storing a (inheriting from) StringRef.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that static constexpr always meant no dynamic initialization?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very much the intended usecase for StringLiteral -- constant names that eventually get passed as StringRef (without the need of ever calling strlen).

Copy link
Contributor Author

@bjacob bjacob Dec 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever address-of is "constexpr", there is, one way or another, a hidden caveat. It simply isn't possible to tell the address at compile time. What can be known at compile time is the offset at which something is allocated relatively to the start of a segment, but the segment base address won't be known until the DSO is loaded. So at runtime, once that is known, the base+offset addition can be computed, and its result written as the actual initializer for such pointer constants. IIUC that is what toolchain experts call a "relocation", but I am talking out of my depth here, so please feel free to point out any misunderstanding.

Avoiding strlen only matters if the strlen can't be constant evaluated. When the string is a char char [], it can be const evaluated.

I looked up the Phabricator review that introduced StringLiteral. The stated motivation is to avoid constructors (ie strlen) in arrays of string literals. That almost makes sense: the idea seems to be unless they all have the same length, strings in a plain array of string literals can't be stored by value, so one is led to storing an array of char pointers, and then strlen can't constant evaluate anymore. What I think that is missing, is that one could simply have an array of StringRef! At least in sufficiently recent compilers, the usual StringRef constructor would consteval the strlen. It's possible that 8 years ago when StringLiteral was introduced, it didn't. I think StringLiteral is just a historical workaround for that. Instead of subclassing StringRef with a template constructor, it should just be a template constructor on StringRef. And then on modern compilers we would see that it doesn't matter anymore which constructor is used.

Anyway, the rationale for StingLiteral is specific to arrays of StringLiteral. It isn't applying here, as we have a single string.

Also, an array of StringLiteral only solves the constructors problem but leaves the (less severe) relocation problem. A solution for arrays of string literals that would not have relocations would be a helper to store the string literals as (constexpr evaluated) concatenated string plus a "TOC" array of offsets.

Copy link
Contributor Author

@bjacob bjacob Dec 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried all possible combinations of storage class and type for this executableObjectsAttrName,

{static, constexpr, none} x {StringLiteral, StringRef, const char[], const char*}

and that makes no difference to the number of relocations in libIREECompiler.so. Probably because regardless of storage class, the uses within the same translation unit get rewritten to just the initializer value. It's UB anyway to mutate a static-duration const object, so the rewrite is legal.

Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
@bjacob bjacob merged commit dc29ee7 into iree-org:main Dec 15, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants