
HIP: add doc on small default launch bounds #11619

Closed

Conversation

@fxzjshm (Contributor) commented Feb 3, 2025

Related: #10610

See #10610 (comment)

Related: ggml-org#10610
Signed-off-by: fxzjshm <fxzjshm@163.com>
github-actions bot added the documentation label Feb 3, 2025
@IMbackK (Collaborator) commented Feb 3, 2025

No, I don't get why this would happen. How did the launch bounds end up at 256? The max/default value for AMD targets is 1024. The only way you would end up in this situation is if you somehow set --gpu-max-threads-per-block=256 right before setting it back to 1024.
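
For reference, one way to check what thread limit a compiler actually baked into a kernel is to query it at runtime; a minimal sketch, assuming hipFuncGetAttributes reports the per-kernel maxThreadsPerBlock (i.e. the effective launch bounds) on the toolchain in question:

// Hypothetical probe, not part of this PR: query the thread limit the compiler
// baked into a kernel. On recent toolchains the default is expected to be 1024;
// the old fork discussed here apparently defaults to 256.
#include <hip/hip_runtime.h>
#include <stdio.h>

__global__ void probe() {}

int main() {
    hipFuncAttributes attr;
    hipError_t err = hipFuncGetAttributes(&attr, reinterpret_cast<const void *>(probe));
    if (err != hipSuccess) {
        printf("hipFuncGetAttributes failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    // maxThreadsPerBlock is the per-kernel compiled limit, i.e. the effective launch bounds.
    printf("compiled maxThreadsPerBlock = %d\n", attr.maxThreadsPerBlock);
    return 0;
}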

@fxzjshm (Contributor, Author) commented Feb 3, 2025

The DCU SDK is not from AMD, so its default behavior is different.
Unfortunately, little information is available because of the vendor's policy.

I've checked with this simple vector add:

vecadd.cpp

#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define page_size  1024
#define threads_per_block 1024


void hipCheck(hipError_t err, const char *file, int line) {
    if (err != hipSuccess) {
        printf("%s:%d: ", file, line);
        printf("HIP error: %s\n", hipGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    hipError_t err;

    hipDeviceProp_t devProp;
    hipGetDeviceProperties(&devProp, 0);
    printf(" agent prop name %s\n", devProp.name);

    float *A = (float *)malloc(page_size * sizeof(float));
    float *B = (float *)malloc(page_size * sizeof(float));
    float *C = (float *)malloc(page_size * sizeof(float));

    for (int i = 0; i < page_size; i++) {
        A[i] = 1;
        B[i] = 1;
        C[i] = 0;
    }

    float *a_d, *b_d, *c_d;
    err = hipMalloc((void **)&a_d, page_size * sizeof(float));
    hipCheck(err, __FILE__, __LINE__);
    err = hipMalloc((void **)&b_d, page_size * sizeof(float));
    hipCheck(err, __FILE__, __LINE__);
    err = hipMalloc((void **)&c_d, page_size * sizeof(float));
    hipCheck(err, __FILE__, __LINE__);

    err = hipMemcpy(a_d, A, page_size * sizeof(float), hipMemcpyHostToDevice);
    hipCheck(err, __FILE__, __LINE__);
    err = hipMemcpy(b_d, B, page_size * sizeof(float), hipMemcpyHostToDevice);
    hipCheck(err, __FILE__, __LINE__);
    err = hipMemcpy(c_d, C, page_size * sizeof(float), hipMemcpyHostToDevice);
    hipCheck(err, __FILE__, __LINE__);

    int blocks_per_grid = page_size / threads_per_block;
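    // Note: 1024 threads per block is above a 256-thread default launch bound,
    // which is what triggers the launch-bounds warning shown below on the DCU toolchain.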
    vectorAdd<<<blocks_per_grid, threads_per_block>>>(a_d, b_d, c_d, page_size);
    hipDeviceSynchronize();

    err = hipGetLastError();
    hipCheck(err, __FILE__, __LINE__);

    err = hipMemcpy(C, c_d, page_size * sizeof(float), hipMemcpyDeviceToHost);
    hipCheck(err, __FILE__, __LINE__);

    for (int i = 0; i < page_size; i++) {
        if (C[i] != 2.0) {
            printf("Error: C[%d] = %f instead of 2.0\n", i, C[i]);
            return EXIT_FAILURE;
        }
    }

    printf("correct\n");
    return EXIT_SUCCESS;

    free(A);
    free(B);
    free(C);
    hipFree(a_d);
    hipFree(b_d);
    hipFree(c_d);
}

Compile with no other flags:

hipcc vecadd.cpp -o vecadd

On this gfx906 variant:

 agent prop name [REMOVED]
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _Z9vectorAddPfS_S_i please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program ! 
correct

On gfx1100:

 agent prop name Radeon RX 7900 XTX
correct

If compiled with the max threads per block explicitly set to 1024:

hipcc vecadd.cpp -o vecadd --gpu-max-threads-per-block=1024

then

 agent prop name [REMOVED]
correct
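
For reference, the per-kernel alternative that the error message suggests would look roughly like this (just a sketch, not something this PR changes): annotate the kernel with __launch_bounds__ so the compiled limit matches the launch configuration, instead of passing --gpu-max-threads-per-block at compile time.

// Hypothetical variant of the kernel above with explicit launch bounds,
// as suggested by the runtime warning.
__global__ void __launch_bounds__(1024) vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}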

@IMbackK (Collaborator) commented Feb 3, 2025

So I guess this is about the Haiguang DCU Z100L, which is just a rebranded MI50 shipped in servers with Haiguang's licensed Zen 1 CPUs.

In the very distant past the default launch bounds in LLVM were 256. It would appear that your "DCU SDK", whatever that is, contains a very, very old version of LLVM. I would suggest you stop hurting yourself and just use upstream LLVM.

Beyond that, we could set gpu-max-threads-per-block=1024 in our CMake files for HIP, as this won't have any effect on upstream LLVM but would fix your case.

Generally I don't really feel like catering to out-of-date proprietary forks of LLVM.

@fxzjshm (Contributor, Author) commented Feb 3, 2025

  1. Unfortunately, upstream LLVM doesn't support the product well; I've tried, but had no luck. I've asked them when they will merge their work into upstream LLVM, but they refused, just like many other companies I've been in touch with. Sad.
  2. This SDK is the most recent version I've found, and the hipcc from it claims to be Clang 15. Not sure why the max threads is still 256.

fxzjshm added a commit to fxzjshm/llama.cpp that referenced this pull request Feb 3, 2025
Some old compilers still use 256. Explicitly set it to 1024 to get correct
result from ops like ARGMAX and GROUP_NORM.

Related: ggml-org#10610, ggml-org#11619
Signed-off-by: fxzjshm <fxzjshm@163.com>
@IMbackK closed this Feb 3, 2025
fxzjshm added a commit to fxzjshm/llama.cpp that referenced this pull request Feb 4, 2025
Some old compilers still use 256. Explicitly set it to 1024 to get correct
result from ops like ARGMAX and GROUP_NORM.

Related: ggml-org#10610, ggml-org#11619
Signed-off-by: fxzjshm <fxzjshm@163.com>