
HIP: add doc on small default launch bounds #11619

Closed

Conversation

@fxzjshm (Contributor) commented Feb 3, 2025

Related: #10610

See #10610 (comment)

Related: ggml-org#10610
Signed-off-by: fxzjshm <fxzjshm@163.com>
github-actions bot added the documentation label Feb 3, 2025
@IMbackK (Collaborator) commented Feb 3, 2025

No, I don't get why this would happen. How did the launch bounds end up at 256? The max/default value for AMD targets is 1024. The only way you would end up in this situation is if you somehow set --gpu-max-threads-per-block=256 right before setting it back to 1024.
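
For reference, one way to check what thread limit a compiler actually baked into a kernel is to query it at runtime; a minimal sketch, assuming hipFuncGetAttributes reports the per-kernel maxThreadsPerBlock (i.e. the effective launch bounds) on the toolchain in question:

// Hypothetical probe, not part of this PR: query the thread limit the compiler
// baked into a kernel. On recent toolchains the default is expected to be 1024;
// the old fork discussed here apparently defaults to 256.
#include <hip/hip_runtime.h>
#include <stdio.h>

__global__ void probe() {}

int main() {
    hipFuncAttributes attr;
    hipError_t err = hipFuncGetAttributes(&attr, reinterpret_cast<const void *>(probe));
    if (err != hipSuccess) {
        printf("hipFuncGetAttributes failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    // maxThreadsPerBlock is the per-kernel compiled limit, i.e. the effective launch bounds.
    printf("compiled maxThreadsPerBlock = %d\n", attr.maxThreadsPerBlock);
    return 0;
}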

@fxzjshm (Contributor, Author) commented Feb 3, 2025

The DCU SDK is not from AMD, so its default behavior is different.
Unfortunately, little information is available because of the vendor's policy.

I've checked with this simple vector add:

vecadd.cpp

#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define page_size  1024
#define threads_per_block 1024


void hipCheck(hipError_t err, const char *file, int line) {
    if (err != hipSuccess) {
        printf("%s:%d: ", file, line);
        printf("HIP error: %s\n", hipGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    hipError_t err;

    hipDeviceProp_t devProp;
    hipGetDeviceProperties(&devProp, 0);
    printf(" agent prop name %s\n", devProp.name);

    float *A = (float *)malloc(page_size * sizeof(float));
    float *B = (float *)malloc(page_size * sizeof(float));
    float *C = (float *)malloc(page_size * sizeof(float));

    for (int i = 0; i < page_size; i++) {
        A[i] = 1;
        B[i] = 1;
        C[i] = 0;
    }

    float *a_d, *b_d, *c_d;
    err = hipMalloc((void **)&a_d, page_size * sizeof(float));
    hipCheck(err, __FILE__, __LINE__);
    err = hipMalloc((void **)&b_d, page_size * sizeof(float));
    hipCheck(err, __FILE__, __LINE__);
    err = hipMalloc((void **)&c_d, page_size * sizeof(float));
    hipCheck(err, __FILE__, __LINE__);

    err = hipMemcpy(a_d, A, page_size * sizeof(float), hipMemcpyHostToDevice);
    hipCheck(err, __FILE__, __LINE__);
    err = hipMemcpy(b_d, B, page_size * sizeof(float), hipMemcpyHostToDevice);
    hipCheck(err, __FILE__, __LINE__);
    err = hipMemcpy(c_d, C, page_size * sizeof(float), hipMemcpyHostToDevice);
    hipCheck(err, __FILE__, __LINE__);

    int blocks_per_grid = page_size / threads_per_block;
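    // Note: 1024 threads per block is above a 256-thread default launch bound,
    // which is what triggers the launch-bounds warning shown below on the DCU toolchain.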
    vectorAdd<<<blocks_per_grid, threads_per_block>>>(a_d, b_d, c_d, page_size);
    hipDeviceSynchronize();

    err = hipGetLastError();
    hipCheck(err, __FILE__, __LINE__);

    err = hipMemcpy(C, c_d, page_size * sizeof(float), hipMemcpyDeviceToHost);
    hipCheck(err, __FILE__, __LINE__);

    for (int i = 0; i < page_size; i++) {
        if (C[i] != 2.0) {
            printf("Error: C[%d] = %f instead of 2.0\n", i, C[i]);
            return EXIT_FAILURE;
        }
    }

    printf("correct\n");
    return EXIT_SUCCESS;

    free(A);
    free(B);
    free(C);
    hipFree(a_d);
    hipFree(b_d);
    hipFree(c_d);
}

Compile with no other flags:

hipcc vecadd.cpp -o vecadd

On this gfx906 variant:

 agent prop name [REMOVED]
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _Z9vectorAddPfS_S_i please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program ! 
correct

On gfx1100:

 agent prop name Radeon RX 7900 XTX
correct

If compiled with the max threads per block explicitly set to 1024:

hipcc vecadd.cpp -o vecadd --gpu-max-threads-per-block=1024

then

 agent prop name [REMOVED]
correct
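
For reference, the per-kernel alternative that the error message suggests would look roughly like this (just a sketch, not something this PR changes): annotate the kernel with __launch_bounds__ so the compiled limit matches the launch configuration, instead of passing --gpu-max-threads-per-block at compile time.

// Hypothetical variant of the kernel above with explicit launch bounds,
// as suggested by the runtime warning.
__global__ void __launch_bounds__(1024) vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}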

@IMbackK (Collaborator) commented Feb 3, 2025

So I guess this is about the Haiguang DCU Z100L, which is just a rebranded MI50 shipped in servers with Haiguang's licensed Zen 1 CPUs.

In the very distant past the default launch bounds in LLVM were 256. It would appear that your "DCU SDK", whatever that is, contains a very, very old version of LLVM. I would suggest you stop hurting yourself and just use upstream LLVM.

Beyond that, we could set gpu-max-threads-per-block=1024 in our CMake files for HIP, as this won't have any effect on upstream LLVM but would fix your case.

Generally I don't really feel like catering to out-of-date proprietary forks of LLVM.

@fxzjshm (Contributor, Author) commented Feb 3, 2025

  1. Unfortunately, upstream LLVM doesn't support the product well; I've tried, but had no luck. I've asked them when they will merge their work into upstream LLVM, but they refused, just like many other companies I've been in touch with. Sad.
  2. This SDK is the most recent version I've found, and the hipcc from it claims to be Clang 15. Not sure why the max threads is still 256.

fxzjshm added a commit to fxzjshm/llama.cpp that referenced this pull request Feb 3, 2025
Some old compilers still use 256. Explicitly set it to 1024 to get correct
result from ops like ARGMAX and GROUP_NORM.

Related: ggml-org#10610, ggml-org#11619
Signed-off-by: fxzjshm <fxzjshm@163.com>
@IMbackK closed this Feb 3, 2025
fxzjshm added a commit to fxzjshm/llama.cpp that referenced this pull request Feb 4, 2025
Some old compilers still use 256. Explicitly set it to 1024 to get correct
result from ops like ARGMAX and GROUP_NORM.

Related: ggml-org#10610, ggml-org#11619
Signed-off-by: fxzjshm <fxzjshm@163.com>