-
Notifications
You must be signed in to change notification settings - Fork 11k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIP: add doc on small default launch bounds #11619
Conversation
Related: ggml-org#10610 Signed-off-by: fxzjshm <fxzjshm@163.com>
No, I dont get why this would happen, how did the launch bounds end up 256? the max/default value on for amd targets is 1024. The only way you would end up in this situation is if you somehow set --gpu-max-threads-per-block=256 right before setting it back to 1024 |
DCU SDK is not from AMD so default behavior is different. I've checked with this simple vector add: vecadd.cpp
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define page_size 1024
#define threads_per_block 1024
void hipCheck(hipError_t err, const char *file, int line) {
if (err != hipSuccess) {
printf("%s:%d: ", file, line);
printf("HIP error: %s\n", hipGetErrorString(err));
exit(EXIT_FAILURE);
}
}
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
int main() {
hipError_t err;
hipDeviceProp_t devProp;
hipGetDeviceProperties(&devProp, 0);
printf(" agent prop name %s\n", devProp.name);
float *A = (float *)malloc(page_size * sizeof(float));
float *B = (float *)malloc(page_size * sizeof(float));
float *C = (float *)malloc(page_size * sizeof(float));
for (int i = 0; i < page_size; i++) {
A[i] = 1;
B[i] = 1;
C[i] = 0;
}
float *a_d, *b_d, *c_d;
err = hipMalloc((void **)&a_d, page_size * sizeof(float));
hipCheck(err, __FILE__, __LINE__);
err = hipMalloc((void **)&b_d, page_size * sizeof(float));
hipCheck(err, __FILE__, __LINE__);
err = hipMalloc((void **)&c_d, page_size * sizeof(float));
hipCheck(err, __FILE__, __LINE__);
err = hipMemcpy(a_d, A, page_size * sizeof(float), hipMemcpyHostToDevice);
hipCheck(err, __FILE__, __LINE__);
err = hipMemcpy(b_d, B, page_size * sizeof(float), hipMemcpyHostToDevice);
hipCheck(err, __FILE__, __LINE__);
err = hipMemcpy(c_d, C, page_size * sizeof(float), hipMemcpyHostToDevice);
hipCheck(err, __FILE__, __LINE__);
int blocks_per_grid = page_size / threads_per_block;
vectorAdd<<<blocks_per_grid, threads_per_block>>>(a_d, b_d, c_d, page_size);
hipDeviceSynchronize();
err = hipGetLastError();
hipCheck(err, __FILE__, __LINE__);
err = hipMemcpy(C, c_d, page_size * sizeof(float), hipMemcpyDeviceToHost);
hipCheck(err, __FILE__, __LINE__);
for (int i = 0; i < page_size; i++) {
if (C[i] != 2.0) {
printf("Error: C[%d] = %f instead of 2.0\n", i, C[i]);
return EXIT_FAILURE;
}
}
printf("correct\n");
return EXIT_SUCCESS;
free(A);
free(B);
free(C);
hipFree(a_d);
hipFree(b_d);
hipFree(c_d);
} Compile with no other flags: hipcc vecadd.cpp -o vecadd On this gfx906 variant:
On gfx1100:
If compile with explicitly setting max thread to 1024: hipcc vecadd.cpp -o vecadd --gpu-max-threads-per-block=1024 then
|
So i gues this is about the Haiguang DCU Z100L, which is just a rebranded mi50 to be shipped in servers with the Haiguang built licensed zen1 cpus. So in the very distant past the default launch bounds in llvm where 256. It would appear like your "DCU SDK" whatever that is contains a very very old version of llvm. I would suggest you stop hurting yourself and just use upstream llvm. Beyond that we could set gpu-max-threads-per-block=1024 in our cmake files for hip, as this wont have any effect on upstream llvm, but would fix your case. Generally i dont really feal like catering to out of date proprietary forks of llvm. |
|
Some old compilers still use 256. Explicitly set it to 1024 to get correct result from ops like ARGMAX and GROUP_NORM. Related: ggml-org#10610, ggml-org#11619 Signed-off-by: fxzjshm <fxzjshm@163.com>
Some old compilers still use 256. Explicitly set it to 1024 to get correct result from ops like ARGMAX and GROUP_NORM. Related: ggml-org#10610, ggml-org#11619 Signed-off-by: fxzjshm <fxzjshm@163.com>
Related: #10610
See #10610 (comment)