CUDA: backwards pass for misc. ops, add tests #11257
Conversation
ggml/src/ggml-cuda/unary.cu (outdated)

@@ -20,7 +20,7 @@ static __global__ void step_f32(const float * x, float * dst, const int k) {
     dst[i] = x[i] > 0.0f;
 }

-static __global__ void gelu_f32(const float * x, float * dst, const int k) {
+static __global__ void gelu_f32(const float * __restrict__ x, float * __restrict__ dst, const int k) {
I don't think that restrict is correct in these functions, since they can run in-place, which would make both pointers point to the same address.
ggml ops can only be safely run in-place if both the source and destination pointers have the same memory layout. Otherwise you would run into issues with race conditions. The use of restrict allows the compiler to reorder memory accesses during optimization under the assumption that writes to one pointer do not affect reads from another pointer. But the compiler can never optimize the code to write the result before it reads the inputs. So the use of restrict is safe with in-place ops.
My opinion is that the use of restrict should be the default. I don't feel strongly about this though and would also be fine with making the use of restrict the exception for cases where it can be demonstrated to be safe and useful.
I agree that restrict is not likely to cause issues here, since as you say, there is nothing the compiler can do to optimize this. That still doesn't make it correct, only accidentally correct.
I disagree, but I removed the use of restrict on all ops that can be used in-place, either explicitly through the ggml.h API or implicitly through ggml_op_can_inplace. On a somewhat related note, are there plans to eventually remove the _inplace functions in ggml.h and to have the logic regarding in-place operations be handled completely automatically?
In ggml-alloc, in-place is used to save memory, but in-place operations can also be useful to modify a static tensor. For example, in llama.cpp an _inplace ROPE is used to perform the context shifts. Without _inplace, this would require a normal ROPE followed by a CPY. In principle it would be possible to optimize away unnecessary CPY operations by making the previous operation in-place automatically, but that's not implemented and probably would not be worth the complexity. So at the moment _inplace operations still have a use.
float max_bias = 0.0f;

memcpy(&max_bias, (const float *) op->op_params + 1, sizeof(float));
At some point we should make the op params get/set more strongly typed (e.g. via get/set functions for each one). Just an idea, no action required for this PR.
I think the correct way to do this would be to define per-op structs for the op params.
const int nbytes_shared = ne00*sizeof(float);

const int id = ggml_cuda_get_device();
const int smpbo = ggml_cuda_info().devices[id].smpbo;
Suggested change:

-const int nbytes_shared = ne00*sizeof(float);
-const int id = ggml_cuda_get_device();
-const int smpbo = ggml_cuda_info().devices[id].smpbo;
+const size_t nbytes_shared = ne00*sizeof(float);
+const int id = ggml_cuda_get_device();
+const size_t smpbo = ggml_cuda_info().devices[id].smpbo;
const int nbytes_shared = ne00*sizeof(float);

const int id = ggml_cuda_get_device();
const int smpbo = ggml_cuda_info().devices[id].smpbo;
Suggested change:

-const int nbytes_shared = ne00*sizeof(float);
-const int id = ggml_cuda_get_device();
-const int smpbo = ggml_cuda_info().devices[id].smpbo;
+const size_t nbytes_shared = ne00*sizeof(float);
+const int id = ggml_cuda_get_device();
+const size_t smpbo = ggml_cuda_info().devices[id].smpbo;
ggml/src/ggml-cuda/norm.cu (outdated)

grad += row*ncols;
xf   += row*ncols;
dst  += row*ncols;
Is there risk of integer overflow here? Maybe to be safe:
Suggested change:

-grad += row*ncols;
-xf   += row*ncols;
-dst  += row*ncols;
+grad += int64_t(row)*ncols;
+xf   += int64_t(row)*ncols;
+dst  += int64_t(row)*ncols;

Same comment in a few other places where we multiply row*ncols.
List of changes:
- SOFT_MAX: fixed the backward pass to be equivalent to the forward pass. Support for scale is implemented, support for max_bias is not implemented. There is no need for mask since it does not affect the gradients of the forward pass input beyond its effects on the forward pass output. Added CUDA support for the backward pass.
- CROSS_ENTROPY_LOSS: fixed the op not working correctly for large vocabulary sizes.
- OUT_PROD: extended to support the backward pass for GQA.
- GET_ROWS: added a CUDA backward pass.
- SILU: added a CUDA backward pass.
- RMS_NORM: added a CUDA backward pass.
- test-backend-ops: extended to assert the consistency of the CPU and CUDA implementations.