CUDA: backwards pass for misc. ops, add tests #11257

Merged: 4 commits, Jan 16, 2025

Conversation

JohannesGaessler (Collaborator)

List of changes:

  • Extended the interface of the SOFT_MAX backward pass to match the forward pass. Support for scale is implemented; support for max_bias is not. There is no need to pass the mask, since it affects the gradients of the forward-pass input only through the forward-pass output (see the short derivation after this list). Added CUDA support for the backward pass.
  • Fixed the CUDA implementation of CROSS_ENTROPY_LOSS, which did not work correctly for large vocabulary sizes.
  • Implemented broadcasting of dimensions 2 and 3 for OUT_PROD to support the backward pass for GQA.
  • Added CUDA support for the GET_ROWS backward pass.
  • Added CUDA support for the SILU backward pass.
  • Added CUDA support for the RMS_NORM backward pass.
  • Added tests to test-backend-ops to assert the consistency of the CPU and CUDA implementations.
  • Refactored the dedicated functions for backward pass construction to always have the gradients as the first argument.
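
For reference, the standard softmax gradient makes the point about the mask in the first bullet explicit (this derivation is not part of the PR text): with $y = \operatorname{softmax}(s\,x + m)$ computed row-wise, where $s$ is the scale and $m$ the additive mask, the chain rule gives

$$\frac{\partial L}{\partial x_i} = s \, y_i \left( \frac{\partial L}{\partial y_i} - \sum_j \frac{\partial L}{\partial y_j} \, y_j \right),$$

so the mask influences the gradient only through the forward output $y$, which the backward pass already receives; the kernel therefore needs $y$, $\partial L/\partial y$, and the scale, but not the mask tensor itself.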

github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Jan 15, 2025.
@@ -20,7 +20,7 @@ static __global__ void step_f32(const float * x, float * dst, const int k) {
     dst[i] = x[i] > 0.0f;
 }
 
-static __global__ void gelu_f32(const float * x, float * dst, const int k) {
+static __global__ void gelu_f32(const float * __restrict__ x, float * __restrict__ dst, const int k) {
slaren (Collaborator), Jan 16, 2025

I don't think that restrict is correct in these functions, since they can run in-place, which would make both pointers point to the same address.

JohannesGaessler (Collaborator, Author)

ggml ops can only be safely run in-place if both the source and destination pointers have the same memory layout. Otherwise you would run into issues with race conditions. The use of restrict allows the compiler to reorder memory accesses during optimization under the assumption that writes to one pointer do not affect reads from another pointer. But the compiler can never optimize the code to write the result before it reads the inputs. So the use of restrict is safe with in-place ops.

My opinion is that the use of restrict should be the default. I don't feel strongly about this though and would also be fine with making the use of restrict the exception for cases where it can be demonstrated to be safe and useful.
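
To make the scenario concrete, here is a minimal, self-contained sketch of an elementwise op launched in-place with __restrict__ on both pointers (a hypothetical scale_f32 kernel, not code from this PR):

// elementwise kernel with __restrict__ on both pointers
#include <cuda_runtime.h>

static __global__ void scale_f32(const float * __restrict__ x, float * __restrict__ dst, const int k) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    // each thread reads x[i] before it writes dst[i] for the same element,
    // which is the ordering the in-place argument above relies on
    dst[i] = 2.0f*x[i];
}

int main() {
    const int k = 1024;
    float * d = nullptr;
    cudaMalloc((void **) &d, k*sizeof(float));
    cudaMemset(d, 0, k*sizeof(float));
    scale_f32<<<(k + 255)/256, 256>>>(d, d, k); // same buffer as input and output, i.e. in-place
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}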

slaren (Collaborator), Jan 16, 2025

I agree that restrict is not likely to cause issues here, since as you say, there is nothing the compiler can do to optimize this. That still doesn't make it correct, only accidentally correct.

JohannesGaessler (Collaborator, Author)

I disagree but I removed the use of restrict on all ops that can be used in-place, either explicitly through the ggml.h API or implicitly through ggml_op_can_inplace. On a somewhat related note, are there plans to eventually remove the _inplace functions in ggml.h and to have the logic regarding in-place operations be handled completely automatically?

slaren (Collaborator), Jan 16, 2025

In ggml-alloc, in-place is used to save memory, but in-place operations can also be useful to modify a static tensor. For example, in llama.cpp an _inplace ROPE is used to perform the context shifts. Without _inplace, this would require a normal ROPE followed by a CPY. In principle it would be possible to optimize away unnecessary CPY operations by making the previous operation in-place automatically, but that's not implemented and probably would not be worth the complexity. So at the moment _inplace operations still have a use.
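
For illustration, a rough sketch of the two graph shapes being compared, assuming the ggml_rope/ggml_rope_inplace/ggml_cpy signatures of the ggml API around the time of this PR; the helper names and tensor shapes are placeholders, not llama.cpp code:

#include "ggml.h"

// k_view: a view into the static K cache; pos: I32 tensor holding the shifted positions
static struct ggml_tensor * shift_k_inplace(struct ggml_context * ctx,
        struct ggml_tensor * k_view, struct ggml_tensor * pos, int n_rot, int rope_mode) {
    // the rotated values are written directly into the cache, no intermediate buffer
    return ggml_rope_inplace(ctx, k_view, pos, n_rot, rope_mode);
}

static struct ggml_tensor * shift_k_copy(struct ggml_context * ctx,
        struct ggml_tensor * k_view, struct ggml_tensor * pos, int n_rot, int rope_mode) {
    // without _inplace: rotate into a fresh tensor ...
    struct ggml_tensor * rotated = ggml_rope(ctx, k_view, pos, n_rot, rope_mode);
    // ... and then copy the result back into the static cache tensor
    return ggml_cpy(ctx, rotated, k_view);
}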

Comment on lines +410 to +412
float max_bias = 0.0f;

memcpy(&max_bias, (const float *) op->op_params + 1, sizeof(float));
ggerganov (Owner)

At some point we should make the op params get/set more strongly typed (e.g. via get/set functions for each one). Just an idea, no action required for this PR.

JohannesGaessler (Collaborator, Author)

I think the correct way to do this would be to define per-op structs for the op params.
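
A rough sketch of what such per-op structs with typed accessors could look like (hypothetical names, not existing ggml API; the only assumption about ggml itself is that tensor->op_params is a small fixed-size buffer):

#include <string.h>
#include "ggml.h"

// hypothetical typed params for SOFT_MAX instead of raw index-based memcpy
struct ggml_soft_max_params {
    float scale;
    float max_bias;
};

static void ggml_soft_max_set_params(struct ggml_tensor * t, struct ggml_soft_max_params p) {
    GGML_ASSERT(sizeof(p) <= sizeof(t->op_params)); // the params must fit in the fixed-size buffer
    memcpy(t->op_params, &p, sizeof(p));
}

static struct ggml_soft_max_params ggml_soft_max_get_params(const struct ggml_tensor * t) {
    struct ggml_soft_max_params p;
    memcpy(&p, t->op_params, sizeof(p));
    return p;
}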

Comment on lines 118 to 121
const int nbytes_shared = ne00*sizeof(float);

const int id = ggml_cuda_get_device();
const int smpbo = ggml_cuda_info().devices[id].smpbo;
ggerganov (Owner)

Suggested change
-const int nbytes_shared = ne00*sizeof(float);
+const size_t nbytes_shared = ne00*sizeof(float);
 const int id = ggml_cuda_get_device();
-const int smpbo = ggml_cuda_info().devices[id].smpbo;
+const size_t smpbo = ggml_cuda_info().devices[id].smpbo;

Comment on lines 172 to 175
const int nbytes_shared = ne00*sizeof(float);

const int id = ggml_cuda_get_device();
const int smpbo = ggml_cuda_info().devices[id].smpbo;
ggerganov (Owner)

Suggested change
-const int nbytes_shared = ne00*sizeof(float);
+const size_t nbytes_shared = ne00*sizeof(float);
 const int id = ggml_cuda_get_device();
-const int smpbo = ggml_cuda_info().devices[id].smpbo;
+const size_t smpbo = ggml_cuda_info().devices[id].smpbo;

Comment on lines 143 to 145
grad += row*ncols;
xf += row*ncols;
dst += row*ncols;
ggerganov (Owner)

Is there risk of integer overflow here? Maybe to be safe:

Suggested change
-grad += row*ncols;
-xf += row*ncols;
-dst += row*ncols;
+grad += int64_t(row)*ncols;
+xf += int64_t(row)*ncols;
+dst += int64_t(row)*ncols;

Same comment in a few other places where we multiply row*ncols.
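
For context, a small self-contained illustration of the overflow being guarded against (the numbers are made up for the example, they are not from the PR):

// a 32-bit row*ncols product wraps around, the 64-bit product stays exact
#include <cstdint>
#include <cstdio>

int main() {
    const int row   = 70000; // e.g. a row index in a large batch
    const int ncols = 40000; // e.g. a large vocabulary size

    const int64_t exact   = int64_t(row)*ncols;          // 2'800'000'000, fits in 64 bits
    const int32_t wrapped = (int32_t) (uint32_t) exact;  // what a 32-bit multiply would produce

    printf("64-bit offset: %lld, 32-bit offset: %d\n", (long long) exact, wrapped);
    return 0;
}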

JohannesGaessler merged commit 9c8dcef into ggerganov:master on Jan 16, 2025
48 checks passed