CUDA: backwards pass for misc. ops, add tests #11257
Conversation
ggml/src/ggml-cuda/unary.cu (outdated)

@@ -20,7 +20,7 @@ static __global__ void step_f32(const float * x, float * dst, const int k) {
     dst[i] = x[i] > 0.0f;
 }

-static __global__ void gelu_f32(const float * x, float * dst, const int k) {
+static __global__ void gelu_f32(const float * __restrict__ x, float * __restrict__ dst, const int k) {
I don't think that restrict is correct in these functions, since they can run in-place, which would make both pointers point to the same address.
ggml ops can only be safely run in-place if both the source and destination pointers have the same memory layout. Otherwise you would run into issues with race conditions. The use of restrict allows the compiler to reorder memory accesses during optimization under the assumption that writes to one pointer do not affect reads from another pointer. But the compiler can never optimize the code to write the result before it reads the inputs. So the use of restrict is safe with in-place ops.
My opinion is that the use of restrict should be the default. I don't feel strongly about this though and would also be fine with making the use of restrict the exception for cases where it can be demonstrated to be safe and useful.
I agree that restrict is not likely to cause issues here, since as you say, there is nothing the compiler can do to optimize this. That still doesn't make it correct, only accidentally correct.
I disagree, but I removed the use of restrict on all ops that can be used in-place, either explicitly through the ggml.h API or implicitly through ggml_op_can_inplace. On a somewhat related note, are there plans to eventually remove the _inplace functions in ggml.h and to have the logic regarding in-place operations be handled completely automatically?
In ggml-alloc, in-place is used to save memory, but in-place operations can also be useful to modify a static tensor. For example, in llama.cpp an _inplace ROPE is used to perform the context shifts. Without _inplace, this would require a normal ROPE followed by a CPY. In principle it would be possible to optimize away unnecessary CPY operations by making the previous operation in-place automatically, but that's not implemented and probably would not be worth the complexity. So at the moment _inplace operations still have a use.
float max_bias = 0.0f;

memcpy(&max_bias, (const float *) op->op_params + 1, sizeof(float));
At some point we should make the op params get/set more strongly typed (e.g. via get/set functions for each one). Just an idea, no action required for this PR.
I think the correct way to do this would be to define per-op structs for the op params.
const int nbytes_shared = ne00*sizeof(float);

const int id = ggml_cuda_get_device();
const int smpbo = ggml_cuda_info().devices[id].smpbo;
Suggested change:

-const int nbytes_shared = ne00*sizeof(float);
-const int id = ggml_cuda_get_device();
-const int smpbo = ggml_cuda_info().devices[id].smpbo;
+const size_t nbytes_shared = ne00*sizeof(float);
+const int id = ggml_cuda_get_device();
+const size_t smpbo = ggml_cuda_info().devices[id].smpbo;
const int nbytes_shared = ne00*sizeof(float);

const int id = ggml_cuda_get_device();
const int smpbo = ggml_cuda_info().devices[id].smpbo;
Suggested change:

-const int nbytes_shared = ne00*sizeof(float);
-const int id = ggml_cuda_get_device();
-const int smpbo = ggml_cuda_info().devices[id].smpbo;
+const size_t nbytes_shared = ne00*sizeof(float);
+const int id = ggml_cuda_get_device();
+const size_t smpbo = ggml_cuda_info().devices[id].smpbo;
ggml/src/ggml-cuda/norm.cu (outdated)

grad += row*ncols;
xf   += row*ncols;
dst  += row*ncols;
Is there risk of integer overflow here? Maybe to be safe:
Suggested change:

-grad += row*ncols;
-xf   += row*ncols;
-dst  += row*ncols;
+grad += int64_t(row)*ncols;
+xf   += int64_t(row)*ncols;
+dst  += int64_t(row)*ncols;

Same comment in a few other places where we multiply row*ncols.
List of changes:
- SOFT_MAX: fixed the backward pass to be equivalent to the forward pass. Support for scale is implemented, support for max_bias is not implemented. There is no need for mask since it does not affect the gradients of the forward pass input beyond its effects on the forward pass output. Added CUDA support for the backward pass.
- CROSS_ENTROPY_LOSS: fixed the op not working correctly for large vocabulary sizes.
- OUT_PROD: extended to support the backward pass for GQA.
- GET_ROWS: added a CUDA backward pass.
- SILU: added a CUDA backward pass.
- RMS_NORM: added a CUDA backward pass.
- test-backend-ops: extended to assert the consistency of the CPU and CUDA implementations.