
CUDA: fix partial offloading for ne0 % 256 != 0 #8572

Merged 1 commit into ggerganov:master on Jul 18, 2024

Conversation

JohannesGaessler
Collaborator

Fixes #8558 .

The MMVQ and MMQ kernels handle out-of-bounds accesses for ne00/ne10 by padding the last row of src0 and every column of src1 to a multiple of MATRIX_ROW_PADDING. The padding is memset to 0 so that it has no influence on the result. Crucially, both paddings must be zeroed: if a padding value happened to encode NaN or infinity, then by IEEE 754 rules multiplying it by 0 produces NaN and poisons the result.

MMQ currently processes 256 ne00/ne10 values per iteration. GLM 4 has tensors with ne00 % 256 == 128, so for this model the content of the padding is relevant. On master, when using partial offloading, the padding is not zeroed when the data is copied from RAM to VRAM. Therefore, depending on the input parameters, the result can become NaN. This PR adds the missing memset calls.

There is also another, different issue with --split-mode row but so far I have not been able to track it down.

@slaren
Collaborator

slaren commented Jul 18, 2024

I am not sure that this is completely correct. The padding is already cleared in init_tensor. For a tensor allocated in the compute buffer this happens before graph computation, but the problem is that the padding may be overwritten by earlier operations before the tensor is actually used. Clearing the padding in the set_tensor function will not work in every case, because the tensor may have been initialized in a different way (e.g. by a ggml_cpy operation). I think this would work more reliably:

  • Add GGML_BACKEND_BUFFER_USAGE_COMPUTE to ggml_backend_buffer_usage
  • Call ggml_backend_buffer_set_usage with this value in ggml-alloc so that compute buffers can be reliably identified
  • Before computing the mul_mat operation in the CUDA backend, check if src0->buffer->usage == GGML_BACKEND_BUFFER_USAGE_COMPUTE, and if so, clear the padding there.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jul 18, 2024
@JohannesGaessler
Collaborator Author

Thanks for the pointers. I wasn't sure about the correct place to edit ggml-alloc.c; I'm setting the property immediately after allocation. I am also zeroing the memory in ggml_cuda_op_mul_mat instead, because for ne2*ne3 > 1 only the last ne0 x ne1 matrix should be padded.


@slaren slaren left a comment


ggml_backend_cuda_buffer_init_tensor could also be updated to skip the cudaMemset if the tensor is on a compute buffer, it might save a bit of time.

Review comments on ggml/src/ggml-alloc.c and ggml/src/ggml-cuda.cu (outdated, resolved)
@slaren
Collaborator

slaren commented Jul 18, 2024

ggml_backend_cuda_buffer_init_tensor could also be updated to skip the cudaMemset if the tensor is on a compute buffer, it might save a bit of time.

I see that you removed the cudaMemset entirely, but I think this still needs to be done for weights & KV, it can only be skipped if the tensor is on a compute buffer.

@JohannesGaessler JohannesGaessler merged commit a15ef8f into ggerganov:master Jul 18, 2024
53 checks passed
@JohannesGaessler
Collaborator Author

Thanks for all the help.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
@@ -1485,6 +1485,13 @@ static void ggml_cuda_op_mul_mat(
dev[id].src0_dd = dev[id].src0_dd_alloc.alloc(ctx.pool(id), ggml_nbytes(src0));
}

// If src0 is on a temporary compute buffer (partial offloading) there may be some padding that needs to be cleared:
if (ne00 % MATRIX_ROW_PADDING != 0 && ggml_backend_buffer_get_usage(src0->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE && src0->view_src == nullptr) {


I got an illegal argument error with float32; I think a ggml_is_quantized(src0->type) check should be added.

Successfully merging this pull request may close these issues.

Bug: GLM4 9b produces wrong results with partial offloading
3 participants