
CUDA: fix partial offloading for ne0 % 256 != 0 #8572

Merged 1 commit into ggerganov:master on Jul 18, 2024

Conversation

JohannesGaessler
Collaborator

Fixes #8558 .

The MMVQ and MMQ kernels handle out-of-bounds accesses for ne00/ne10 by padding the last row of src0 and every column of src1 to a multiple of MATRIX_ROW_PADDING. The padding is memset to 0 so that it has no influence on the result. Crucially, both paddings must be zeroed: if a padding value happened to encode NaN or infinity, then by IEEE 754 rules multiplying it by 0 produces NaN and poisons the result.

MMQ currently processes 256 ne00/ne10 values per iteration. GLM 4 has tensors with ne00 % 256 == 128, so for this model the content of the padding is relevant. On master, when using partial offloading, the padding is not zeroed when the data is copied from RAM to VRAM. Therefore, depending on the input parameters, the result can become NaN. This PR adds the missing memset calls.

There is also another, different issue with --split-mode row but so far I have not been able to track it down.

@slaren
Collaborator

slaren commented Jul 18, 2024

I am not sure that this is completely correct. The padding is already cleared in init_tensor. For a tensor allocated in the compute buffer this happens before graph computation, but the problem is that the padding may be overwritten by earlier operations before the tensor is actually used. Clearing the padding in the set_tensor function will not work in every case, because the tensor may have been initialized in a different way (e.g. by a ggml_cpy operation). I think this would work more reliably:

  • Add GGML_BACKEND_BUFFER_USAGE_COMPUTE to ggml_backend_buffer_usage
  • Call ggml_backend_buffer_set_usage with this value in ggml-alloc so that compute buffers can be reliably identified
  • Before computing the mul_mat operation in the CUDA backend, check if src0->buffer->usage == GGML_BACKEND_BUFFER_USAGE_COMPUTE, and if so, clear the padding there.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jul 18, 2024
@JohannesGaessler
Collaborator Author

Thanks for the pointers. I wasn't sure about the correct place to edit ggml-alloc.c; I'm setting the property immediately after allocation. I am also zeroing the memory in ggml_cuda_op_mul_mat instead, because for ne2*ne3 > 1 only the last ne0 x ne1 matrix should be padded.


@slaren slaren left a comment


ggml_backend_cuda_buffer_init_tensor could also be updated to skip the cudaMemset if the tensor is on a compute buffer, it might save a bit of time.

Review comments on ggml/src/ggml-alloc.c and ggml/src/ggml-cuda.cu (outdated, resolved)
@slaren
Collaborator

slaren commented Jul 18, 2024

ggml_backend_cuda_buffer_init_tensor could also be updated to skip the cudaMemset if the tensor is on a compute buffer, it might save a bit of time.

I see that you removed the cudaMemset entirely, but I think this still needs to be done for weights & KV, it can only be skipped if the tensor is on a compute buffer.

@JohannesGaessler JohannesGaessler merged commit a15ef8f into ggerganov:master Jul 18, 2024
53 checks passed
@JohannesGaessler
Collaborator Author

Thanks for all the help.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
@@ -1485,6 +1485,13 @@ static void ggml_cuda_op_mul_mat(
dev[id].src0_dd = dev[id].src0_dd_alloc.alloc(ctx.pool(id), ggml_nbytes(src0));
}

// If src0 is on a temporary compute buffer (partial offloading) there may be some padding that needs to be cleared:
if (ne00 % MATRIX_ROW_PADDING != 0 && ggml_backend_buffer_get_usage(src0->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE && src0->view_src == nullptr) {


I got an illegal argument error with float32; I think a ggml_is_quantized(src0->type) check should be added.

Successfully merging this pull request may close these issues.

Bug: GLM4 9b produces wrong results with partial offloading
3 participants