CUDA: fix partial offloading for ne0 % 256 != 0 #8572
Conversation
I am not sure that this is completely correct. The padding is already cleared in
Force-pushed from eb76519 to fcaa220.
Thanks for the pointers. I wasn't sure about the correct way to edit
Force-pushed from fcaa220 to 991696e.
ggml_backend_cuda_buffer_init_tensor could also be updated to skip the cudaMemset if the tensor is on a compute buffer; it might save a bit of time.
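As a rough illustration of that suggestion (a hypothetical sketch, not part of this PR; it assumes the helpers and the CUDA_CHECK macro already present in ggml-cuda.cu):

```cpp
// Hypothetical excerpt of ggml_backend_cuda_buffer_init_tensor: only pre-clear the
// row padding for non-compute (i.e. weight) buffers, since ggml_cuda_op_mul_mat
// would now clear it for temporary compute buffers itself.
if (ggml_is_quantized(tensor->type) && tensor->view_src == nullptr &&
    ggml_backend_buffer_get_usage(buffer) != GGML_BACKEND_BUFFER_USAGE_COMPUTE) {
    const size_t original_size = ggml_nbytes(tensor);
    const size_t padded_size   = ggml_backend_buft_get_alloc_size(buffer->buft, tensor);
    if (padded_size > original_size) {
        // zero the padding behind the tensor data so it cannot affect dot products
        CUDA_CHECK(cudaMemset((char *) tensor->data + original_size, 0, padded_size - original_size));
    }
}
```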
Force-pushed from 991696e to 992d7c4.
I see that you removed the
Force-pushed from 992d7c4 to 8784fcd.
Thanks for all the help.
@@ -1485,6 +1485,13 @@ static void ggml_cuda_op_mul_mat(
        dev[id].src0_dd = dev[id].src0_dd_alloc.alloc(ctx.pool(id), ggml_nbytes(src0));
    }

    // If src0 is on a temporary compute buffer (partial offloading) there may be some padding that needs to be cleared:
    if (ne00 % MATRIX_ROW_PADDING != 0 && ggml_backend_buffer_get_usage(src0->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE && src0->view_src == nullptr) {
Got an illegal argument error with float32; I think ggml_is_quantized(src0->type) should be added to the condition.
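For illustration, this is roughly what the guarded clear could look like with the suggested ggml_is_quantized check folded in (a sketch based on the condition shown in the diff, not the verbatim patch; dev[id].row_low, dev[id].row_high and dev[id].src0_dd come from the existing per-device split bookkeeping in ggml_cuda_op_mul_mat):

```cpp
// Sketch: clear the padding behind the last row of the quantized src0 slice that
// was just copied into this device's temporary compute buffer.
if (ne00 % MATRIX_ROW_PADDING != 0 && ggml_is_quantized(src0->type) &&
    ggml_backend_buffer_get_usage(src0->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE &&
    src0->view_src == nullptr) {
    const size_t nbytes_data    = ggml_row_size(src0->type, (dev[id].row_high - dev[id].row_low)*ne00);
    const size_t nbytes_padding = ggml_row_size(src0->type, MATRIX_ROW_PADDING - ne00 % MATRIX_ROW_PADDING);
    CUDA_CHECK(cudaMemset(dev[id].src0_dd + nbytes_data, 0, nbytes_padding));
}
```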
Fixes #8558.

The MMVQ and MMQ kernels handle out-of-bounds accesses for ne00/ne10 by padding the last row of src0 and every column of src1 to a multiple of MATRIX_ROW_PADDING. The padding is memset to 0 so that it has no influence on the result. Crucially, both paddings have to be memset to 0: if one of the values were to randomly encode NaN or infinity, the result of a floating-point multiplication with 0 would be NaN.

MMQ currently works on 256 ne00/ne10 values per iteration. GLM 4 has tensors with ne00 % 256 == 128, so for this model the content of the padding is relevant. On master, when using partial offloading, the padding is not being zeroed when the data is copied from RAM to VRAM. Therefore, depending on the input parameters, the result can become NaN. This PR adds the missing calls.

There is also another, different issue with --split-mode row, but so far I have not been able to track it down.
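The floating-point rule the description relies on can be checked in isolation; a minimal, self-contained example (not from the PR):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Why the padding on *both* operands must be zeroed: if one operand's padding
    // holds garbage that happens to be NaN or infinity, multiplying it by the other
    // operand's zeroed padding does not give 0 -- it gives NaN, which then poisons
    // the dot product it is accumulated into.
    const float garbage_nan = NAN;
    const float garbage_inf = INFINITY;
    const float zero_pad    = 0.0f;

    std::printf("NaN * 0 = %f\n", garbage_nan * zero_pad); // prints nan
    std::printf("inf * 0 = %f\n", garbage_inf * zero_pad); // prints nan
    return 0;
}
```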