Add ReLU and SQR CUDA ops to fix Persimmon offloading #4041

Merged
merged 2 commits into from
Nov 13, 2023

Collaborator

@KerfuffleV2 KerfuffleV2 commented Nov 11, 2023

See #4038: Persimmon uses ReLU and SQR, but those CUDA ops didn't exist. As a side note, it looks like they already exist in Metal.

This pull adds those ops. That still isn't enough for full offloading: you can offload at most n_layers + 1 layers. So for the 8B model with 36 layers, -ngl 37 works but -ngl 38 does not.
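For readers unfamiliar with the two ops, here is a minimal CPU-side sketch of what they compute per element. The helper names `relu_f32_ref` and `sqr_f32_ref` are made up for illustration and are not the functions added by this pull; a CUDA kernel would typically assign one index `i` of the same loop to each thread.

```cpp
#include <cstddef>

// Hypothetical CPU reference helpers (not the kernels from this pull).
// ReLU: negative inputs become zero, everything else passes through.
static void relu_f32_ref(const float * x, float * dst, std::size_t k) {
    for (std::size_t i = 0; i < k; ++i) {
        dst[i] = x[i] > 0.0f ? x[i] : 0.0f;
    }
}

// SQR: element-wise square.
static void sqr_f32_ref(const float * x, float * dst, std::size_t k) {
    for (std::size_t i = 0; i < k; ++i) {
        dst[i] = x[i] * x[i];
    }
}
```

Both ops are independent per element, which is why they map naturally onto a flat one-dimensional launch.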

edit: The next op it fails on seems to be CPY. Is the solution just to add that as well? Actually, it seems CPY already exists, so the problem must be something else; maybe it's not a combination of tensor types that can be copied.

#3  0x00005555556bf394 in ggml_cuda_cpy (src0=0x7ffada8a1100, src1=0x7ffada8a1280, dst=0x0) at ggml-cuda.cu:7576
7576        GGML_ASSERT(src1->backend == GGML_BACKEND_GPU);
(gdb) p *src0
$1 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_GPU, buffer = 0x555567196b40, n_dims = 4, ne = {64, 64, 2, 3}, nb = {4, 768, 49152, 256}, op = GGML_OP_PERMUTE, op_params = {0, 3, 
    1, 2, 0 <repeats 12 times>}, is_param = false, grad = 0x0, src = {0x7ffada8a0f80, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 1, perf_cycles = 0, perf_time_us = 0, 
  view_src = 0x7ffada8a0e00, view_offs = 0, data = 0x7ffab9610160, name = "tmpqkv-0 (permuted)", '\000' <repeats 44 times>, extra = 0x55556b1e3c40, 
  padding = '\000' <repeats 11 times>}
(gdb) p *src1
$2 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_CPU, buffer = 0x555567196b40, n_dims = 4, ne = {64, 64, 2, 3}, nb = {4, 256, 16384, 32768}, op = GGML_OP_CONT, op_params = {
    0 <repeats 16 times>}, is_param = false, grad = 0x0, src = {0x7ffada8a1100, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, 
  view_offs = 0, data = 0x7ffab9628160, name = "tmpqkv-0\000(permuted) (cont)", '\000' <repeats 37 times>, extra = 0x0, padding = '\000' <repeats 11 times>}

One of the operands is on the CPU. I don't know how to fix that, though. edit: Well, the next issue after that is that CPY only supports up to 3 dimensions, but those tensors are 4D.
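The stride values in the gdb dump can be checked by hand. The sketch below is a simplified F32-only stand-in: `TensorLayout`, `is_contiguous`, and `used_dims` are illustrative names, not the real `ggml_tensor` or ggml helpers. It shows that the permuted source is not densely packed while the CONT destination is, and that both tensors really use all four dimensions.

```cpp
#include <cstddef>
#include <cstdint>

// Minimal stand-in for the ne/nb fields shown in the gdb dump (assumption:
// this is not the real ggml_tensor, only the two fields needed here).
struct TensorLayout {
    std::int64_t ne[4];  // number of elements per dimension
    std::size_t  nb[4];  // stride in bytes per dimension
};

// True when the strides describe densely packed data for elem_size-byte elements.
static bool is_contiguous(const TensorLayout & t, std::size_t elem_size) {
    if (t.nb[0] != elem_size) {
        return false;
    }
    for (int i = 1; i < 4; ++i) {
        if (t.nb[i] != t.nb[i - 1] * static_cast<std::size_t>(t.ne[i - 1])) {
            return false;
        }
    }
    return true;
}

// Index of the highest dimension holding more than one element, plus one.
static int used_dims(const TensorLayout & t) {
    int n = 1;
    for (int i = 0; i < 4; ++i) {
        if (t.ne[i] > 1) {
            n = i + 1;
        }
    }
    return n;
}
```

With the values from the dump, src0 (`nb = {4, 768, 49152, 256}`) fails the check at dimension 1 because 4 * 64 = 256, not 768, while src1 (`nb = {4, 256, 16384, 32768}`) passes.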

@KerfuffleV2 KerfuffleV2 added the labels bug (Something isn't working), model (Model specific), and Nvidia GPU (Issues specific to Nvidia GPUs) on Nov 11, 2023
Comment on lines +436 to +437
#define CUDA_RELU_BLOCK_SIZE 256
#define CUDA_SQR_BLOCK_SIZE 256
Collaborator Author

@KerfuffleV2 KerfuffleV2 Nov 12, 2023

I'm not really sure what the optimal block sizes are. I just copied from SILU.

Contributor

@SleepyYui SleepyYui left a comment

Looks good to me.
Block sizes should not matter that much; a thread on the Nvidia forums (from a decade ago) suggests 128-256.
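To make the block-size discussion concrete, here is a hedged sketch of the usual launch arithmetic for a flat element-wise kernel. `BLOCK_SIZE`, `num_blocks`, and `thread_in_range` are illustrative names and not taken from ggml-cuda.cu; only the value 256 comes from the defines under review.

```cpp
#include <cstdint>

// Same value as the two defines under review; the name is illustrative.
constexpr int BLOCK_SIZE = 256;

// Smallest block count such that blocks * block_size covers all k elements.
constexpr std::int64_t num_blocks(std::int64_t k, int block_size) {
    return (k + block_size - 1) / block_size;
}

// Bounds check a kernel would perform: the last block may be only partly full.
constexpr bool thread_in_range(std::int64_t block_idx, int thread_idx,
                               int block_size, std::int64_t k) {
    return block_idx * block_size + thread_idx < k;
}
```

For the 64 x 64 x 2 x 3 tensor in the gdb dump (24576 elements), a block size of 256 gives 96 full blocks. Changing the block size changes only how the same work is split across blocks, which is consistent with the comment that it should not matter much for ops this simple.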

Owner

@ggerganov ggerganov left a comment

The Persimmon graph seems to be doing some overly complicated stuff. I haven't looked deeply into the logic, but we should simplify it. If necessary, the convert script can output data in a more convenient way if that helps reduce the number and types of ops currently used in the attention.

@KerfuffleV2
Collaborator Author

Thanks for the response.

The Persimmon graph seems to be doing some overly-complicated stuff.

Unfortunately, I don't really know anything about the model. I just downloaded it because of an issue about CUDA offloading and was able to find the problem.

If you have the time to answer, is there any way we can limit -ngl to the repeating layers + 1 just for Persimmon? I looked for a simple way to do that but didn't find one. It would be nice if people could pass -ngl 100 or whatever and it wouldn't crash.

@ggerganov
Owner

No, it's better to let it crash. Otherwise we will forget about this problem and won't fix it.
We can print a warning that references an issue/comment about this.

@KerfuffleV2
Collaborator Author

I added an #ifdef to the loader for the Persimmon case:

llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: CUDA backend missing Persimmon CUDA ops, can offload at most 37 layers. See: https://github.com/ggerganov/llama.cpp/issues/4038
error loading model: Persimmon CUDA offload failed
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/blah/adept-persimmon-8b-base-Q4_K_M.gguf'
main: error: unable to load model
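The check behind that message can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the pull: `persimmon_offload_error` is a hypothetical helper, and the real loader change sits behind an #ifdef rather than in a standalone function. Only the message text and the n_layers + 1 limit come from the thread.

```cpp
#include <string>

// Hypothetical helper: returns an empty string when the requested number of
// offloaded layers is supported, otherwise the message shown in the log above.
static std::string persimmon_offload_error(int n_layer, int n_gpu_layers) {
    const int max_offload = n_layer + 1;  // repeating layers + 1, per the thread
    if (n_gpu_layers <= max_offload) {
        return "";
    }
    return "CUDA backend missing Persimmon CUDA ops, can offload at most " +
           std::to_string(max_offload) +
           " layers. See: https://github.com/ggerganov/llama.cpp/issues/4038";
}
```

For the 8B model with 36 layers, a request of 37 passes and anything higher produces the message, so loading fails loudly instead of being silently clamped, in line with the owner's preference above.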

I also checked CLBlast: it doesn't seem to work with anything more than the repeating layers, and I just get garbage output. I was going to add a message for that as well, but I don't know whether the error is specific to my system or what the underlying cause is.

@KerfuffleV2 KerfuffleV2 merged commit bb50a79 into ggerganov:master Nov 13, 2023
32 checks passed
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
…erganov#4041)

* Add ReLU and SQR CUDA ops to fix Persimmon offloading

* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers