Skip computation of much of last layer & unused logits during prompt eval / large N #2700
Conversation
This doesn't seem compatible with #1087 (the ROCm port). I just get garbage output trying to evaluate a model. Doesn't seem to affect the perplexity tool (as expected). I tested with CPU only and CLBlast - seems to work fine with those. (Only tested evaluation.)
Is there something specific about the ROCm stuff that just makes it impossible to use this approach? I was just mentioning it; I wasn't sure whether the issue was with this pull or the ROCm one, or whether there were changes that could fix it.
@KerfuffleV2 Thanks a lot for testing it out! I've now disabled this when LLAMA_USE_HIPBLAS is defined, but after glancing through that PR I don't see any reason why it wouldn't work, although I have very low understanding of anything CUDA- or ROCm-related. @ggerganov Do you have existing tricks to compare intermediate outputs between backends? (some way to force synchronous evaluation of each operation maybe, e.g. w/ its own 1-node graph & compute call; maybe appending each tensor to a binary file then reading it / comparing it on the fly?). That would allow spotting at which exact op garbage starts appearing (could potentially reveal an underlying bug in the ROCm code if it's the only backend causing issues). If there's interest in shipping this PR (which is arguably borderline useful given the low single-digit speedup) I'll try and rent a cloud ROCm instance to debug 🤗
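For what it's worth, the "append each tensor to a binary file" idea could look roughly like the following hypothetical debugging helper (not code from this PR; it assumes the tensor's data is resident in host memory when it is called):

```c++
#include <cstdio>
#include "ggml.h"

// Dump one tensor (name, shape, raw bytes) so the same run on two backends
// can be diffed offline to find the first op whose output diverges.
static void dump_tensor(const struct ggml_tensor * t, FILE * f) {
    fwrite(t->name, 1, sizeof(t->name), f);                // fixed-size name field
    fwrite(t->ne,   sizeof(t->ne[0]), GGML_MAX_DIMS, f);   // shape
    fwrite(t->data, 1, ggml_nbytes(t), f);                 // raw contents
}
```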
@SlyEcho Any idea what could be going on here? #2700 (comment)
Should not be related to AMD, will have to test CUDA as well. I suspect something about the shape of the tensors is not suited to our optimized kernels.
I tried the Colab with some -ngl values smaller than 33 and they produce garbage output.
Yes, I wanted to check what the problem was on ROCm, but it's the same: when fully offloading it works, but when only offloading a few layers it doesn't.
@Engininja2 @ardfork Thanks, great to be able to repro the issue! @KerfuffleV2 @SlyEcho This PR is definitely broken on CUDA, independently of the ROCm changes, when not all layers are offloaded to the GPU. I don't understand the offload_func_* logic yet; reckon I'll need to update it.
I suspect it's got something to do with copying memory between device and host. |
Pushed a naive tweak that seems to fix the output for all -ngl values... except -ngl 1. Also @SlyEcho I've moved the define to the Makefile / CMakeLists.txt as you suggested, so it can now be disabled with either of:

```
cmake -DLLAMA_SKIP_UNUSED_LOGITS=OFF ...
LLAMA_NO_SKIP_UNUSED_LOGITS=1 make ...
```
llama.cpp (outdated)
```
if (il == n_layer - 1 && !lctx.logits_all)
{
```
This does not follow the coding guidelines in the README.
Thanks, fixed.
llama.cpp (outdated)
```
offload_func_t offload_func_nr = llama_nop; // nr = non-repeating
offload_func_t offload_func_kq = llama_nop;
offload_func_t offload_func_v  = llama_nop;
offload_func_t offload_func_skip = llama_nop;
```
Why are you defining `offload_func_skip`? If I understand your code correctly it works by discarding the unneeded parts of tensors in the last layer. I don't understand why this would affect the offloading logic, i.e. whether data should be stored in RAM or VRAM.
I was hoping `offload_func_skip` being distinct would help things, but there seems to be some entangling anyway.

Not sure I understand the interplay between `ggml_cuda_compute_forward` and the CPU fallback route, but I've noticed the former skips `GGML_OP_VIEW` if its input isn't on the GPU, so I've now tried to make sure the offload of `cur` and `inpSA`'s subviews matches whatever backend they were currently on, using the following defensive code, and... it seems to fix the issue:
```
auto cur_on_gpu = cur->backend == GGML_BACKEND_GPU;
cur = ggml_view_2d(ctx0, cur, n_embd, 1, cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
if (cur_on_gpu) ggml_cuda_assign_buffers_no_alloc(cur);

auto inpSA_on_gpu = inpSA->backend == GGML_BACKEND_GPU;
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
if (inpSA_on_gpu) ggml_cuda_assign_buffers_no_alloc(inpSA);
```
The issue being that with `-ngl 1`, `i_gpu_start = n_layer - 1`, so the last layer is the only one that offloads most of its tensors (as per the `if (il >= i_gpu_start) { offload_func = ggml_cuda_assign_buffers_no_alloc; }` block).
Another simpler fix (pushed on this PR) that also seems to work is to offload the view inputs (but that may be offloading more than intended?):
```
offload_func(cur);
cur = ggml_view_2d(ctx0, cur, n_embd, 1, cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
offload_func(cur);

offload_func(inpSA);
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
offload_func(inpSA);
```
Will try to find nicer ways to fix this (wondering if the CPU view fallback in a mostly GPU context is behaving as expected?), suggestions welcome 🙂
> Not sure I understand the interplay between ggml_cuda_compute_forward and the CPU fallback route
The decision for whether or not CUDA should be used is based on data location. If the data of any of the tensors is in VRAM, then all of the calculations are done on the GPU and the data is copied to/from VRAM as necessary based on backend. My preferred design would have been to add a state to the ggml context that sets the backend for all newly created tensors, but unfortunately this was vetoed by Georgi, who wanted to avoid GPU offloading logic in ggml. So the current design is kind of cancerous, where you have to manually offload each tensor. For tensors that don't actually do any computations this is very awkward, because there is no step where the data would be carried over if necessary. Maybe I should add an explicit check for those tensors to ensure that the backends are consistent.
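For illustration only (a hypothetical helper, not the actual fix), the "explicit check" idea could amount to keeping a view's backend in sync with its source right after the view is created:

```c++
// Hypothetical helper: after creating a view of `src`, mark the view for the
// same backend, so a GPU-resident tensor never ends up with a CPU-backend
// view that no kernel will ever copy over.
static void sync_view_backend(struct ggml_tensor * view, const struct ggml_tensor * src) {
    if (src->backend == GGML_BACKEND_GPU && view->backend != GGML_BACKEND_GPU) {
        ggml_cuda_assign_buffers_no_alloc(view);
    }
}
```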
I pushed a fix that should have the correct offloading logic here.
If it works all the time everywhere, I don't think there needs to be a compile option. But if it stays, it should be enabled by default for most users.
@JohannesGaessler thanks so much for looking into it! I still seem to be getting wrong output with that change though (still only broken for -ngl 1). Edit: found the issue, see review thread.
Good catch with -ngl 1, it seems that only that exact value was broken. I pushed more changes to my own repository. The changes are an extra check to the CUDA code to let you safely offload the same tensor multiple times (which I think currently can happen) and a removal of the first call to …
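A rough illustration of the "safely offload the same tensor multiple times" point (a hypothetical wrapper, not the actual patch):

```c++
// Make offloading idempotent: if the tensor is already marked as GPU-resident,
// a second call becomes a no-op instead of re-assigning buffers.
static void offload_once(struct ggml_tensor * t) {
    if (t->backend == GGML_BACKEND_GPU) {
        return; // already offloaded
    }
    ggml_cuda_assign_buffers_no_alloc(t);
}
```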
@JohannesGaessler thanks a lot, I've merged your changes! (sorry for the delay, faced some mysterious segfaults on Colab, seems to have been fixed by merging master again 😅)
Makefile (outdated)
```
@@ -441,6 +441,15 @@ k_quants.o: k_quants.c k_quants.h
	$(CC) $(CFLAGS) -c $< -o $@
endif # LLAMA_NO_K_QUANTS

ifndef LLAMA_NO_SKIP_UNUSED_LOGITS
CFLAGS += -DLLAMA_SKIP_UNUSED_LOGITS
```
You can't put indented assignments below a build rule like this. The ifdef doesn't make a difference syntactically; this line is treated as part of the k-quants build.
Ah thanks, I'd overlooked 4dcd47d (and hadn't merged it well), fixed.
Updated the description w/ CPU speedup figures for TinyLlamas models (15-25% faster)
(Bit-rotten since multimodel support &) Superseded by #6122 🎉
TL;DR: the speedup is modest but consistent on CPU (~4% for 7B model, 15-25% for TinyLlamas models on M2 Mac) for a rather simple change
During prompt eval (w/ N > 1), a lot of what is done at the last layer is wasted unless we want all the logits out: it's fine to focus on the last token once that layer finished writing to the kv-cache.
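In code terms the change boils down to something like the following condensed sketch, assembled from the snippets discussed in the review thread above (not the exact final diff):

```c++
// At the last layer, during prompt eval: once K/V for all N positions have been
// written to the cache, keep only the hidden state of the final position so the
// remaining matmuls (and the output projection) produce a single row of logits.
if (il == n_layer - 1 && !lctx.logits_all) {
    cur   = ggml_view_2d(ctx0, cur,   n_embd, 1, cur->nb[1],   (N - 1)*ggml_element_size(cur)*n_embd);
    inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
    offload_func(cur);
    offload_func(inpSA);
}
```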
Concretely, this means we can drop all but the last column of `cur` and `inpSA` after writing to the KV cache at the last layer, unless `logits_all` is set (I've moved `tmpq` & `Qcur` to right after that narrowing reshape).

Some notes on performance:
To test this branch: