
llama_decode is significantly slower if n_tokens > 1 #4624

Closed

apoorvumang opened this issue Dec 24, 2023 · 7 comments

Labels
bug-unconfirmed · macos (Issues specific to macOS) · performance (Speed related topics) · stale

Comments

@apoorvumang

Issue

It is expected that llama_decode should take more time if more tokens are present in the batch, but on my system (Apple M1 Max 32GB) with the mistral-7b-instruct-v0.2.Q4_0.gguf model, the increase in time taken is quite significant. I plotted some average latencies on my system for different n_tokens, using a modified version of the speculative example with timing added around llama_decode(ctx_tgt, batch_tgt):

[Figure: average llama_decode latency vs. n_tokens on M1 Max]
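
The exact modification isn't shown in the issue; a minimal sketch of such a timing wrapper might look like the following (the timed_decode helper and its output format are illustrative, not from the issue):

// Illustrative only: time a single llama_decode call and report the batch size,
// roughly what "putting timing around llama_decode(ctx_tgt, batch_tgt)" amounts to.
#include <chrono>
#include <cstdio>
#include "llama.h"

static int timed_decode(llama_context * ctx, const llama_batch & batch) {
    const auto t0 = std::chrono::steady_clock::now();

    const int ret = llama_decode(ctx, batch);   // llama_decode takes the batch struct by value

    const auto t1 = std::chrono::steady_clock::now();
    const double dt_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    fprintf(stderr, "llama_decode: n_tokens = %d, %.2f ms\n", (int) batch.n_tokens, dt_ms);
    return ret;
}

Calling timed_decode(ctx_tgt, batch_tgt) in place of the original call and averaging the printed times per n_tokens gives the kind of plot above.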

There is more than a 5x jump in the latency of llama_decode when n_tokens goes from 1 to 2 (which I feel is too high), but a very gradual increase after that. This means that techniques like speculative and lookup decoding cannot give speed benefits for small draft sizes (n_draft < 5) even if drafts are 100% correct, since autoregressively decoding 5 tokens one at a time is just as fast as decoding 5 tokens at once, so the advantage of speculation is lost.

I'm not sure whether this counts as a bug or expected behaviour, but the stark difference in latency between 1-token and 2-token decoding seems weird to me. Decoding 2 tokens should take at most 2x the time, not 5x.

To reproduce:

The easiest way to see this is to run main with a one-word prompt. The prompt eval time will be the time taken for the few prompt tokens, and the eval time will show the throughput for the rest of the tokens. For example, ./main -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -p "A" -n 100 -e gives me

llama_print_timings:        load time =     385.80 ms
llama_print_timings:      sample time =       8.03 ms /   100 runs   (    0.08 ms per token, 12451.75 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.90 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1637.12 ms /    99 runs   (   16.54 ms per token,    60.47 tokens per second)
llama_print_timings:       total time =    1744.09 ms

which shows ~85 ms for the initial forward pass with just 2 tokens (~43 ms per token), and ~16 ms per token for all subsequent tokens.

To see this effect in speculative, one can compare --draft 0 with --draft 1. Use the same model as both the draft model and the main model to ensure 100% acceptance. On my system, --draft 0 gave better target-model timing than --draft 1, which shouldn't really happen IMO.

draft = 0 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 0 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 0
n_predict = 101
n_drafted = 0
n_accept  = 0
accept    = nan%

draft:

llama_print_timings:        load time =     982.45 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =      85.60 ms /     2 tokens (   42.80 ms per token,    23.36 tokens per second)
llama_print_timings:        eval time =    1653.63 ms /   101 runs   (   16.37 ms per token,    61.08 tokens per second)
llama_print_timings:       total time =    3453.52 ms

target:

llama_print_timings:        load time =     479.45 ms
llama_print_timings:      sample time =      17.57 ms /   101 runs   (    0.17 ms per token,  5750.07 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1676.51 ms /   102 runs   (   16.44 ms per token,    60.84 tokens per second)
llama_print_timings:       total time =    4460.08 ms

draft = 1 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 1 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 1
n_predict = 102
n_drafted = 36
n_accept  = 36
accept    = 100.000%

draft:

llama_print_timings:        load time =     960.89 ms
llama_print_timings:      sample time =     124.45 ms /     1 runs   (  124.45 ms per token,     8.04 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.91 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1701.90 ms /   102 runs   (   16.69 ms per token,    59.93 tokens per second)
llama_print_timings:       total time =    5584.70 ms

target:

llama_print_timings:        load time =     431.73 ms
llama_print_timings:      sample time =      19.67 ms /   102 runs   (    0.19 ms per token,  5184.77 tokens per second)
llama_print_timings: prompt eval time =    3076.34 ms /    72 tokens (   42.73 ms per token,    23.40 tokens per second)
llama_print_timings:        eval time =     520.40 ms /    31 runs   (   16.79 ms per token,    59.57 tokens per second)
llama_print_timings:       total time =    6569.38 ms

So with draft = 1 the target model is much slower, taking 6.5 s in total compared to 4.4 s with no draft at all, which is weird.
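
The timings above already explain the gap: with --draft 1 the target model processes 36 two-token batches (the 72 "prompt eval" tokens) at ~85 ms per call, while decoding the same tokens one at a time would cost roughly 2 × 16.4 ms per step. A minimal back-of-the-envelope sketch, using only the numbers reported above:

// Back-of-the-envelope check using the draft = 1 target timings reported above.
#include <cstdio>

int main() {
    const double ms_per_2tok_call = 85.5;  // ~42.7 ms/token * 2 tokens ("prompt eval time")
    const double ms_per_1tok_call = 16.4;  // single-token decode ("eval time")
    const int    n_spec_steps     = 36;    // n_drafted = n_accept = 36, i.e. 72 batched tokens

    const double t_batched = n_spec_steps * ms_per_2tok_call;      // ~3078 ms, close to the reported 3076 ms
    const double t_serial  = n_spec_steps * 2 * ms_per_1tok_call;  // ~1181 ms for the same 72 tokens, one at a time

    printf("batched: %.0f ms, serial: %.0f ms\n", t_batched, t_serial);
    return 0;
}

So even with 100% acceptance, each two-token speculative step costs about as much as five plain autoregressive steps on this hardware.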

@slaren
Collaborator

slaren commented Dec 24, 2023

M3 Max:

./llama-bench -m models/mistral-q4_0.gguf -p 1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,32 -n 0
| model         |     size | params | backend | ngl | test  |           t/s |
| ------------- | -------: | -----: | ------- | --: | ----- | ------------: |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 1  |  63.62 ± 0.54 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 2  |  25.92 ± 0.09 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 3  |  38.87 ± 0.15 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 4  |  51.93 ± 0.06 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 5  |  64.29 ± 0.11 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 6  |  77.40 ± 0.31 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 7  |  89.91 ± 0.14 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 8  | 102.94 ± 0.26 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 10 | 127.68 ± 0.30 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 11 | 141.16 ± 0.22 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 12 | 155.10 ± 0.26 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 13 | 166.40 ± 1.12 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 14 | 179.20 ± 0.48 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 15 | 191.79 ± 0.71 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 16 | 204.70 ± 0.76 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 32 | 398.31 ± 0.52 |

build: b9f4795 (1699)

slaren added the performance and macos labels on Dec 24, 2023
@apoorvumang
Author

Thanks for testing @slaren. I converted your t/s numbers into per-call latency (number of tokens divided by t/s), and the pattern looks very similar:
[Figure: per-call latency derived from the M3 Max llama-bench t/s numbers, showing the same pattern]
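
A minimal sketch of that conversion, using a subset of the M3 Max numbers above (per-call latency for a "pp N" entry is N divided by the measured t/s):

// Convert llama-bench "pp N" throughput into per-call latency: N tokens / (t/s).
#include <cstdio>

int main() {
    // (n_tokens, t/s) pairs taken from the M3 Max table above
    const struct { int n_tokens; double tps; } runs[] = {
        { 1, 63.62 }, { 2, 25.92 }, { 4, 51.93 }, { 8, 102.94 }, { 16, 204.70 },
    };

    for (const auto & r : runs) {
        const double latency_ms = 1000.0 * r.n_tokens / r.tps;
        printf("pp %2d -> %6.2f ms per call\n", r.n_tokens, latency_ms);
    }
    return 0;
}

This prints ~15.7 ms for pp 1 and ~77-78 ms for everything from pp 2 up to pp 16, i.e. the same jump-then-plateau pattern as on the M1 Max.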

@kalomaze
Contributor

kalomaze commented Dec 31, 2023

@ggerganov Is speculation happening on the CPU / not parallelized? I hear reports that speculation works great for exllama2, but it seems to provide no tangible gains on 70B for llama.cpp. I see significant value in speculation as an inference-time optimization, but unfortunately it seems to provide no practical benefit for most users because of the massive latency overhead.

@ggerganov
Owner

@apoorvumang Small-batch decoding with Metal needs some manual adjustments to get the best performance:

#3524 (comment)

I've provided an example for M2 Ultra, but I still don't know how to make this more generic:

llama.cpp/ggml-metal.m

Lines 1493 to 1518 in 120a1a5

// find the break-even point where the matrix-matrix kernel becomes more efficient compared
// to the matrix-vector kernel
int ne11_mm_min = 1;

#if 0
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
if ([ctx->device.name isEqualToString:@"Apple M2 Ultra"]) {
    switch (src0t) {
        case GGML_TYPE_F16:  ne11_mm_min = 2;  break;
        case GGML_TYPE_Q8_0: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q2_K: ne11_mm_min = 15; break;
        case GGML_TYPE_Q3_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1: ne11_mm_min = 15; break;
        case GGML_TYPE_Q4_K: ne11_mm_min = 11; break;
        case GGML_TYPE_Q5_0: // not tested yet
        case GGML_TYPE_Q5_1: ne11_mm_min = 13; break; // not tested yet
        case GGML_TYPE_Q5_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q6_K: ne11_mm_min = 7;  break;
        default:             ne11_mm_min = 1;  break;
    }
}
#endif

@kalomaze No, the speculative decoding examples utilize the GPU when available and work as expected once you take the batched decoding speed into account.

@apoorvumang
Author

Thanks @ggerganov! I will try to read and understand the Metal implementation and the speed tradeoffs on my system (M1 Max 32GB).

github-actions bot
Contributor

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Mar 18, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 2, 2024