
llama_decode is significantly slower if n_tokens > 1 #4624

Closed

apoorvumang opened this issue Dec 24, 2023 · 7 comments

Labels
bug-unconfirmed · macos (Issues specific to macOS) · performance (Speed related topics) · stale

Comments

@apoorvumang

Issue

It is expected that llama_decode should take more time if more tokens are present in the batch, but on my system (Apple M1 Max 32GB) with the mistral-7b-instruct-v0.2.Q4_0.gguf model, the increase in time taken is quite significant. I plotted some average latencies on my system for different n_tokens, using a modified version of the speculative example with timing added around llama_decode(ctx_tgt, batch_tgt):

[Figure: average llama_decode latency vs. n_tokens on M1 Max]
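
The exact modification isn't shown in the issue; a minimal sketch of such a timing wrapper might look like the following (the timed_decode helper and its output format are illustrative, not from the issue):

// Illustrative only: time a single llama_decode call and report the batch size,
// roughly what "putting timing around llama_decode(ctx_tgt, batch_tgt)" amounts to.
#include <chrono>
#include <cstdio>
#include "llama.h"

static int timed_decode(llama_context * ctx, const llama_batch & batch) {
    const auto t0 = std::chrono::steady_clock::now();

    const int ret = llama_decode(ctx, batch);   // llama_decode takes the batch struct by value

    const auto t1 = std::chrono::steady_clock::now();
    const double dt_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    fprintf(stderr, "llama_decode: n_tokens = %d, %.2f ms\n", (int) batch.n_tokens, dt_ms);
    return ret;
}

Calling timed_decode(ctx_tgt, batch_tgt) in place of the original call and averaging the printed times per n_tokens gives the kind of plot above.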

There is more than a 5x jump in the latency of llama_decode when n_tokens goes from 1 to 2 (which I feel is too high), but a very gradual increase after that. This means that techniques like speculative and lookup decoding cannot give speed benefits for small draft sizes (n_draft < 5) even if drafts are 100% correct, since autoregressively decoding 5 tokens one at a time is just as fast as decoding 5 tokens at once, so the advantage of speculation is lost.

I'm not sure whether this counts as a bug or expected behaviour, but the stark difference in latency between 1-token and 2-token decoding seems weird to me. Decoding 2 tokens should take at most 2x the time, not 5x.

To reproduce:

The easiest way to see this is to run main with a one-word prompt. The prompt eval time will be the time taken for the few prompt tokens, and the eval time will show the throughput for the rest of the tokens. For example, ./main -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -p "A" -n 100 -e gives me

llama_print_timings:        load time =     385.80 ms
llama_print_timings:      sample time =       8.03 ms /   100 runs   (    0.08 ms per token, 12451.75 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.90 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1637.12 ms /    99 runs   (   16.54 ms per token,    60.47 tokens per second)
llama_print_timings:       total time =    1744.09 ms

which shows ~85 ms for the initial forward pass with just 2 tokens (~43 ms per token), and ~16 ms per token for all subsequent tokens.

To see this effect in speculative, one can compare --draft 0 with --draft 1. Use the same model as both the draft model and the main model to ensure 100% acceptance. On my system, --draft 0 gave better target-model timing than --draft 1, which shouldn't really happen IMO.

draft = 0 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 0 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 0
n_predict = 101
n_drafted = 0
n_accept  = 0
accept    = nan%

draft:

llama_print_timings:        load time =     982.45 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =      85.60 ms /     2 tokens (   42.80 ms per token,    23.36 tokens per second)
llama_print_timings:        eval time =    1653.63 ms /   101 runs   (   16.37 ms per token,    61.08 tokens per second)
llama_print_timings:       total time =    3453.52 ms

target:

llama_print_timings:        load time =     479.45 ms
llama_print_timings:      sample time =      17.57 ms /   101 runs   (    0.17 ms per token,  5750.07 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1676.51 ms /   102 runs   (   16.44 ms per token,    60.84 tokens per second)
llama_print_timings:       total time =    4460.08 ms

draft = 1 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 1 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 1
n_predict = 102
n_drafted = 36
n_accept  = 36
accept    = 100.000%

draft:

llama_print_timings:        load time =     960.89 ms
llama_print_timings:      sample time =     124.45 ms /     1 runs   (  124.45 ms per token,     8.04 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.91 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1701.90 ms /   102 runs   (   16.69 ms per token,    59.93 tokens per second)
llama_print_timings:       total time =    5584.70 ms

target:

llama_print_timings:        load time =     431.73 ms
llama_print_timings:      sample time =      19.67 ms /   102 runs   (    0.19 ms per token,  5184.77 tokens per second)
llama_print_timings: prompt eval time =    3076.34 ms /    72 tokens (   42.73 ms per token,    23.40 tokens per second)
llama_print_timings:        eval time =     520.40 ms /    31 runs   (   16.79 ms per token,    59.57 tokens per second)
llama_print_timings:       total time =    6569.38 ms

So with draft = 1 the target model is much slower, taking 6.5 s in total compared to 4.4 s with no draft at all, which is weird.
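
The timings above already explain the gap: with --draft 1 the target model processes 36 two-token batches (the 72 "prompt eval" tokens) at ~85 ms per call, while decoding the same tokens one at a time would cost roughly 2 × 16.4 ms per step. A minimal back-of-the-envelope sketch, using only the numbers reported above:

// Back-of-the-envelope check using the draft = 1 target timings reported above.
#include <cstdio>

int main() {
    const double ms_per_2tok_call = 85.5;  // ~42.7 ms/token * 2 tokens ("prompt eval time")
    const double ms_per_1tok_call = 16.4;  // single-token decode ("eval time")
    const int    n_spec_steps     = 36;    // n_drafted = n_accept = 36, i.e. 72 batched tokens

    const double t_batched = n_spec_steps * ms_per_2tok_call;      // ~3078 ms, close to the reported 3076 ms
    const double t_serial  = n_spec_steps * 2 * ms_per_1tok_call;  // ~1181 ms for the same 72 tokens, one at a time

    printf("batched: %.0f ms, serial: %.0f ms\n", t_batched, t_serial);
    return 0;
}

So even with 100% acceptance, each two-token speculative step costs about as much as five plain autoregressive steps on this hardware.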

@slaren
Collaborator

slaren commented Dec 24, 2023

M3 Max:

./llama-bench -m models/mistral-q4_0.gguf -p 1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,32 -n 0
| model         |     size | params | backend | ngl | test  |           t/s |
| ------------- | -------: | -----: | ------- | --: | ----- | ------------: |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 1  |  63.62 ± 0.54 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 2  |  25.92 ± 0.09 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 3  |  38.87 ± 0.15 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 4  |  51.93 ± 0.06 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 5  |  64.29 ± 0.11 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 6  |  77.40 ± 0.31 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 7  |  89.91 ± 0.14 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 8  | 102.94 ± 0.26 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 10 | 127.68 ± 0.30 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 11 | 141.16 ± 0.22 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 12 | 155.10 ± 0.26 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 13 | 166.40 ± 1.12 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 14 | 179.20 ± 0.48 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 15 | 191.79 ± 0.71 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 16 | 204.70 ± 0.76 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 32 | 398.31 ± 0.52 |

build: b9f4795 (1699)

slaren added the performance and macos labels on Dec 24, 2023
@apoorvumang
Author

Thanks for testing @slaren. I converted your t/s numbers into per-call latency (number of tokens divided by t/s), and the pattern looks very similar:
[Figure: per-call latency derived from the M3 Max llama-bench t/s numbers, showing the same pattern]
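
A minimal sketch of that conversion, using a subset of the M3 Max numbers above (per-call latency for a "pp N" entry is N divided by the measured t/s):

// Convert llama-bench "pp N" throughput into per-call latency: N tokens / (t/s).
#include <cstdio>

int main() {
    // (n_tokens, t/s) pairs taken from the M3 Max table above
    const struct { int n_tokens; double tps; } runs[] = {
        { 1, 63.62 }, { 2, 25.92 }, { 4, 51.93 }, { 8, 102.94 }, { 16, 204.70 },
    };

    for (const auto & r : runs) {
        const double latency_ms = 1000.0 * r.n_tokens / r.tps;
        printf("pp %2d -> %6.2f ms per call\n", r.n_tokens, latency_ms);
    }
    return 0;
}

This prints ~15.7 ms for pp 1 and ~77-78 ms for everything from pp 2 up to pp 16, i.e. the same jump-then-plateau pattern as on the M1 Max.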

@kalomaze
Contributor

kalomaze commented Dec 31, 2023

@ggerganov Is speculation happening on the CPU / not parallelized? I hear reports that speculation works great for exllama2, but it seems to provide no tangible gains on 70B for llama.cpp. I see significant value in speculation as an inference-time optimization, but unfortunately it seems to provide no practical benefit for most users because of the massive latency overhead.

@ggerganov
Owner

@apoorvumang Small-batch decoding with Metal needs some manual adjustments to get the best performance:

#3524 (comment)

I've provided an example for M2 Ultra, but I still don't know how to make this more generic:

llama.cpp/ggml-metal.m

Lines 1493 to 1518 in 120a1a5

// find the break-even point where the matrix-matrix kernel becomes more efficient compared
// to the matrix-vector kernel
int ne11_mm_min = 1;

#if 0
// the numbers below are measured on M2 Ultra for 7B and 13B models
// these numbers do not translate to other devices or model sizes
// TODO: need to find a better approach
if ([ctx->device.name isEqualToString:@"Apple M2 Ultra"]) {
    switch (src0t) {
        case GGML_TYPE_F16:  ne11_mm_min = 2;  break;
        case GGML_TYPE_Q8_0: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q2_K: ne11_mm_min = 15; break;
        case GGML_TYPE_Q3_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1: ne11_mm_min = 15; break;
        case GGML_TYPE_Q4_K: ne11_mm_min = 11; break;
        case GGML_TYPE_Q5_0: // not tested yet
        case GGML_TYPE_Q5_1: ne11_mm_min = 13; break; // not tested yet
        case GGML_TYPE_Q5_K: ne11_mm_min = 7;  break;
        case GGML_TYPE_Q6_K: ne11_mm_min = 7;  break;
        default:             ne11_mm_min = 1;  break;
    }
}
#endif

@kalomaze No, the speculative decoding examples utilize the GPU when available and work as expected once you take the batched decoding speed into account.

@apoorvumang
Author

Thanks @ggerganov! I will try to read and understand the Metal implementation and the speed tradeoffs on my system (M1 Max 32GB).

github-actions bot
Contributor

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Mar 18, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 2, 2024