`llama_decode` is significantly slower if `n_tokens` > 1 (#4624)
Comments
M3 Max: build b9f4795 (1699) (benchmark table not preserved in this copy)
Thanks for testing @slaren, I converted your t/s numbers into per-call latency, and the pattern seems very similar (the conversion is the number of tokens in the batch divided by the reported t/s).
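A minimal sketch of that conversion, using hypothetical numbers rather than the measured ones (my own helper, not llama.cpp code):

```cpp
// Illustrative conversion only: per-call latency in ms = 1000 * n_tokens / tokens_per_second.
#include <cstdio>

static double per_call_latency_ms(int n_tokens, double tokens_per_second) {
    return 1000.0 * n_tokens / tokens_per_second;
}

int main() {
    // hypothetical numbers: a batch of 2 tokens measured at 25 t/s
    // corresponds to roughly 80 ms per llama_decode call
    printf("%.1f ms\n", per_call_latency_ms(2, 25.0));
    return 0;
}
```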
@ggerganov Is speculation happening on CPU / not parallelized? I hear reports that speculation works great for exllama2, but it seems to provide no tangible gains on 70b for llama.cpp. I see significant value in speculation as an inference-time optimization, but unfortunately it seems to provide no practical benefit for most users because of the massive latency overhead.
@apoorvumang Small-batch decoding with Metal needs some manual adjustments to get the best performance. I've provided an example for M2 Ultra, but I still don't know how to make this more generic: lines 1493 to 1518 at 120a1a5.
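Assuming the referenced lines are the Metal matrix-multiplication kernel selection, the idea is roughly a batch-size threshold per quantization type: tiny batches go to a matrix-vector kernel, larger batches to a matrix-matrix kernel. The sketch below only illustrates that idea; the type names and threshold values are placeholders, not the actual llama.cpp code or tuned values.

```cpp
// Illustrative sketch only: a batch-size threshold for choosing between a
// matrix-vector kernel (fast for tiny batches) and a matrix-matrix kernel
// (fast for larger batches). All names and numbers are placeholders, not
// the actual llama.cpp Metal backend code or its tuned thresholds.
#include <cstdio>

enum quant_type { QT_F16, QT_Q4_0, QT_Q8_0 };

// hypothetical per-type minimum batch size at which the mat-mat kernel wins
static int min_batch_for_mat_mat(quant_type t) {
    switch (t) {
        case QT_F16:  return 2; // placeholder value
        case QT_Q4_0: return 6; // placeholder value
        case QT_Q8_0: return 7; // placeholder value
    }
    return 2;
}

static const char * pick_kernel(quant_type t, int n_batch) {
    return n_batch < min_batch_for_mat_mat(t) ? "mul_mv (matrix-vector)"
                                              : "mul_mm (matrix-matrix)";
}

int main() {
    for (int n = 1; n <= 8; ++n) {
        printf("n_batch = %d -> %s\n", n, pick_kernel(QT_Q4_0, n));
    }
    return 0;
}
```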
@kalomaze No, the speculative decoding examples utilize the GPU when available and work as expected when you take into account the batched decoding speed.
Thanks @ggerganov! I will try to read and understand the Metal implementation and the speed tradeoffs on my system (M1 Max 32GB).
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Issue
It is expected that `llama_decode` should take more time if more tokens are present in the batch, but on my system (Apple M1 Max 32GB) with the `mistral-7b-instruct-v0.2.Q4_0.gguf` model, the increase in time taken is quite significant. I plotted some average latencies on my system with different `n_tokens`, using a modified version of `speculative` and putting timing around `llama_decode(ctx_tgt, batch_tgt);` (a minimal sketch of this timing wrapper is shown below).

There is more than a 5x jump in the latency of `llama_decode` when `n_tokens` goes from 1 to 2 (which I feel is too high), but a very gradual increase after that. This means that techniques like `speculative` and `lookup` decoding cannot give speed benefits for small draft sizes (`n_draft < 5`) even if drafts are 100% correct, since autoregressively decoding 5 tokens one at a time is just as fast as decoding 5 tokens at once, so the advantage of speculation is lost.

I'm not sure whether this counts as a bug or expected behaviour, but the stark difference in latency between 1-token decoding and 2-token decoding seems weird to me. Decoding 2 tokens should at most take 2x the time, not 5x?
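For reference, this is a minimal sketch of the kind of timing wrapper used; it assumes `ctx_tgt` and `batch_tgt` are already set up as in `examples/speculative`, and the function name and log format are illustrative, not part of llama.cpp.

```cpp
// Minimal timing wrapper around llama_decode (illustrative; only the
// timing and logging are added around the existing call).
#include <chrono>
#include <cstdio>

#include "llama.h"

static int timed_decode(llama_context * ctx_tgt, llama_batch batch_tgt) {
    const auto t_start = std::chrono::high_resolution_clock::now();

    const int ret = llama_decode(ctx_tgt, batch_tgt);

    const auto t_end = std::chrono::high_resolution_clock::now();
    const double t_ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();

    fprintf(stderr, "llama_decode: n_tokens = %d, latency = %.2f ms\n",
            (int) batch_tgt.n_tokens, t_ms);

    return ret;
}
```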
To reproduce:
The easiest way to see this is running `main` with a one-word prompt. The `prompt eval time` will be the time taken for the few prompt tokens, and `eval time` will show the throughput for the rest of the tokens. For example, `./main -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -p "A" -n 100 -e` gives me output which shows ~85 ms for the initial forward pass with just 2 tokens, and ~16 ms for all other tokens.
To see this effect in `speculative`, one can compare `--draft 0` with `--draft 1`. Use the same model as both the draft model and the main model to ensure 100% acceptance. On my system, `--draft 0` gave better target-model timings than `--draft 1`, which shouldn't really happen IMO.

draft = 0 command:

Timings:

draft = 1 command:

Timings:

So draft = 1 has a much slower target model, taking 6.5 s compared to 4.4 s if there was no draft model, which is weird.