Benchmarks for llama quantised models with gguf #844
Comments
Laurent Mazare:
Are these properly excluding the initial prompt evaluation time on both sides? The 3.62 token/s (candle) vs 4.32 token/s (llama.cpp) doesn't actually look that bad. You could try to enable the --tracing mode to dig into where the time is spent. The 9.65 token/s vs 18.58 token/s is a lot worse, so that one is probably worth investigating if you can. Also, just to be sure: on the candle side you're using --features accelerate, right? GPU support isn't available yet for mac, so all the candle numbers are cpu only (+ maybe the neural engine with accelerate).
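[Editor's note] For anyone following along, the --tracing flag in the candle examples is wired up roughly like the sketch below, based on the tracing-chrome crate; the span name and the sampling-loop placeholder are illustrative, not candle's exact code. The recorded spans land in a trace-*.json file that chrome://tracing or Perfetto can open, which shows where the per-token time goes.

```rust
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    // The guard flushes trace-*.json when it is dropped at the end of main.
    let (chrome_layer, _guard) = ChromeLayerBuilder::new().build();
    tracing_subscriber::registry().with(chrome_layer).init();

    // Anything instrumented with tracing spans now shows up in the trace.
    let span = tracing::span!(tracing::Level::TRACE, "token_generation");
    let _enter = span.enter();
    // ... run the forward pass / sampling loop here ...
}
```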
okpatil4u:
I will check and get back to you tomorrow. In the meantime, could you let me know if something like speculative sampling is possible using candle?
Laurent Mazare:
I imagine that it might be possible but not easy; we're likely to add gpu support before that. Also, just to mention that the goal of the quantized example is not really to provide a full-featured llama.cpp equivalent, but rather to be an example of how to use quantized models. So we would prefer not to add too much complexity there, and would certainly be happy if new projects are created to build a more feature-complete and performant version.
okpatil4u:
I meant to say: if we try it at our end, is it technically feasible? I have 4 developers learning and using candle to build a few prototypes (candle is their first ML framework ❤️). I didn't want to choose a road with a dead end.
Anyway, you already have your hands full, and the breakneck speed you have been working at is amazing. Thank you for all your help!
Laurent Mazare:
Well, I'm not very familiar with the details, but I don't see a reason why it wouldn't be. Running with multiple elements in a batch should be well supported, and I think that's the only required thing on the candle side? Best is probably to give it a try and see what happens :)
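[Editor's note] To make the idea concrete, here is a framework-agnostic sketch of the greedy variant of speculative decoding. Everything here is illustrative: `draft` and `target` are hypothetical stand-ins for two language models exposed as "tokens in, one logit vector per position out", and `speculative_step` / `argmax` are names invented for the sketch. The key requirement, as noted above, is that the target model can score all of the draft's proposed tokens in a single forward pass.

```rust
/// Greedy argmax over a logit vector.
fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap()
}

/// One speculative step: the cheap draft model proposes `k` tokens, the
/// expensive target model scores all of them in a single forward pass,
/// and we keep the longest prefix the two models agree on.
fn speculative_step(
    draft: &dyn Fn(&[u32]) -> Vec<Vec<f32>>,
    target: &dyn Fn(&[u32]) -> Vec<Vec<f32>>,
    prefix: &[u32],
    k: usize,
) -> Vec<u32> {
    assert!(!prefix.is_empty(), "need at least one prompt token");
    // 1. Draft model proposes k tokens autoregressively (cheap).
    let mut proposed = prefix.to_vec();
    for _ in 0..k {
        let logits = draft(&proposed);
        proposed.push(argmax(logits.last().unwrap()));
    }
    // 2. Target model scores every proposed position in one batched pass.
    let target_logits = target(&proposed);
    // 3. Accept draft tokens while the target agrees; on the first
    //    disagreement take the target's own token and stop. (A full
    //    implementation uses the probabilistic accept/reject rule and
    //    samples a bonus token when all k are accepted.)
    let mut accepted = Vec::new();
    for i in 0..k {
        let t = argmax(&target_logits[prefix.len() + i - 1]);
        accepted.push(t);
        if t != proposed[prefix.len() + i] {
            break;
        }
    }
    accepted
}

fn main() {
    // Toy "model" over a 4-token vocabulary that always predicts t+1 mod 4,
    // used as both draft and target so every proposal is accepted.
    let next = |ts: &[u32]| -> Vec<Vec<f32>> {
        ts.iter()
            .map(|&t| {
                let mut l = vec![0.0f32; 4];
                l[((t + 1) % 4) as usize] = 1.0;
                l
            })
            .collect()
    };
    println!("{:?}", speculative_step(&next, &next, &[0], 3)); // [1, 2, 3]
}
```

Since the target scores the whole drafted sequence in one call, batching over positions is the only thing the framework has to support, which matches the comment above.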
Lukas Kreussel:
@okpatil4u The qmatmul implementation is currently far from optimal, and could probably be improved with some better thread management and better allocation. Feel free to look into it.
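[Editor's note] The sketch below is not candle's actual qmatmul; it is a toy block-quantised matvec (the `QRow`, `qdot`, and `qmatvec` names are invented for illustration) showing the two suggested improvements: distributing output rows over a rayon thread pool rather than hand-rolled threads, and writing into a caller-provided buffer so the hot loop allocates nothing.

```rust
use rayon::prelude::*;

const BLOCK: usize = 32;

/// One quantised matrix row: an f32 scale per block of 32 int8 values.
struct QRow {
    scales: Vec<f32>, // scales.len() == quants.len() / BLOCK
    quants: Vec<i8>,
}

/// Dequantise-and-dot one row against the activation vector.
fn qdot(row: &QRow, x: &[f32]) -> f32 {
    row.scales
        .iter()
        .enumerate()
        .map(|(b, &scale)| {
            let qs = &row.quants[b * BLOCK..(b + 1) * BLOCK];
            let xs = &x[b * BLOCK..(b + 1) * BLOCK];
            scale * qs.iter().zip(xs).map(|(&q, &x)| f32::from(q) * x).sum::<f32>()
        })
        .sum()
}

/// y = W * x with W stored as quantised rows. Rows are split across the
/// rayon thread pool, and `y` is a reusable caller-provided buffer, so
/// the hot loop performs no allocation.
fn qmatvec(w: &[QRow], x: &[f32], y: &mut [f32]) {
    y.par_iter_mut()
        .zip(w.par_iter())
        .for_each(|(out, row)| *out = qdot(row, x));
}

fn main() {
    let row = QRow { scales: vec![0.5], quants: vec![2i8; BLOCK] };
    let x = vec![1.0f32; BLOCK];
    let mut y = vec![0.0f32; 1];
    qmatvec(&[row], &x, &mut y);
    println!("{y:?}"); // [32.0] = 0.5 * (2 * 1.0) * 32
}
```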
okpatil4u:
Thank you Lukas, Laurent. This is super helpful.
Original issue description (okpatil4u):
M2 Ultra, 26 cores, 64 GB
With Candle: [benchmark results attached as an image in the original issue]
vs Llama.cpp: [benchmark results attached as an image in the original issue]
Any pointers on how one can improve performance through candle?
Also, I am trying to implement speculative sampling through candle. Do you think the implementation is feasible?