Benchmarks for llama quantised models with gguf #844
Comments
Laurent Mazare:
Are these properly excluding the initial prompt evaluation time on both sides? The 3.62 token/s (candle) vs 4.32 token/s (llama.cpp) doesn't actually look that bad. You could try to enable the --tracing mode to dig into where the time is spent. The 9.65 token/s vs 18.58 token/s is a lot worse, so that one is probably worth investigating if you can. Also, just to be sure: on the candle side you're using --features accelerate, right? GPU support isn't available yet for mac, so all the candle numbers are cpu only (+ maybe the neural engine with accelerate).
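[Editor's note] For anyone following along, the --tracing flag in the candle examples is wired up roughly like the sketch below, based on the tracing-chrome crate; the span name and the sampling-loop placeholder are illustrative, not candle's exact code. The recorded spans land in a trace-*.json file that chrome://tracing or Perfetto can open, which shows where the per-token time goes.

```rust
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    // The guard flushes trace-*.json when it is dropped at the end of main.
    let (chrome_layer, _guard) = ChromeLayerBuilder::new().build();
    tracing_subscriber::registry().with(chrome_layer).init();

    // Anything instrumented with tracing spans now shows up in the trace.
    let span = tracing::span!(tracing::Level::TRACE, "token_generation");
    let _enter = span.enter();
    // ... run the forward pass / sampling loop here ...
}
```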
okpatil4u:
I will check and get back to you tomorrow. In the meantime, could you let me know if something like speculative sampling is possible using candle?
Laurent Mazare:
I imagine that it might be possible but not easy; we're likely to add gpu support before that. Also, just to mention that the goal of the quantized example is not really to provide a full-featured llama.cpp equivalent, but rather to be an example of how to use quantized models. So we would prefer not to add too much complexity there, and would certainly be happy if new projects are created to build a more feature-complete and performant version.
okpatil4u:
I meant to say: if we try it at our end, is it technically feasible? I have 4 developers learning and using candle to build a few prototypes (candle is their first ML framework ❤️). I didn't want to choose a road with a dead end.
Anyway, you already have your hands full, and the breakneck speed you have been working at is amazing. Thank you for all your help!
Laurent Mazare:
Well, I'm not very familiar with the details, but I don't see a reason why it wouldn't be. Running with multiple elements in a batch should be well supported, and I think that's the only required thing on the candle side? Best is probably to give it a try and see what happens :)
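[Editor's note] To make the idea concrete, here is a framework-agnostic sketch of the greedy variant of speculative decoding. Everything here is illustrative: `draft` and `target` are hypothetical stand-ins for two language models exposed as "tokens in, one logit vector per position out", and `speculative_step` / `argmax` are names invented for the sketch. The key requirement, as noted above, is that the target model can score all of the draft's proposed tokens in a single forward pass.

```rust
/// Greedy argmax over a logit vector.
fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap()
}

/// One speculative step: the cheap draft model proposes `k` tokens, the
/// expensive target model scores all of them in a single forward pass,
/// and we keep the longest prefix the two models agree on.
fn speculative_step(
    draft: &dyn Fn(&[u32]) -> Vec<Vec<f32>>,
    target: &dyn Fn(&[u32]) -> Vec<Vec<f32>>,
    prefix: &[u32],
    k: usize,
) -> Vec<u32> {
    assert!(!prefix.is_empty(), "need at least one prompt token");
    // 1. Draft model proposes k tokens autoregressively (cheap).
    let mut proposed = prefix.to_vec();
    for _ in 0..k {
        let logits = draft(&proposed);
        proposed.push(argmax(logits.last().unwrap()));
    }
    // 2. Target model scores every proposed position in one batched pass.
    let target_logits = target(&proposed);
    // 3. Accept draft tokens while the target agrees; on the first
    //    disagreement take the target's own token and stop. (A full
    //    implementation uses the probabilistic accept/reject rule and
    //    samples a bonus token when all k are accepted.)
    let mut accepted = Vec::new();
    for i in 0..k {
        let t = argmax(&target_logits[prefix.len() + i - 1]);
        accepted.push(t);
        if t != proposed[prefix.len() + i] {
            break;
        }
    }
    accepted
}

fn main() {
    // Toy "model" over a 4-token vocabulary that always predicts t+1 mod 4,
    // used as both draft and target so every proposal is accepted.
    let next = |ts: &[u32]| -> Vec<Vec<f32>> {
        ts.iter()
            .map(|&t| {
                let mut l = vec![0.0f32; 4];
                l[((t + 1) % 4) as usize] = 1.0;
                l
            })
            .collect()
    };
    println!("{:?}", speculative_step(&next, &next, &[0], 3)); // [1, 2, 3]
}
```

Since the target scores the whole drafted sequence in one call, batching over positions is the only thing the framework has to support, which matches the comment above.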
Lukas Kreussel:
@okpatil4u The qmatmul implementation is currently far from optimal, and could probably be improved with some better thread management and better allocation. Feel free to look into it.
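[Editor's note] The sketch below is not candle's actual qmatmul; it is a toy block-quantised matvec (the `QRow`, `qdot`, and `qmatvec` names are invented for illustration) showing the two suggested improvements: distributing output rows over a rayon thread pool rather than hand-rolled threads, and writing into a caller-provided buffer so the hot loop allocates nothing.

```rust
use rayon::prelude::*;

const BLOCK: usize = 32;

/// One quantised matrix row: an f32 scale per block of 32 int8 values.
struct QRow {
    scales: Vec<f32>, // scales.len() == quants.len() / BLOCK
    quants: Vec<i8>,
}

/// Dequantise-and-dot one row against the activation vector.
fn qdot(row: &QRow, x: &[f32]) -> f32 {
    row.scales
        .iter()
        .enumerate()
        .map(|(b, &scale)| {
            let qs = &row.quants[b * BLOCK..(b + 1) * BLOCK];
            let xs = &x[b * BLOCK..(b + 1) * BLOCK];
            scale * qs.iter().zip(xs).map(|(&q, &x)| f32::from(q) * x).sum::<f32>()
        })
        .sum()
}

/// y = W * x with W stored as quantised rows. Rows are split across the
/// rayon thread pool, and `y` is a reusable caller-provided buffer, so
/// the hot loop performs no allocation.
fn qmatvec(w: &[QRow], x: &[f32], y: &mut [f32]) {
    y.par_iter_mut()
        .zip(w.par_iter())
        .for_each(|(out, row)| *out = qdot(row, x));
}

fn main() {
    let row = QRow { scales: vec![0.5], quants: vec![2i8; BLOCK] };
    let x = vec![1.0f32; BLOCK];
    let mut y = vec![0.0f32; 1];
    qmatvec(&[row], &x, &mut y);
    println!("{y:?}"); // [32.0] = 0.5 * (2 * 1.0) * 32
}
```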
okpatil4u:
Thank you Lukas, Laurent. This is super helpful.
Original issue description (okpatil4u):
M2 Ultra, 26 cores, 64 GB
With Candle: [benchmark results attached as an image in the original issue]
vs Llama.cpp: [benchmark results attached as an image in the original issue]
Any pointers on how one can improve performance through candle?
Also, I am trying to implement speculative sampling through candle. Do you think the implementation is feasible?