kompute : llama-bench support and ggml_cpu_has_kompute() #5226
Conversation
It doesn't really need to be added to ggml.c; eventually all of the backend code will be removed from there. The llama.cpp change also does nothing, since the CPU backend no longer runs at the same time as the GPU backends; it's just a leftover from the pre-ggml-backend implementation that I forgot to remove. Anyway, it doesn't really matter: the changes to llama-bench and common are good.
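For reference, the ggml_cpu_has_* probes in ggml.c are one-line compile-time checks. Here is a minimal sketch of the Kompute variant in that style, assuming the GGML_USE_KOMPUTE build flag; it mirrors the existing probes rather than quoting the PR's exact code:

```c
// Sketch of the ggml.c feature-probe pattern under discussion. It assumes
// the GGML_USE_KOMPUTE compile-time flag used by Kompute-enabled builds.
int ggml_cpu_has_kompute(void) {
#if defined(GGML_USE_KOMPUTE)
    return 1;
#else
    return 0;
#endif
}
```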
@slaren I removed that code from llama.cpp; does that seem right?
Somehow this PR hurts performance for the other Vulkan backend? This doesn't make sense when I look at the changes, but the difference from the previous commit on master is very significant. (It's not just with llama-bench.)

Before merge:
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
build: e0085fd (2026)

After merge:
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
build: e8dc55d (2027)

EDIT: reverting 3536cf6 fixes it.
The only way I can imagine this could make a difference is if there is a large overhead for launching the extra threads for the get_rows operation that still runs on the CPU. Are you on Windows?
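To put a rough number on that hypothesis, here is a small stand-alone micro-benchmark (not llama.cpp code; the thread count and iteration count are arbitrary stand-ins) that measures the cost of spawning and joining a worker pool once per simulated op, which is where per-op thread-launch overhead would show up:

```cpp
// Hypothetical micro-benchmark for the thread-launch-overhead theory:
// time how much spawning and joining a pool of worker threads per "op"
// costs. Results will vary by OS and scheduler.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int n_threads = 8;    // assumed per-op thread count
    const int n_iters   = 1000; // one spawn/join cycle per simulated op

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_iters; ++i) {
        std::vector<std::thread> workers;
        for (int t = 0; t < n_threads; ++t) {
            workers.emplace_back([]{ /* trivial work, overhead only */ });
        }
        for (auto & w : workers) {
            w.join();
        }
    }
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / n_iters;
    std::printf("avg spawn+join cost per op: %.1f us\n", us);
    return 0;
}
```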
I observe a similar drop in performance with the Kompute backend.

Latest master:

With 3536cf6 reverted:
Yes, on Windows 10.
Setting
I didn't realize that the Kompute backend should have been added in these places.
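For anyone following along, "these places" are the backend-reporting helpers that llama-bench and common use to label results. Below is a self-contained sketch of the kind of change involved; the helper name, backend strings, and stubbed probes are illustrative, not the exact code in the repository:

```cpp
#include <cstdio>
#include <string>

// Stand-ins for the ggml feature probes (in the real tree these live in
// ggml.c and reflect compile-time flags such as GGML_USE_KOMPUTE).
static int ggml_cpu_has_cublas(void)  { return 0; }
static int ggml_cpu_has_vulkan(void)  { return 0; }
static int ggml_cpu_has_kompute(void) { return 1; } // pretend a Kompute build

// Sketch of a llama-bench-style backend-name helper; the real helper and
// its ordering may differ.
static std::string get_backend_name() {
    if (ggml_cpu_has_cublas())  return "CUDA";
    if (ggml_cpu_has_vulkan())  return "Vulkan";
    if (ggml_cpu_has_kompute()) return "Kompute"; // the branch this PR adds
    return "CPU";
}

int main() {
    std::printf("backend: %s\n", get_backend_name().c_str());
    return 0;
}
```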