
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot #11917

Merged (5 commits) on Feb 20, 2025

Conversation

@Vithulep (Contributor) commented on Feb 17, 2025:

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q3_K_q8_K vector dot on the Arm architecture. Similar proposals for SVE support were made in PR #7433 and #11227.

This PR contains the SVE implementation of the vector dot product used to compute the Q3_K quantization.
Accuracy and performance were measured by running a Q3_K-quantized mistral-7b-v01 model on Graviton 3 (Perf 01 XL).
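To illustrate what such a kernel computes, here is a minimal scalar sketch of a block-quantized vector dot product. This is NOT the real ggml Q3_K/Q8_K block layout (the actual format packs 3-bit quants with high-bit masks and 6-bit sub-block scales); it only shows the general K-quant pattern that the SVE kernel vectorizes — integer multiply-accumulates of low-bit weight quants against int8 activation quants, with the block scales applied once per accumulator:

```python
def dot_quant_sketch(wq, wscale, aq, ascale, block_size):
    """Illustrative block-quantized dot product (hypothetical layout).

    wq, aq      -- flat lists of small-integer quants (weights, activations)
    wscale,     -- one float scale per block of `block_size` quants
    ascale
    """
    nblocks = len(wq) // block_size
    total = 0.0
    for b in range(nblocks):
        # integer accumulator per block -- this inner loop is what the
        # SVE kernel replaces with predicated loads and SDOT instructions
        acc = 0
        for i in range(block_size):
            acc += wq[b * block_size + i] * aq[b * block_size + i]
        # dequantize once per block, not once per element
        total += wscale[b] * ascale[b] * acc
    return total

# toy data: two blocks of four quants each
wq = [1, -2, 3, 0,  2, 2, -1, 1]
aq = [4,  1, 2, 5, -3, 2,  2, 0]
ws = [0.5, 0.25]
as_ = [2.0, 4.0]
print(dot_quant_sketch(wq, ws, aq, as_, 4))  # 0.5*2*8 + 0.25*4*(-4) = 4.0
```

Keeping the accumulation in integers and deferring the float scale to block granularity is what makes these kernels a good fit for the SDOT-style instructions on both NEON and SVE.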

Performance

With this PR, the SVE implementation is ~1.02x to ~1.15x faster than the NEON implementation.

  • Decoding Throughput (TPOT)

| Threads | NEON (original) | This PR (SVE) | Ratio |
|--------:|----------------:|--------------:|------:|
|       2 |            4.21 |          4.86 |  1.15 |
|       4 |            8.26 |          9.37 |  1.13 |
|       8 |           15.90 |         17.49 |  1.10 |
|      16 |           29.09 |         31.05 |  1.06 |
|      32 |           42.59 |         43.80 |  1.03 |
|      48 |           48.36 |         49.41 |  1.02 |
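The ratio column can be cross-checked from the two throughput columns; a quick sketch, with the values copied from the table above:

```python
# Recompute the SVE/NEON speedup ratios from the decoding-throughput table.
neon = {2: 4.21, 4: 8.26, 8: 15.90, 16: 29.09, 32: 42.59, 48: 48.36}
sve  = {2: 4.86, 4: 9.37, 8: 17.49, 16: 31.05, 32: 43.80, 48: 49.41}

ratios = {t: sve[t] / neon[t] for t in neon}
for t, r in sorted(ratios.items()):
    print(f"{t:2d} threads: x{r:.2f}")
```

Note that the speedup narrows as the thread count grows, from ~1.15x at 2 threads down to ~1.02x at 48.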

The command used to measure the performance is:

```
./llama-bench -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48
```

Perplexity

I also verified that perplexity matches between the NEON and SVE implementations.

| NEON (original) | SVE (this PR) |
|-----------------|---------------|
| 2.9394 +/- 0.35779 | 2.9394 +/- 0.35779 |

@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Feb 17, 2025
@ggerganov (Member) left a comment:

Improve the formatting of the code to be more consistent with the rest of the code. I've given a few hints below.

@Vithulep (Contributor, Author) replied:

> Improve the formatting of the code to be more consistent with the rest of the code. I've given a few hints below.

Thank you. Improved the code formatting for consistency.

@ggerganov (Member) left a comment:

I haven't run any tests myself, so I'm taking a small leap of faith here and assuming you've done all the necessary tests for this change.

@ggerganov ggerganov merged commit 4806498 into ggml-org:master Feb 20, 2025
45 checks passed
@Vithulep (Contributor, Author) replied:

> I haven't run any tests myself, so I'm taking a small leap of faith here and assuming you've done all the necessary tests for this change.

Thank you! We've done all the necessary tests for this change.
