
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot #11917

Merged (5 commits) on Feb 20, 2025

Conversation

@Vithulep (Contributor) commented on Feb 17, 2025:

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q3_K_q8_K vector dot on the Arm architecture. Similar proposals for SVE support were made in PR #7433 and #11227.

This PR contains the SVE implementation of the vector dot product used to compute the Q3_K quantization.
Accuracy and performance were measured by running a Q3_K-quantized mistral-7b-v01 model on Graviton 3 (Perf 01 XL).
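To illustrate what such a kernel computes, here is a minimal scalar sketch of a block-quantized vector dot product. This is NOT the real ggml Q3_K/Q8_K block layout (the actual format packs 3-bit quants with high-bit masks and 6-bit sub-block scales); it only shows the general K-quant pattern that the SVE kernel vectorizes — integer multiply-accumulates of low-bit weight quants against int8 activation quants, with the block scales applied once per accumulator:

```python
def dot_quant_sketch(wq, wscale, aq, ascale, block_size):
    """Illustrative block-quantized dot product (hypothetical layout).

    wq, aq      -- flat lists of small-integer quants (weights, activations)
    wscale,     -- one float scale per block of `block_size` quants
    ascale
    """
    nblocks = len(wq) // block_size
    total = 0.0
    for b in range(nblocks):
        # integer accumulator per block -- this inner loop is what the
        # SVE kernel replaces with predicated loads and SDOT instructions
        acc = 0
        for i in range(block_size):
            acc += wq[b * block_size + i] * aq[b * block_size + i]
        # dequantize once per block, not once per element
        total += wscale[b] * ascale[b] * acc
    return total

# toy data: two blocks of four quants each
wq = [1, -2, 3, 0,  2, 2, -1, 1]
aq = [4,  1, 2, 5, -3, 2,  2, 0]
ws = [0.5, 0.25]
as_ = [2.0, 4.0]
print(dot_quant_sketch(wq, ws, aq, as_, 4))  # 0.5*2*8 + 0.25*4*(-4) = 4.0
```

Keeping the accumulation in integers and deferring the float scale to block granularity is what makes these kernels a good fit for the SDOT-style instructions on both NEON and SVE.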

Performance

With this PR, the SVE implementation is ~1.02x to ~1.15x faster than the NEON implementation.

  • Decoding Throughput (TPOT)

| Threads | NEON (original) | This PR (SVE) | Ratio |
|--------:|----------------:|--------------:|------:|
|       2 |            4.21 |          4.86 |  1.15 |
|       4 |            8.26 |          9.37 |  1.13 |
|       8 |           15.90 |         17.49 |  1.10 |
|      16 |           29.09 |         31.05 |  1.06 |
|      32 |           42.59 |         43.80 |  1.03 |
|      48 |           48.36 |         49.41 |  1.02 |
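The ratio column can be cross-checked from the two throughput columns; a quick sketch, with the values copied from the table above:

```python
# Recompute the SVE/NEON speedup ratios from the decoding-throughput table.
neon = {2: 4.21, 4: 8.26, 8: 15.90, 16: 29.09, 32: 42.59, 48: 48.36}
sve  = {2: 4.86, 4: 9.37, 8: 17.49, 16: 31.05, 32: 43.80, 48: 49.41}

ratios = {t: sve[t] / neon[t] for t in neon}
for t, r in sorted(ratios.items()):
    print(f"{t:2d} threads: x{r:.2f}")
```

Note that the speedup narrows as the thread count grows, from ~1.15x at 2 threads down to ~1.02x at 48.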

The command used to measure the performance is:

```
./llama-bench -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48
```

Perplexity

I also verified that perplexity matches between the NEON and SVE implementations.

| NEON (original) | SVE (this PR) |
|-----------------|---------------|
| 2.9394 +/- 0.35779 | 2.9394 +/- 0.35779 |

@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Feb 17, 2025
@ggerganov (Member) left a comment:

Improve the formatting of the code to be more consistent with the rest of the code. I've given a few hints below.

@Vithulep (Contributor, Author) replied:

> Improve the formatting of the code to be more consistent with the rest of the code. I've given a few hints below.

Thank you. Improved the code formatting for consistency.

@ggerganov (Member) left a comment:

I haven't run any tests myself, so I'm taking a small leap of faith here and assuming you've done all the necessary tests for this change.

@ggerganov ggerganov merged commit 4806498 into ggml-org:master Feb 20, 2025
45 checks passed
@Vithulep (Contributor, Author) replied:

> I haven't run any tests myself, so I'm taking a small leap of faith here and assuming you've done all the necessary tests for this change.

Thank you! We've done all the necessary tests for this change.
