
AVX BF16 and single scale quant optimizations #10212

Merged: 22 commits merged into ggerganov:master from netrunnereve:avx_opt on Nov 15, 2024

Conversation

netrunnereve (Collaborator)

This PR adds AVX support for BF16, along with a faster and cleaner version of the Q4_0, Q8_0, and IQ4_NL ggml_vec_dot. I don't really use those old quants myself, but they're easy to understand and implement, and any learnings can be applied to the K-quants as well.

As I mentioned in #10118, a lot of the changes here should also be applicable to the AVX2 implementation if someone wants to work on that.
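
For context on how the BF16 path works: a bf16 value is just the upper 16 bits of an f32, so widening to f32 is a zero-fill of the low half, and the dot product can then run as ordinary f32 math with 128-bit loads and two accumulators. Below is a minimal sketch of that idea using SSE/AVX intrinsics; the function names and loop structure are mine, not the actual ggml code.

```c
#include <immintrin.h>
#include <stdint.h>

// Widen 4 bf16 values to 4 f32 by placing each 16-bit value in the
// upper half of a 32-bit lane and zeroing the lower half.
static inline __m128 bf16x4_to_f32(const uint16_t * x) {
    __m128i v = _mm_loadl_epi64((const __m128i *) x);        // 4 x bf16 (64-bit load)
    __m128i w = _mm_unpacklo_epi16(_mm_setzero_si128(), v);  // 0x0000 | bf16 -> f32 bit pattern
    return _mm_castsi128_ps(w);
}

// Dot product of n bf16 weights against n f32 activations, with two
// accumulators to hide FP add latency. Assumes n is a multiple of 8.
static float vec_dot_bf16_f32_sketch(int n, const uint16_t * x, const float * y) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(bf16x4_to_f32(x + i),     _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(bf16x4_to_f32(x + i + 4), _mm_loadu_ps(y + i + 4)));
    }
    __m128 acc = _mm_add_ps(acc0, acc1);
    acc = _mm_hadd_ps(acc, acc);  // horizontal reduction to a single float
    acc = _mm_hadd_ps(acc, acc);
    return _mm_cvtss_f32(acc);
}
```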

Benchmarks (Llamafile turned off)

| model | size | params | backend | threads | test | t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 9.56 ± 0.01 | |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.02 ± 0.01 | |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 10.06 ± 0.03 | 5% |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.30 ± 0.01 | 4% |
| llama 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 8.03 B | CPU | 8 | pp512 | 9.18 ± 0.08 | |
| llama 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 8.03 B | CPU | 8 | tg128 | 6.57 ± 0.04 | |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 8 | pp512 | 9.87 ± 0.03 | 8% |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 8 | tg128 | 7.19 ± 0.04 | 9% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CPU | 8 | pp512 | 8.53 ± 0.04 | |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CPU | 8 | tg128 | 4.81 ± 0.00 | |
| llama 8B Q8_0 (PR) | 7.95 GiB | 8.03 B | CPU | 8 | pp512 | 10.81 ± 0.01 | 27% |
| llama 8B Q8_0 (PR) | 7.95 GiB | 8.03 B | CPU | 8 | tg128 | 4.95 ± 0.00 | 3% |

For BF16 the bench on master takes far too long to run, so here's a perf report instead.

Master
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    138 runs -  8909.02 us/run - 117.44 MFLOP/run -  13.18 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1 runs - 4143355.00 us/run -  60.13 GFLOP/run -  14.51 GFLOPS

PR
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    414 runs -  2843.62 us/run - 117.44 MFLOP/run -  41.30 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2 runs - 780635.00 us/run -  60.13 GFLOP/run -  77.03 GFLOPS

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Nov 8, 2024
netrunnereve marked this pull request as draft November 12, 2024 22:29

netrunnereve (Collaborator, Author)

With Q4_0, and only Q4_0, we can optimize this further by adding the 16-bit products together, which lets us do fewer conversions to 32-bit. With IQ4_NL and Q8_0 there's an overflow risk if the quantized weights are right at the 8-bit limit; see the sketch after the table below.

| model | size | params | backend | threads | test | t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 9.56 ± 0.01 | |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.02 ± 0.01 | |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 10.65 ± 0.05 | 11% |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.70 ± 0.01 | 10% |
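
To illustrate the 16-bit trick, here is a hypothetical sketch (not the actual ggml kernel): with Q4_0 each weight nibble expands to at most 15 as an unsigned byte, so the 16-bit pair sums produced by maddubs stay far below INT16_MAX and two blocks can be summed in 16-bit before a single widening madd. With Q8_0/IQ4_NL the weights can reach ±127, so the same 16-bit add could overflow and each block has to be widened to 32-bit on its own.

```c
#include <immintrin.h>

// w0/w1: Q4_0 weight nibbles expanded to unsigned bytes 0..15 (the -8 bias is
//        assumed to be handled elsewhere); q0/q1: signed int8 activations.
static inline __m128i dot_q4_two_blocks_sketch(__m128i w0, __m128i q0,
                                               __m128i w1, __m128i q1) {
    __m128i p0 = _mm_maddubs_epi16(w0, q0);      // 8 x i16 pair sums, |sum| <= 2*15*128
    __m128i p1 = _mm_maddubs_epi16(w1, q1);
    __m128i p  = _mm_add_epi16(p0, p1);          // still well under 32767: safe for Q4_0 only
    return _mm_madd_epi16(p, _mm_set1_epi16(1)); // one widening step to 4 x i32
}
```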

netrunnereve marked this pull request as ready for review November 13, 2024 00:12
slaren merged commit 1842922 into ggerganov:master Nov 15, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* use 128 bit loads (i've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kept for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
netrunnereve deleted the avx_opt branch November 16, 2024 01:50
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 17, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024