
AVX BF16 and single scale quant optimizations #10212

Merged: 22 commits merged into ggerganov:master from netrunnereve:avx_opt on Nov 15, 2024

Conversation

netrunnereve (Collaborator)

This PR adds AVX support for BF16, along with a faster and cleaner version of the Q4_0, Q8_0, and IQ4_NL ggml_vec_dot. I don't really use those old quants myself, but they're easy to understand and implement, and any learnings can be applied to the K-quants as well.

As I mentioned in #10118, a lot of the changes here should also be applicable to the AVX2 implementation if someone wants to work on that.
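
For context on how the BF16 path works: a bf16 value is just the upper 16 bits of an f32, so widening to f32 is a zero-fill of the low half, and the dot product can then run as ordinary f32 math with 128-bit loads and two accumulators. Below is a minimal sketch of that idea using SSE/AVX intrinsics; the function names and loop structure are mine, not the actual ggml code.

```c
#include <immintrin.h>
#include <stdint.h>

// Widen 4 bf16 values to 4 f32 by placing each 16-bit value in the
// upper half of a 32-bit lane and zeroing the lower half.
static inline __m128 bf16x4_to_f32(const uint16_t * x) {
    __m128i v = _mm_loadl_epi64((const __m128i *) x);        // 4 x bf16 (64-bit load)
    __m128i w = _mm_unpacklo_epi16(_mm_setzero_si128(), v);  // 0x0000 | bf16 -> f32 bit pattern
    return _mm_castsi128_ps(w);
}

// Dot product of n bf16 weights against n f32 activations, with two
// accumulators to hide FP add latency. Assumes n is a multiple of 8.
static float vec_dot_bf16_f32_sketch(int n, const uint16_t * x, const float * y) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(bf16x4_to_f32(x + i),     _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(bf16x4_to_f32(x + i + 4), _mm_loadu_ps(y + i + 4)));
    }
    __m128 acc = _mm_add_ps(acc0, acc1);
    acc = _mm_hadd_ps(acc, acc);  // horizontal reduction to a single float
    acc = _mm_hadd_ps(acc, acc);
    return _mm_cvtss_f32(acc);
}
```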

Benchmarks (Llamafile turned off)

| model | size | params | backend | threads | test | t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 9.56 ± 0.01 | |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.02 ± 0.01 | |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 10.06 ± 0.03 | 5% |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.30 ± 0.01 | 4% |
| llama 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 8.03 B | CPU | 8 | pp512 | 9.18 ± 0.08 | |
| llama 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 8.03 B | CPU | 8 | tg128 | 6.57 ± 0.04 | |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 8 | pp512 | 9.87 ± 0.03 | 8% |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU | 8 | tg128 | 7.19 ± 0.04 | 9% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CPU | 8 | pp512 | 8.53 ± 0.04 | |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CPU | 8 | tg128 | 4.81 ± 0.00 | |
| llama 8B Q8_0 (PR) | 7.95 GiB | 8.03 B | CPU | 8 | pp512 | 10.81 ± 0.01 | 27% |
| llama 8B Q8_0 (PR) | 7.95 GiB | 8.03 B | CPU | 8 | tg128 | 4.95 ± 0.00 | 3% |

For BF16 the bench on master takes far too long to run, so here's a perf report instead.

Master
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    138 runs -  8909.02 us/run - 117.44 MFLOP/run -  13.18 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1 runs - 4143355.00 us/run -  60.13 GFLOP/run -  14.51 GFLOPS

PR
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    414 runs -  2843.62 us/run - 117.44 MFLOP/run -  41.30 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    2 runs - 780635.00 us/run -  60.13 GFLOP/run -  77.03 GFLOPS

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Nov 8, 2024
netrunnereve marked this pull request as draft November 12, 2024 22:29

netrunnereve (Collaborator, Author)

With Q4_0, and only Q4_0, we can optimize this further by adding the 16-bit products together, which lets us do fewer conversions to 32-bit. With IQ4_NL and Q8_0 there's an overflow risk if the quantized weights are right at the 8-bit limit; see the sketch after the table below.

| model | size | params | backend | threads | test | t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 9.56 ± 0.01 | |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.02 ± 0.01 | |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | pp512 | 10.65 ± 0.05 | 11% |
| llama 8B Q4_0 (PR) | 4.33 GiB | 8.03 B | CPU | 8 | tg128 | 7.70 ± 0.01 | 10% |
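
To illustrate the 16-bit trick, here is a hypothetical sketch (not the actual ggml kernel): with Q4_0 each weight nibble expands to at most 15 as an unsigned byte, so the 16-bit pair sums produced by maddubs stay far below INT16_MAX and two blocks can be summed in 16-bit before a single widening madd. With Q8_0/IQ4_NL the weights can reach ±127, so the same 16-bit add could overflow and each block has to be widened to 32-bit on its own.

```c
#include <immintrin.h>

// w0/w1: Q4_0 weight nibbles expanded to unsigned bytes 0..15 (the -8 bias is
//        assumed to be handled elsewhere); q0/q1: signed int8 activations.
static inline __m128i dot_q4_two_blocks_sketch(__m128i w0, __m128i q0,
                                               __m128i w1, __m128i q1) {
    __m128i p0 = _mm_maddubs_epi16(w0, q0);      // 8 x i16 pair sums, |sum| <= 2*15*128
    __m128i p1 = _mm_maddubs_epi16(w1, q1);
    __m128i p  = _mm_add_epi16(p0, p1);          // still well under 32767: safe for Q4_0 only
    return _mm_madd_epi16(p, _mm_set1_epi16(1)); // one widening step to 4 x i32
}
```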

netrunnereve marked this pull request as ready for review November 13, 2024 00:12
slaren merged commit 1842922 into ggerganov:master Nov 15, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* use 128 bit loads (i've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kept for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
netrunnereve deleted the avx_opt branch November 16, 2024 01:50
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 17, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024