
July 2024: MoE performance comparison


MoE models

There is PR-6840 from Justine Tunney in llama.cpp, but it has been open without being merged since April 23, so I'll compare performance against the master branch for Mixtral-8x7B. As quantizing Mixtral-8x7B is quite a lengthy process, the following table shows data only for Q4_K_S (a commonly used 4-bit k-quant), Q5_0 (a 5-bit legacy quant), and IQ3_XXS (a 3-bit i-quant).

| model | size | backend | threads | test | t/s (llama.cpp) | t/s (iqk_mul_mat) | Speedup |
| ----- | ---- | ------- | ------: | ---- | --------------: | ----------------: | ------: |
| 8x7B Q4_K_S | 48.75 GiB | AVX2 | 16 | pp512 | 54.92 ± 0.23 | 102.94 ± 0.37 | 1.874 |
| | | NEON | 8 | pp512 | 23.54 ± 1.56 | 38.32 ± 0.54 | 1.628 |
| | | AVX2 | 4 | tg128 | 7.80 ± 0.07 | 7.83 ± 0.09 | 1.004 |
| | | NEON | 8 | tg128 | 14.95 ± 0.25 | 15.28 ± 0.24 | 1.022 |
| 8x7B IQ3_XXS | 33.07 GiB | AVX2 | 16 | pp512 | 17.58 ± 0.04 | 68.45 ± 0.22 | 3.894 |
| | | NEON | 8 | pp512 | 7.75 ± 0.04 | 34.67 ± 0.40 | 4.474 |
| | | AVX2 | 4 | tg128 | 4.60 ± 0.01 | 5.45 ± 0.09 | 1.185 |
| | | AVX2 | 8 | tg128 | 8.04 ± 0.65 | 9.83 ± 0.06 | 1.223 |
| | | AVX2 | 16 | tg128 | 10.42 ± 0.01 | 10.57 ± 0.01 | 1.014 |
| | | NEON | 8 | tg128 | 6.19 ± 1.16 | 7.27 ± 0.14 | 1.174 |
| 8x7B Q5_0 | 59.11 GiB | AVX2 | 16 | pp512 | 29.06 ± 0.43 | 62.67 ± 0.32 | 2.157 |
| | | NEON | 8 | pp512 | 15.17 ± 0.51 | 27.36 ± 1.03 | 1.804 |
| | | AVX2 | 4 | tg128 | 5.44 ± 0.10 | 6.81 ± 0.06 | 1.252 |
| | | NEON | 8 | tg128 | 12.03 ± 0.77 | 12.41 ± 1.27 | 1.032 |
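
For reference, the Speedup column is simply the ratio of the two t/s measurements (pp512 is prompt processing of a 512-token prompt, tg128 is generation of 128 tokens). A minimal Python sketch of that arithmetic, with a few rows from the table above hard-coded for illustration:

```python
# Minimal sketch: Speedup = t/s(iqk_mul_mat) / t/s(llama.cpp).
# The rows below are copied from the table above; only the means are used,
# the reported uncertainties are ignored.
rows = [
    # (quant, backend, threads, test, t/s llama.cpp, t/s iqk_mul_mat)
    ("Q4_K_S",  "AVX2", 16, "pp512", 54.92, 102.94),
    ("Q4_K_S",  "NEON",  8, "tg128", 14.95,  15.28),
    ("IQ3_XXS", "NEON",  8, "pp512",  7.75,  34.67),
    ("Q5_0",    "AVX2", 16, "pp512", 29.06,  62.67),
]

for quant, backend, threads, test, t_master, t_iqk in rows:
    speedup = t_iqk / t_master
    print(f"{quant:8s} {backend:4s} t={threads:2d} {test}: {speedup:.3f}")
```

Running this reproduces the Speedup values shown in the table (1.874, 1.022, 4.474, 2.157 for the sample rows).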