AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099

sw · 2023-04-21T09:25:15Z

Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into inline functions.

slaren · 2023-04-21T13:55:59Z

q4_3
42.94 seconds per pass - ETA 7.81 hours
prompt eval time = 54411.09 ms /   631 tokens (   86.23 ms per token) bs=512
prompt eval time = 59126.51 ms /   631 tokens (   93.70 ms per token) bs=8
       eval time = 41742.69 ms /   255 runs   (  163.70 ms per run)

q4_1
35.42 seconds per pass - ETA 6.45 hours
prompt eval time = 52762.72 ms /   631 tokens (   83.62 ms per token) bs=512
prompt eval time = 56287.43 ms /   631 tokens (   89.20 ms per token) bs=8
       eval time = 41024.48 ms /   255 runs   (  160.88 ms per run)

Except with perplexity the performance looks good compared to q4_1, not sure why there is a discrepancy there.

ggerganov · 2023-04-21T15:23:12Z

Before merging this: the current Q4_3 format / implementation is not very efficient with ARM NEON:

Time per token on M1 Pro:

Q4_0 : 48ms
Q4_1 : 55ms
Q4_2 : 48ms
Q4_3 : 87ms

I want to make it close to ~50-60 ms / token.
But I think we might have to change the format if the optimization from #1083 does not work out.

Will try to optimize this with highest priority, so we can decide on the final Q4_3 format

sw · 2023-04-21T16:04:29Z

Well #1083 was a bit rushed IMO, but I tried to address the loose ends.

For the horizontal sum of ints, I could not see a difference in speed between @ikawrakow's original code and @pubby's suggestion which ended up as commented-out code. The latter is AVX2-only, while the original should also work on AVX.

sw · 2023-04-21T16:35:26Z

Finally I don't think there is a speed difference in the horizontal sums. I have now finished the AVX optimization for quantize_row_q8_0, but I'm not sure I can trust the compiler to verify that I'm not accidentally using AVX2. It would be great if someone with an AVX-only machine could test this.

ggml.c

ggerganov

The Q4_3 format seems will remain unchanged as it is on master, so let's merge this.
If the AVX-only path has issues we will resolve later

After merge, will try to rebase #1109 and merge it as well

slaren approved these changes Apr 21, 2023

View reviewed changes

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring

63d8dff

sw force-pushed the q43-avx branch from a32f581 to 63d8dff Compare April 21, 2023 16:03

sw marked this pull request as draft April 21, 2023 16:14

finish AVX vectorization of quantize_row_q8_0

535ea47

sw marked this pull request as ready for review April 21, 2023 16:41

pubby reviewed Apr 21, 2023

View reviewed changes

ggml.c Outdated Show resolved Hide resolved

Rename hsum_int_8 to hsum_i32_8

3a5958b

sw mentioned this pull request Apr 21, 2023

ggml : alternative Q4_3 implementation using modified Q8_0 #1109

Merged

ggerganov approved these changes Apr 22, 2023

View reviewed changes

ggerganov merged commit c5aa5e5 into ggerganov:master Apr 22, 2023

sw deleted the q43-avx branch April 22, 2023 07:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099

sw commented Apr 21, 2023

slaren commented Apr 21, 2023

ggerganov commented Apr 21, 2023 •

edited

Loading

sw commented Apr 21, 2023 •

edited

Loading

sw commented Apr 21, 2023

ggerganov left a comment

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099

Conversation

sw commented Apr 21, 2023

slaren commented Apr 21, 2023

ggerganov commented Apr 21, 2023 • edited Loading

sw commented Apr 21, 2023 • edited Loading

sw commented Apr 21, 2023

ggerganov left a comment

Choose a reason for hiding this comment

ggerganov commented Apr 21, 2023 •

edited

Loading

sw commented Apr 21, 2023 •

edited

Loading