Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099

Merged
merged 3 commits into from
Apr 22, 2023

Conversation

sw
Copy link
Contributor

@sw sw commented Apr 21, 2023

Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into inline functions.

@slaren
Copy link
Collaborator

slaren commented Apr 21, 2023

q4_3
42.94 seconds per pass - ETA 7.81 hours
prompt eval time = 54411.09 ms /   631 tokens (   86.23 ms per token) bs=512
prompt eval time = 59126.51 ms /   631 tokens (   93.70 ms per token) bs=8
       eval time = 41742.69 ms /   255 runs   (  163.70 ms per run)

q4_1
35.42 seconds per pass - ETA 6.45 hours
prompt eval time = 52762.72 ms /   631 tokens (   83.62 ms per token) bs=512
prompt eval time = 56287.43 ms /   631 tokens (   89.20 ms per token) bs=8
       eval time = 41024.48 ms /   255 runs   (  160.88 ms per run)

Except with perplexity the performance looks good compared to q4_1, not sure why there is a discrepancy there.

@ggerganov
Copy link
Owner

ggerganov commented Apr 21, 2023

Before merging this: the current Q4_3 format / implementation is not very efficient with ARM NEON:

Time per token on M1 Pro:

  • Q4_0 : 48ms
  • Q4_1 : 55ms
  • Q4_2 : 48ms
  • Q4_3 : 87ms

I want to make it close to ~50-60 ms / token.
But I think we might have to change the format if the optimization from #1083 does not work out.

Will try to optimize this with highest priority, so we can decide on the final Q4_3 format

@sw
Copy link
Contributor Author

sw commented Apr 21, 2023

Well #1083 was a bit rushed IMO, but I tried to address the loose ends.

For the horizontal sum of ints, I could not see a difference in speed between @ikawrakow's original code and @pubby's suggestion which ended up as commented-out code. The latter is AVX2-only, while the original should also work on AVX.

@sw sw marked this pull request as draft April 21, 2023 16:14
@sw
Copy link
Contributor Author

sw commented Apr 21, 2023

Finally I don't think there is a speed difference in the horizontal sums. I have now finished the AVX optimization for quantize_row_q8_0, but I'm not sure I can trust the compiler to verify that I'm not accidentally using AVX2. It would be great if someone with an AVX-only machine could test this.

@sw sw marked this pull request as ready for review April 21, 2023 16:41
ggml.c Outdated Show resolved Hide resolved
Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Q4_3 format seems will remain unchanged as it is on master, so let's merge this.
If the AVX-only path has issues we will resolve later

After merge, will try to rebase #1109 and merge it as well

@ggerganov ggerganov merged commit c5aa5e5 into ggerganov:master Apr 22, 2023
@sw sw deleted the q43-avx branch April 22, 2023 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants