AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring #1099
Except for perplexity, the performance looks good compared to q4_1. I'm not sure why there is a discrepancy there.
Before merging this: the current time per token on M1 Pro:
I want to get it close to ~50-60 ms / token. I will try to optimize this with the highest priority, so we can decide on the final
Well, #1083 was a bit rushed IMO, but I tried to address the loose ends. For the horizontal sum of ints, I could not see a difference in speed between @ikawrakow's original code and @pubby's suggestion, which ended up as commented-out code. The latter is AVX2-only, while the original should also work on AVX.
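For readers unfamiliar with the pattern, here is a minimal sketch of a horizontal sum of eight 32-bit ints that needs only AVX plus SSE2 instructions. It is not the code from this PR: the name `hsum_i32_8_demo` and the array-based signature are assumptions made for illustration, and a scalar fallback is included so the snippet also builds without `-mavx`.

```c
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: sum the eight 32-bit integers stored in v.
static inline int32_t hsum_i32_8_demo(const int32_t v[8]) {
#if defined(__AVX__)
    // Unaligned 256-bit load (AVX).
    const __m256i a = _mm256_loadu_si256((const __m256i *)v);
    // Add the upper and lower 128-bit halves: 8 lanes -> 4 lanes.
    const __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(a),
                                         _mm256_extractf128_si256(a, 1));
    // Add the upper 64 bits onto the lower 64 bits: 4 lanes -> 2 lanes.
    const __m128i hi64  = _mm_unpackhi_epi64(sum128, sum128);
    const __m128i sum64 = _mm_add_epi32(sum128, hi64);
    // Swap adjacent 32-bit lanes and add: 2 lanes -> 1 lane.
    const __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_cvtsi128_si32(_mm_add_epi32(sum64, hi32));
#else
    // Scalar fallback for builds without AVX.
    int32_t s = 0;
    for (int i = 0; i < 8; ++i) {
        s += v[i];
    }
    return s;
#endif
}
```

Only the 256-bit load, cast and extract are AVX instructions here; the additions and shuffles operate on 128-bit registers, which is why a reduction of this shape does not require AVX2.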
Finally, I don't think there is a speed difference in the horizontal sums. I have now finished the AVX optimization for
The `Q4_3` format will apparently remain unchanged from what is on `master`, so let's merge this.
If the AVX-only path has issues, we will resolve them later.
After the merge, I will try to rebase #1109 and merge it as well.
Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into `inline` functions.
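As a rough illustration of that kind of refactoring (a sketch, not the PR's actual diff), a reduction sequence that would otherwise be repeated at the end of every kernel can live in one `inline` helper. All names below (`hsum_float_8_demo`, `dot_f32_demo`, `sum_f32_demo`) are hypothetical, and the code assumes `n` is a multiple of 8.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>

// Shared helper: reduce eight packed floats to a single scalar.
static inline float hsum_float_8_demo(const __m256 x) {
    __m128 res = _mm256_extractf128_ps(x, 1);          // upper 4 floats
    res = _mm_add_ps(res, _mm256_castps256_ps128(x));  // + lower 4 floats
    res = _mm_add_ps(res, _mm_movehl_ps(res, res));    // 4 -> 2 partial sums
    res = _mm_add_ss(res, _mm_movehdup_ps(res));       // 2 -> 1
    return _mm_cvtss_f32(res);
}
#endif

// Dot product of two float arrays; n is assumed to be a multiple of 8.
static float dot_f32_demo(const float *a, const float *b, size_t n) {
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    }
    return hsum_float_8_demo(acc);  // same helper as in sum_f32_demo
#else
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
#endif
}

// Plain sum of a float array; n is assumed to be a multiple of 8.
static float sum_f32_demo(const float *a, size_t n) {
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
    }
    return hsum_float_8_demo(acc);  // reduction is written only once
#else
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += a[i];
    return s;
#endif
}
```

Both kernels end with the same call, so a fix or speedup to the reduction only has to be made in one place.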