ggml : move AMX to the CPU backend #10570
Conversation
slaren commented on Nov 28, 2024 (edited)
- Move AMX code to CPU backend
- Enable disabled types in AMX backend (Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_XS)
- Change C++ standard to C++17
- Enable ccache for HIP windows CI
/* .get_tensor = */ ggml_backend_amx_buffer_get_tensor,
/* .cpy_tensor = */ ggml_backend_amx_buffer_cpy_tensor,
Can this function even be called now that this is an extra CPU buffer type and only weights can be stored in it?
It's not called at the moment, but it doesn't hurt to have it.
OK, I was just asking so as not to break everything in my PR.
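For context on what these entries are: a ggml buffer type implements a small table of function pointers, and `get_tensor` / `cpy_tensor` are two slots in that table. A paraphrased sketch of the interface (member names follow `ggml-backend-impl.h` around this time; the exact set of members may differ between versions):

```cpp
#include <cstddef>
#include <cstdint>

// Opaque types, forward-declared for the sketch.
struct ggml_tensor;
typedef struct ggml_backend_buffer * ggml_backend_buffer_t;

// Function-pointer table implemented by each buffer type.
struct ggml_backend_buffer_i {
    void   (*free_buffer)(ggml_backend_buffer_t buffer);
    void * (*get_base)   (ggml_backend_buffer_t buffer);
    void   (*init_tensor)(ggml_backend_buffer_t buffer, ggml_tensor * tensor);
    void   (*set_tensor) (ggml_backend_buffer_t buffer, ggml_tensor * tensor,
                          const void * data, size_t offset, size_t size);
    // The two entries quoted above: read tensor data back out of the
    // buffer, and copy a tensor between buffers.
    void   (*get_tensor) (ggml_backend_buffer_t buffer, const ggml_tensor * tensor,
                          void * data, size_t offset, size_t size);
    bool   (*cpy_tensor) (ggml_backend_buffer_t buffer, const ggml_tensor * src,
                          ggml_tensor * dst);
    void   (*clear)      (ggml_backend_buffer_t buffer, uint8_t value);
};
```

Keeping `get_tensor`/`cpy_tensor` populated means reads and copies out of the AMX buffer keep working even if no current code path exercises them.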
int tbegin, tend;
balance211(n, params->nth, params->ith, tbegin, tend);
f(tbegin, tend);
ggml_barrier(params->threadpool); // TODO: might not always be needed
Why not simply remove the ggml_barrier here and add it after parallel_for_ggml where needed?
It's not really needed in any case; I will just remove it in #10606.
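For context, the quoted helper splits `n` work items across the ggml threadpool with a balance211-style partition and then runs the kernel on this thread's slice. A minimal self-contained sketch of that split (assumed semantics, matching the usual balance211 convention; the real helper lives in the AMX code):

```cpp
#include <cstdio>

// Split n items among nth threads as evenly as possible: every thread
// gets n / nth items, and the first n % nth threads get one extra.
static void balance211(int n, int nth, int ith, int & tbegin, int & tend) {
    const int base = n / nth;
    const int rem  = n % nth;
    tbegin = ith * base + (ith < rem ? ith : rem);
    tend   = tbegin + base + (ith < rem ? 1 : 0);
}

int main() {
    const int n = 10, nth = 4;
    for (int ith = 0; ith < nth; ++ith) {
        int tbegin, tend;
        balance211(n, nth, ith, tbegin, tend);
        std::printf("thread %d: [%d, %d)\n", ith, tbegin, tend);
    }
    // prints ranges of sizes 3, 3, 2, 2
}
```

Dropping the trailing `ggml_barrier` then simply makes the caller responsible for synchronizing before any step that consumes all slices.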
@@ -2379,7 +2400,7 @@ void ggml_backend_amx_mul_mat(ggml_backend_amx_context * ctx, struct ggml_tensor
     const int MB = div_up(M, BLOCK_M);
     const int NB = div_up(N, BLOCK_N);

-    parallel_for(n_threads, MB * NB, [&](int begin, int end) {
+    parallel_for_ggml(params, MB * NB, [&](int begin, int end) {
Is this fp16 matmul faster than the llamafile fp16 sgemm?
If not, now that this is part of the CPU backend, it may not be needed.
Yes, I believe it is significantly faster than llamafile sgemm. In my tests it's about 40% faster at pp512.
Strange, it doesn't use AMX, and yet it looks like the same approach as tinyblas. I'll have to look at this more closely 😎
It does use AVX512, although the implementation looks a lot simpler than in sgemm.
llama.cpp/ggml/src/ggml-cpu/amx/mmq.cpp
Lines 1332 to 1372 in 3420909
template <int BLOCK_M, int BLOCK_N, int BLOCK_K>
struct tinygemm_kernel_avx<float, ggml_fp16_t, float, BLOCK_M, BLOCK_N, BLOCK_K> {
    static void apply(int K, const float * RESTRICT A, const ggml_fp16_t * RESTRICT B, float * RESTRICT C, int ldc) {
        constexpr int ROWS = BLOCK_M;
        constexpr int COLS = BLOCK_N;
        assert(BLOCK_K == 16);

        __m512 va;
        __m512 vb[COLS];
        __m512 vc[ROWS * COLS];

        auto loadc = [&](auto idx) {
            vc[idx] = _mm512_setzero_ps();
        };
        Unroll<ROWS * COLS>{}(loadc);

        auto compute = [&](auto idx, auto k) {
            constexpr int row = idx / COLS;
            constexpr int col = idx % COLS;

            if constexpr (col == 0) {
                va = _mm512_loadu_ps(A + row * K + k);
            }
            if constexpr (row == 0) {
                vb[col] = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)(B + col * K + k)));
            }
            vc[idx] = _mm512_fmadd_ps(va, vb[col], vc[idx]);
        };

        for (int k = 0; k < K; k += 16) {
            Unroll<ROWS * COLS>{}(compute, k);
        }

        auto storec = [&](auto idx) {
            constexpr int row = idx / COLS;
            constexpr int col = idx % COLS;
            C[row * ldc + col] = _mm512_reduce_add_ps(vc[idx]);
        };
        Unroll<ROWS * COLS>{}(storec);
    }
};
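The `Unroll<N>{}(f, ...)` pattern above forces compile-time expansion of the lambda over indices `0..N-1`, so that `idx` stays usable inside `if constexpr`. A hypothetical sketch of such a helper (the real definition lives elsewhere in mmq.cpp), which also illustrates why the C++17 bump is convenient here, since it relies on fold expressions:

```cpp
#include <utility>

template <int N>
struct Unroll {
    template <typename Fn, typename... Args>
    void operator()(const Fn & f, Args... args) const {
        expand(std::make_integer_sequence<int, N>{}, f, args...);
    }

private:
    template <typename Fn, int... Is, typename... Args>
    static void expand(std::integer_sequence<int, Is...>, const Fn & f, Args... args) {
        // Fold expression: expands to f(0, ...), f(1, ...), ..., f(N-1, ...),
        // each index wrapped in an integral_constant so it remains constexpr.
        (f(std::integral_constant<int, Is>{}, args...), ...);
    }
};
```

With this shape, the `compute` lambda receives `idx` as an `std::integral_constant<int, I>`, which converts to `int` in constant expressions, so `constexpr int row = idx / COLS;` compiles.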
llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp
Lines 421 to 445 in 3420909
template <int RM, int RN>
NOINLINE void gemm(int64_t m0, int64_t m, int64_t n0, int64_t n) {
    int64_t ytiles = (m - m0) / RM;
    int64_t xtiles = (n - n0) / RN;
    int64_t tiles = xtiles * ytiles;
    int64_t duty = (tiles + nth - 1) / nth;
    int64_t start = duty * ith;
    int64_t end = start + duty;
    if (end > tiles)
        end = tiles;
    for (int64_t job = start; job < end; ++job) {
        int64_t ii = m0 + job / xtiles * RM;
        int64_t jj = n0 + job % xtiles * RN;
        D Cv[RN][RM] = {};
        for (int64_t l = 0; l < k; l += KN)
            for (int64_t j = 0; j < RN; ++j)
                for (int64_t i = 0; i < RM; ++i)
                    Cv[j][i] = madd(load<V>(A + lda * (ii + i) + l),
                                    load<V>(B + ldb * (jj + j) + l),
                                    Cv[j][i]);
        for (int64_t j = 0; j < RN; ++j)
            for (int64_t i = 0; i < RM; ++i)
                C[ldc * (jj + j) + (ii + i)] = hsum(Cv[j][i]);
    }
}
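For comparison, this partitioning hands each thread a fixed `ceil(tiles / nth)` slice of the linearized `RM x RN` output tiles, rather than a balance211-style split. A toy, self-contained rendering of the job-to-tile mapping (sizes and thread count are made up):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t m = 8, n = 6, RM = 4, RN = 3;  // toy matrix and tile sizes
    const int64_t ytiles = m / RM;               // tile rows
    const int64_t xtiles = n / RN;               // tile columns
    const int64_t tiles  = xtiles * ytiles;      // linearized job count
    const int64_t nth    = 2;                    // pretend two threads

    const int64_t duty = (tiles + nth - 1) / nth; // jobs per thread, rounded up
    for (int64_t ith = 0; ith < nth; ++ith) {
        const int64_t start = duty * ith;
        const int64_t end   = start + duty > tiles ? tiles : start + duty;
        for (int64_t job = start; job < end; ++job) {
            // job index -> top-left corner (ii, jj) of the output tile
            const int64_t ii = job / xtiles * RM;
            const int64_t jj = job % xtiles * RN;
            std::printf("thread %lld computes tile at (%lld, %lld)\n",
                        (long long) ith, (long long) ii, (long long) jj);
        }
    }
}
```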
I did not put AMX in the old patch because the 6th gen Xeon was not released at that moment. Since it is on sale now, I think it is good to add amx-f16 to the gemm.
👍 for "Change C++ standard to C++17"... Only ancient platforms don't support that these days, and probably ones you don't want to run AI workloads on anyway...
Big thumbs up for "Switch to C++17". Actually the forced
I also changed the parameters of the functions called by
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>