ggml : move AMX to the CPU backend #10570

Merged: slaren merged 2 commits into master from sl/dl-backend-3 on Nov 29, 2024
Conversation

slaren (Collaborator) commented Nov 28, 2024

  • Move AMX code to CPU backend
  • Enable disabled types in AMX backend (Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_XS)
  • Change C++ standard to C++17
  • Enable ccache for the HIP Windows CI

github-actions bot added the testing, Vulkan, examples, server and ggml labels on Nov 28, 2024
slaren force-pushed the sl/dl-backend-3 branch 2 times, most recently from a7c29b3 to 02b9c51 on November 28, 2024 19:06
slaren force-pushed the sl/dl-backend-3 branch 2 times, most recently from 3132814 to 1bc2a18 on November 28, 2024 19:38
slaren marked this pull request as ready for review on November 28, 2024 19:49
slaren force-pushed the sl/dl-backend-3 branch 3 times, most recently from 436f36a to 273d8a0 on November 28, 2024 23:18
github-actions bot added the devops label on Nov 28, 2024
Review thread on ggml/src/ggml-cpu/amx/common.h (outdated, resolved)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
slaren merged commit 7cc2d2c into master on Nov 29, 2024 (49 checks passed)
slaren deleted the sl/dl-backend-3 branch on November 29, 2024 20:55
Djip007 mentioned this pull request on Nov 30, 2024
Comment on lines +76 to +77
/* .get_tensor = */ ggml_backend_amx_buffer_get_tensor,
/* .cpy_tensor = */ ggml_backend_amx_buffer_cpy_tensor,
Contributor:
Can this function be called now that this is an extra CPU buffer type where only weights can be stored?

slaren (Collaborator, author):
It's not called at the moment, but it doesn't hurt to have it.

Contributor:
OK, I was just asking so as not to break everything in my PR.
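
For readers outside the codebase, the two quoted lines above fill slots in the buffer-interface vtable. A reduced C++ sketch of the pattern, with invented names and signatures rather than the real ggml_backend_buffer_i, shows why keeping an implementation around is harmless:

#include <cstddef>

// Reduced sketch (hypothetical names, not the actual ggml API): a vtable of
// optional operations. A backend may leave an entry null when the operation
// is unsupported; keeping a working implementation, as the AMX buffer does,
// costs nothing and stays correct if generic code ever calls it.
struct buffer_iface {
    void (*get_tensor)(const void * tensor, void * data, size_t offset, size_t size);
    bool (*cpy_tensor)(const void * src_tensor, void * dst_tensor);
};

static void amx_get_tensor(const void * /*tensor*/, void * /*data*/,
                           size_t /*offset*/, size_t /*size*/) {
    // ... copy the requested byte range out of the (possibly repacked) buffer ...
}

static bool amx_cpy_tensor(const void * /*src_tensor*/, void * /*dst_tensor*/) {
    return false;  // e.g. report failure when the layouts are incompatible
}

static const buffer_iface amx_iface = {
    /* .get_tensor = */ amx_get_tensor,
    /* .cpy_tensor = */ amx_cpy_tensor,  // unused today, but harmless to keep;
                                         // nullptr would mark it unsupported
};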

int tbegin, tend;
balance211(n, params->nth, params->ith, tbegin, tend);
f(tbegin, tend);
ggml_barrier(params->threadpool); // TODO: might not always be needed
Contributor:
Why not simply remove the ggml_barrier here and add it after parallel_for_ggml only where needed?

slaren (Collaborator, author) commented Dec 1, 2024:

It's not really needed in any case; I will just remove it in #10606.
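
For context, a minimal sketch of the helper pair under discussion. The names follow the quoted snippet, but the signatures are assumptions, not the actual ggml internals:

#include <algorithm>

// balance211-style partition: split n work items across nth threads so the
// first (n % nth) threads take one extra item each.
static void balance211(int n, int nth, int ith, int & tbegin, int & tend) {
    int chunk = n / nth;
    int rem   = n % nth;
    tbegin = ith * chunk + std::min(ith, rem);
    tend   = tbegin + chunk + (ith < rem ? 1 : 0);
}

template <typename F>
static void parallel_for_ggml(int nth, int ith, int n, const F & f) {
    int tbegin, tend;
    balance211(n, nth, ith, tbegin, tend);
    f(tbegin, tend);
    // Per the discussion above, no trailing ggml_barrier here: the caller
    // synchronizes only where a later stage depends on all partitions.
}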

@@ -2379,7 +2400,7 @@ void ggml_backend_amx_mul_mat(ggml_backend_amx_context * ctx, struct ggml_tensor
     const int MB = div_up(M, BLOCK_M);
     const int NB = div_up(N, BLOCK_N);

-    parallel_for(n_threads, MB * NB, [&](int begin, int end) {
+    parallel_for_ggml(params, MB * NB, [&](int begin, int end) {
Contributor:
Is this fp16 matmul faster than the LLAMAFILE fp16 sgemm?
If not, now that this backend is part of the CPU backend, it may not be needed.

slaren (Collaborator, author) commented Dec 1, 2024:

Yes, I believe it is significantly faster than llamafile sgemm. In my tests it's about 40% faster at pp512.

Djip007 (Contributor) commented Dec 1, 2024:

Strange, it doesn't use AMX and yet it looks like the same approach as tinyblas. I'll have to look at this more closely 😎

slaren (Collaborator, author):

It does use AVX512, although the implementation looks a lot simpler than in sgemm.

template <int BLOCK_M, int BLOCK_N, int BLOCK_K>
struct tinygemm_kernel_avx<float, ggml_fp16_t, float, BLOCK_M, BLOCK_N, BLOCK_K> {
    static void apply(int K, const float * RESTRICT A, const ggml_fp16_t * RESTRICT B, float * RESTRICT C, int ldc) {
        constexpr int ROWS = BLOCK_M;
        constexpr int COLS = BLOCK_N;
        assert(BLOCK_K == 16);

        __m512 va;
        __m512 vb[COLS];
        __m512 vc[ROWS * COLS];

        auto loadc = [&](auto idx) {
            vc[idx] = _mm512_setzero_ps();
        };
        Unroll<ROWS * COLS>{}(loadc);

        auto compute = [&](auto idx, auto k) {
            constexpr int row = idx / COLS;
            constexpr int col = idx % COLS;
            if constexpr (col == 0) {
                va = _mm512_loadu_ps(A + row * K + k);
            }
            if constexpr (row == 0) {
                vb[col] = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)(B + col * K + k)));
            }
            vc[idx] = _mm512_fmadd_ps(va, vb[col], vc[idx]);
        };

        for (int k = 0; k < K; k += 16) {
            Unroll<ROWS * COLS>{}(compute, k);
        }

        auto storec = [&](auto idx) {
            constexpr int row = idx / COLS;
            constexpr int col = idx % COLS;
            C[row * ldc + col] = _mm512_reduce_add_ps(vc[idx]);
        };
        Unroll<ROWS * COLS>{}(storec);
    }
};
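
The Unroll<N> functor used above is what turns those lambdas into a fully unrolled compile-time loop. A plausible C++17 sketch follows; the real definition lives in the AMX sources and may differ. The key point is that the callback receives the index as a std::integral_constant, which is why constexpr int row = idx / COLS; inside the lambdas compiles:

#include <type_traits>
#include <utility>

// Sketch of an Unroll helper (assumed shape, not the exact ggml code):
// calls f(0, args...), f(1, args...), ..., f(N-1, args...), where each index
// is a std::integral_constant and so stays usable in constant expressions.
template <int N>
struct Unroll {
    template <typename Func, typename... Args>
    void operator()(const Func & f, Args... args) const {
        impl(f, std::make_integer_sequence<int, N>{}, args...);
    }

private:
    template <typename Func, int... Is, typename... Args>
    static void impl(const Func & f, std::integer_sequence<int, Is...>, Args... args) {
        (f(std::integral_constant<int, Is>{}, args...), ...);  // C++17 fold expression
    }
};

This is also the detail behind the C++17 discussion further down: fold expressions and if constexpr make the pattern straightforward, whereas a C++11 version must fall back on recursive template instantiation that compilers do not always unroll cleanly.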

Contributor:

template <int RM, int RN>
NOINLINE void gemm(int64_t m0, int64_t m, int64_t n0, int64_t n) {
    int64_t ytiles = (m - m0) / RM;
    int64_t xtiles = (n - n0) / RN;
    int64_t tiles = xtiles * ytiles;
    int64_t duty = (tiles + nth - 1) / nth;
    int64_t start = duty * ith;
    int64_t end = start + duty;
    if (end > tiles)
        end = tiles;
    for (int64_t job = start; job < end; ++job) {
        int64_t ii = m0 + job / xtiles * RM;
        int64_t jj = n0 + job % xtiles * RN;
        D Cv[RN][RM] = {};
        for (int64_t l = 0; l < k; l += KN)
            for (int64_t j = 0; j < RN; ++j)
                for (int64_t i = 0; i < RM; ++i)
                    Cv[j][i] = madd(load<V>(A + lda * (ii + i) + l),
                                    load<V>(B + ldb * (jj + j) + l),
                                    Cv[j][i]);
        for (int64_t j = 0; j < RN; ++j)
            for (int64_t i = 0; i < RM; ++i)
                C[ldc * (jj + j) + (ii + i)] = hsum(Cv[j][i]);
    }
}

Collaborator:

I did not put AMX in the old patch because the 6th gen Xeon had not been released at that moment. Since it is on sale now, I think it is good to add amx-f16 to the gemm.

ochafik mentioned this pull request on Dec 6, 2024
ericcurtin (Collaborator):

👍 for "Change C++ standard to C++17"... Only ancient platforms don't support that these days, and probably ones you don't want to run AI workloads on anyway...

mingfeima (Collaborator):

Big thumbs up for "Switch to C++17". Actually, the forced Unroll from my AMX patch does not work properly on C++11; I checked the assembly and confirmed that with C++17 the unrolling works as intended.

slaren (Collaborator, author) commented Dec 13, 2024:

I also changed the parameters of the functions called by Unroll to constexpr in #10606, following the comments that you had left in the code, but I didn't check the generated code. Good to know that it works as expected.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Labels
devops, examples, ggml, server, testing, Vulkan
5 participants