ggml : move AMX to the CPU backend #10570
Conversation
slaren commented on Nov 28, 2024 (edited)
- Move AMX code to CPU backend
- Enable disabled types in AMX backend (Q4_1, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_XS)
- Change C++ standard to C++17
- Enable ccache for HIP windows CI
/* .get_tensor = */ ggml_backend_amx_buffer_get_tensor,
/* .cpy_tensor = */ ggml_backend_amx_buffer_cpy_tensor,
Can this function even be called now that this is an extra CPU buffer type and only weights can be stored in it?
It's not called at the moment, but it doesn't hurt to have it.
OK, I was just asking so as not to break everything in my PR.
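For context on what these entries are: a ggml buffer type implements a small table of function pointers, and `get_tensor` / `cpy_tensor` are two slots in that table. A paraphrased sketch of the interface (member names follow `ggml-backend-impl.h` around this time; the exact set of members may differ between versions):

```cpp
#include <cstddef>
#include <cstdint>

// Opaque types, forward-declared for the sketch.
struct ggml_tensor;
typedef struct ggml_backend_buffer * ggml_backend_buffer_t;

// Function-pointer table implemented by each buffer type.
struct ggml_backend_buffer_i {
    void   (*free_buffer)(ggml_backend_buffer_t buffer);
    void * (*get_base)   (ggml_backend_buffer_t buffer);
    void   (*init_tensor)(ggml_backend_buffer_t buffer, ggml_tensor * tensor);
    void   (*set_tensor) (ggml_backend_buffer_t buffer, ggml_tensor * tensor,
                          const void * data, size_t offset, size_t size);
    // The two entries quoted above: read tensor data back out of the
    // buffer, and copy a tensor between buffers.
    void   (*get_tensor) (ggml_backend_buffer_t buffer, const ggml_tensor * tensor,
                          void * data, size_t offset, size_t size);
    bool   (*cpy_tensor) (ggml_backend_buffer_t buffer, const ggml_tensor * src,
                          ggml_tensor * dst);
    void   (*clear)      (ggml_backend_buffer_t buffer, uint8_t value);
};
```

Keeping `get_tensor`/`cpy_tensor` populated means reads and copies out of the AMX buffer keep working even if no current code path exercises them.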
int tbegin, tend;
balance211(n, params->nth, params->ith, tbegin, tend);
f(tbegin, tend);
ggml_barrier(params->threadpool); // TODO: might not always be needed
Why not simply remove the ggml_barrier here and add it after parallel_for_ggml where needed?
It's not really needed in any case; I will just remove it in #10606.
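For context, the quoted helper splits `n` work items across the ggml threadpool with a balance211-style partition and then runs the kernel on this thread's slice. A minimal self-contained sketch of that split (assumed semantics, matching the usual balance211 convention; the real helper lives in the AMX code):

```cpp
#include <cstdio>

// Split n items among nth threads as evenly as possible: every thread
// gets n / nth items, and the first n % nth threads get one extra.
static void balance211(int n, int nth, int ith, int & tbegin, int & tend) {
    const int base = n / nth;
    const int rem  = n % nth;
    tbegin = ith * base + (ith < rem ? ith : rem);
    tend   = tbegin + base + (ith < rem ? 1 : 0);
}

int main() {
    const int n = 10, nth = 4;
    for (int ith = 0; ith < nth; ++ith) {
        int tbegin, tend;
        balance211(n, nth, ith, tbegin, tend);
        std::printf("thread %d: [%d, %d)\n", ith, tbegin, tend);
    }
    // prints ranges of sizes 3, 3, 2, 2
}
```

Dropping the trailing `ggml_barrier` then simply makes the caller responsible for synchronizing before any step that consumes all slices.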
@@ -2379,7 +2400,7 @@ void ggml_backend_amx_mul_mat(ggml_backend_amx_context * ctx, struct ggml_tensor
     const int MB = div_up(M, BLOCK_M);
     const int NB = div_up(N, BLOCK_N);

-    parallel_for(n_threads, MB * NB, [&](int begin, int end) {
+    parallel_for_ggml(params, MB * NB, [&](int begin, int end) {
Is this fp16 matmul faster than the llamafile fp16 sgemm?
If not, now that this is part of the CPU backend, it may not be needed.
Yes, I believe it is significantly faster than llamafile sgemm. In my tests it's about 40% faster at pp512.
Strange, it doesn't use AMX, and yet it looks like the same approach as tinyblas. I'll have to look at this more closely 😎
It does use AVX512, although the implementation looks a lot simpler than in sgemm.
llama.cpp/ggml/src/ggml-cpu/amx/mmq.cpp
Lines 1332 to 1372 in 3420909
template <int BLOCK_M, int BLOCK_N, int BLOCK_K>
struct tinygemm_kernel_avx<float, ggml_fp16_t, float, BLOCK_M, BLOCK_N, BLOCK_K> {
    static void apply(int K, const float * RESTRICT A, const ggml_fp16_t * RESTRICT B, float * RESTRICT C, int ldc) {
        constexpr int ROWS = BLOCK_M;
        constexpr int COLS = BLOCK_N;
        assert(BLOCK_K == 16);

        __m512 va;
        __m512 vb[COLS];
        __m512 vc[ROWS * COLS];

        auto loadc = [&](auto idx) {
            vc[idx] = _mm512_setzero_ps();
        };
        Unroll<ROWS * COLS>{}(loadc);

        auto compute = [&](auto idx, auto k) {
            constexpr int row = idx / COLS;
            constexpr int col = idx % COLS;

            if constexpr (col == 0) {
                va = _mm512_loadu_ps(A + row * K + k);
            }
            if constexpr (row == 0) {
                vb[col] = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)(B + col * K + k)));
            }
            vc[idx] = _mm512_fmadd_ps(va, vb[col], vc[idx]);
        };

        for (int k = 0; k < K; k += 16) {
            Unroll<ROWS * COLS>{}(compute, k);
        }

        auto storec = [&](auto idx) {
            constexpr int row = idx / COLS;
            constexpr int col = idx % COLS;
            C[row * ldc + col] = _mm512_reduce_add_ps(vc[idx]);
        };
        Unroll<ROWS * COLS>{}(storec);
    }
};
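The `Unroll<N>{}(f, ...)` pattern above forces compile-time expansion of the lambda over indices `0..N-1`, so that `idx` stays usable inside `if constexpr`. A hypothetical sketch of such a helper (the real definition lives elsewhere in mmq.cpp), which also illustrates why the C++17 bump is convenient here, since it relies on fold expressions:

```cpp
#include <utility>

template <int N>
struct Unroll {
    template <typename Fn, typename... Args>
    void operator()(const Fn & f, Args... args) const {
        expand(std::make_integer_sequence<int, N>{}, f, args...);
    }

private:
    template <typename Fn, int... Is, typename... Args>
    static void expand(std::integer_sequence<int, Is...>, const Fn & f, Args... args) {
        // Fold expression: expands to f(0, ...), f(1, ...), ..., f(N-1, ...),
        // each index wrapped in an integral_constant so it remains constexpr.
        (f(std::integral_constant<int, Is>{}, args...), ...);
    }
};
```

With this shape, the `compute` lambda receives `idx` as an `std::integral_constant<int, I>`, which converts to `int` in constant expressions, so `constexpr int row = idx / COLS;` compiles.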
llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp
Lines 421 to 445 in 3420909
template <int RM, int RN>
NOINLINE void gemm(int64_t m0, int64_t m, int64_t n0, int64_t n) {
    int64_t ytiles = (m - m0) / RM;
    int64_t xtiles = (n - n0) / RN;
    int64_t tiles = xtiles * ytiles;
    int64_t duty = (tiles + nth - 1) / nth;
    int64_t start = duty * ith;
    int64_t end = start + duty;
    if (end > tiles)
        end = tiles;
    for (int64_t job = start; job < end; ++job) {
        int64_t ii = m0 + job / xtiles * RM;
        int64_t jj = n0 + job % xtiles * RN;
        D Cv[RN][RM] = {};
        for (int64_t l = 0; l < k; l += KN)
            for (int64_t j = 0; j < RN; ++j)
                for (int64_t i = 0; i < RM; ++i)
                    Cv[j][i] = madd(load<V>(A + lda * (ii + i) + l),
                                    load<V>(B + ldb * (jj + j) + l),
                                    Cv[j][i]);
        for (int64_t j = 0; j < RN; ++j)
            for (int64_t i = 0; i < RM; ++i)
                C[ldc * (jj + j) + (ii + i)] = hsum(Cv[j][i]);
    }
}
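For comparison, this partitioning hands each thread a fixed `ceil(tiles / nth)` slice of the linearized `RM x RN` output tiles, rather than a balance211-style split. A toy, self-contained rendering of the job-to-tile mapping (sizes and thread count are made up):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t m = 8, n = 6, RM = 4, RN = 3;  // toy matrix and tile sizes
    const int64_t ytiles = m / RM;               // tile rows
    const int64_t xtiles = n / RN;               // tile columns
    const int64_t tiles  = xtiles * ytiles;      // linearized job count
    const int64_t nth    = 2;                    // pretend two threads

    const int64_t duty = (tiles + nth - 1) / nth; // jobs per thread, rounded up
    for (int64_t ith = 0; ith < nth; ++ith) {
        const int64_t start = duty * ith;
        const int64_t end   = start + duty > tiles ? tiles : start + duty;
        for (int64_t job = start; job < end; ++job) {
            // job index -> top-left corner (ii, jj) of the output tile
            const int64_t ii = job / xtiles * RM;
            const int64_t jj = job % xtiles * RN;
            std::printf("thread %lld computes tile at (%lld, %lld)\n",
                        (long long) ith, (long long) ii, (long long) jj);
        }
    }
}
```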
I did not put AMX in the old patch because the 6th gen Xeon was not released at that moment. Since it is on sale now, I think it is good to add amx-f16 to the gemm.
👍 for "Change C++ standard to C++17"... Only ancient platforms don't support that these days, and probably ones you don't want to run AI workloads on anyway...
Big thumbs up for "Switch to C++17". Actually the forced
I also changed the parameters of the functions called by
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>