sycl: Use syclcompat::dp4a #10267

Merged 5 commits on Nov 15, 2024 (changes shown below are from 3 commits)
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -917,7 +917,7 @@ jobs:
shell: bash

env:
- WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/7dff44ba-e3af-4448-841c-0d616c8da6e7/w_BaseKit_p_2024.1.0.595_offline.exe
+ WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b380d914-366b-4b77-a74a-05e3c38b3514/intel-oneapi-base-toolkit-2025.0.0.882_offline.exe
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
steps:
2 changes: 2 additions & 0 deletions docs/backend/SYCL.md
@@ -41,6 +41,8 @@ The following release is verified with good quality:

## News

+ - 2024.11
+   - Use syclcompat to improve the performance on some backends. This requires to use oneAPI 2025.0 or more recent.
Collaborator:

"some backends" sounds strange coming from the SYCL backend itself. Maybe "some platforms" or "some GPUs".

"more recent" -> "newer"

Collaborator (Author):

Reworded this in 1c16516


- 2024.8
- Use oneDNN as the default GEMM library, improve the compatibility for new Intel GPUs.
27 changes: 0 additions & 27 deletions ggml/src/ggml-sycl/dpct/helper.hpp
@@ -1830,33 +1830,6 @@ namespace dpct
: id);
}

- template <typename T>
- sycl::vec<T, 4> extract_and_sign_or_zero_extend4(T val)
- {
-     return sycl::vec<T, 1>(val)
-         .template as<sycl::vec<
-             std::conditional_t<std::is_signed_v<T>, int8_t, uint8_t>, 4>>()
-         .template convert<T>();
- }
-
- template <typename T1, typename T2>
- using dot_product_acc_t =
-     std::conditional_t<std::is_unsigned_v<T1> && std::is_unsigned_v<T2>,
-                        uint32_t, int32_t>;
-
- template <typename T1, typename T2, typename T3>
- inline auto dp4a(T1 a, T2 b, T3 c)
- {
-     dot_product_acc_t<T1, T2> res = c;
-     auto va = extract_and_sign_or_zero_extend4(a);
-     auto vb = extract_and_sign_or_zero_extend4(b);
-     res += va[0] * vb[0];
-     res += va[1] * vb[1];
-     res += va[2] * vb[2];
-     res += va[3] * vb[3];
-     return res;
- }

struct sub_sat
{
template <typename T>

Review thread on the removed dp4a():

Collaborator:

I suggest replacing the dp4a() implementation with syclcompat::dp4a():

  1. No code changes in other modules.
  2. Easy to optimize for different cases in the future if needed.

Collaborator @Alcpz, Nov 13, 2024:

We tried this approach some time ago in a different PR, but it was closed because faster implementations require asm and intrinsics for every backend, and we agreed to limit ourselves to pure SYCL code. Right now, there is no way to get visibility of int intrinsics (dp4a equivalents), and the syclcompat layer shipped as part of oneAPI is trying to bridge that gap (and others) until they are made available through SYCL or an extension. With this approach, backend-specific improvements are kept out of the app itself.

Do you think we could use this PR to agree on what to do with regard to syclcompat? The main problem is that dp4a is a major performance gap compared with other backends, due to the software implementation.

Collaborator @NeoZhangJianyu, Nov 14, 2024:

I think I didn't make my idea clear. I mean that dpct::dp4a() should call syclcompat::dp4a() directly. Other modules would still call dpct::dp4a(), but the code path would forward to syclcompat::dp4a().

Since there is no test data for Intel GPUs yet, if performance turns out to be bad we can add a code branch in dpct::dp4a() that keeps the old code for Intel GPUs.

If all modules call syclcompat::dp4a() directly, as in this PR, it becomes complex to implement the case with more branches later.

Collaborator @Alcpz, Nov 14, 2024:

We have to be careful with branching inside dp4a though, as we would introduce branching inside the kernels. Thanks for the clarification!

Collaborator (Author):

As long as we don't add any branching, I'm fine with wrapping syclcompat::dp4a inside dpct::dp4a. This is done in 3eff3c3. I hope this is what you meant.
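For reference, a minimal sketch of the wrapper agreed on in the thread above, assuming commit 3eff3c3 takes the straightforward forwarding form described there (the actual commit may differ). dpct::dp4a keeps its signature, so call sites in other modules stay unchanged, while the implementation comes from syclcompat, which ships with oneAPI 2025.0:

    namespace dpct {
        // Sketch only: forward dpct::dp4a to syclcompat::dp4a. Deliberately
        // no branching here, per the review discussion, so no divergence is
        // introduced inside kernels; a platform-specific path could be added
        // later if measurements on Intel GPUs call for it.
        template <typename T1, typename T2, typename T3>
        inline auto dp4a(T1 a, T2 b, T3 c) {
            return syclcompat::dp4a(a, b, c);
        }
    } // namespace dpct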
18 changes: 9 additions & 9 deletions ggml/src/ggml-sycl/mmq.cpp
@@ -575,8 +575,8 @@ vec_dot_q2_K_q8_1_impl_mmq(const int *__restrict__ v, const int *__restrict__ u,

#pragma unroll
for (int i = i0; i < i0 + QI8_1/2; ++i) {
- sumi_d_sc = dpct::dp4a(v[i], u[i], sumi_d_sc); // SIMD dot product
- sumi_m = dpct::dp4a(m, u[i],
+ sumi_d_sc = syclcompat::dp4a(v[i], u[i], sumi_d_sc); // SIMD dot product
+ sumi_m = syclcompat::dp4a(m, u[i],
sumi_m); // multiply sum of q8_1 values with m
}

@@ -730,7 +730,7 @@ vec_dot_q3_K_q8_1_impl_mmq(const int *__restrict__ v, const int *__restrict__ u,
int sumi_sc = 0;

for (int i = i0; i < i0 + QI8_1/2; ++i) {
- sumi_sc = dpct::dp4a(v[i], u[i], sumi_sc); // SIMD dot product
+ sumi_sc = syclcompat::dp4a(v[i], u[i], sumi_sc); // SIMD dot product
}

sumi += sumi_sc * scales[i0 / (QI8_1/2)];
@@ -873,7 +873,7 @@ static __dpct_inline__ float vec_dot_q4_K_q8_1_impl_mmq(

#pragma unroll
for (int j = 0; j < QI8_1; ++j) {
- sumi_d = dpct::dp4a((v[j] >> (4 * i)) & 0x0F0F0F0F,
+ sumi_d = syclcompat::dp4a((v[j] >> (4 * i)) & 0x0F0F0F0F,
u[i * QI8_1 + j], sumi_d); // SIMD dot product
}

@@ -1018,7 +1018,7 @@ static __dpct_inline__ float vec_dot_q5_K_q8_1_impl_mmq(

#pragma unroll
for (int j = 0; j < QI8_1; ++j) {
- sumi_d = dpct::dp4a(v[i * QI8_1 + j], u[i * QI8_1 + j],
+ sumi_d = syclcompat::dp4a(v[i * QI8_1 + j], u[i * QI8_1 + j],
sumi_d); // SIMD dot product
}

@@ -1156,14 +1156,14 @@ vec_dot_q6_K_q8_1_impl_mmq(const int *__restrict__ v, const int *__restrict__ u,

#pragma unroll
for (int i = i0; i < i0 + 2; ++i) {
- sumi_d.x() = dpct::dp4a(v[2 * i + 0], u[2 * i + 0],
+ sumi_d.x() = syclcompat::dp4a(v[2 * i + 0], u[2 * i + 0],
sumi_d.x()); // SIMD dot product
- sumi_d.x() = dpct::dp4a(v[2 * i + 1], u[2 * i + 1],
+ sumi_d.x() = syclcompat::dp4a(v[2 * i + 1], u[2 * i + 1],
sumi_d.x()); // SIMD dot product

- sumi_d.y() = dpct::dp4a(v[2 * i + 4], u[2 * i + 4],
+ sumi_d.y() = syclcompat::dp4a(v[2 * i + 4], u[2 * i + 4],
sumi_d.y()); // SIMD dot product
- sumi_d.y() = dpct::dp4a(v[2 * i + 5], u[2 * i + 5],
+ sumi_d.y() = syclcompat::dp4a(v[2 * i + 5], u[2 * i + 5],
sumi_d.y()); // SIMD dot product
}
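For context on what these dp4a calls compute: each one treats its two 32-bit integer operands as four packed 8-bit lanes, multiplies them lane-wise, and accumulates the four products into the 32-bit argument c, which is what the "SIMD dot product" comments refer to. Below is a standalone scalar reference, an illustrative sketch rather than code from the PR (dp4a_ref, main, and the test values are invented for this example, and only signed lanes are shown; the removed dpct helper also handled unsigned operands via dot_product_acc_t):

    #include <cstdint>
    #include <cstdio>

    // Scalar reference for dp4a on signed 8-bit lanes: unpack each 32-bit
    // operand into four bytes, multiply lane-wise, and accumulate into c.
    static int32_t dp4a_ref(int32_t a, int32_t b, int32_t c) {
        for (int i = 0; i < 4; ++i) {
            const int8_t ai = static_cast<int8_t>(a >> (8 * i));
            const int8_t bi = static_cast<int8_t>(b >> (8 * i));
            c += static_cast<int32_t>(ai) * static_cast<int32_t>(bi);
        }
        return c;
    }

    int main() {
        // Lanes of a are 4, 3, 2, 1 and lanes of b are all 1, so the four
        // lane products sum to 10; starting from c = 5, this prints 15.
        printf("%d\n", dp4a_ref(0x01020304, 0x01010101, 5));
        return 0;
    }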
