sync : ggml #2237

Merged — 102 commits merged on Jun 16, 2024

Changes from 1 commit

Commits (102)
7bb8dab
ggml : add `ggml_upscale_ext` (ggml/814)
balisujohn May 15, 2024
215bcb3
Add missing " (llama/7303)
AidanBeltonS May 15, 2024
1c52d7f
ggml : tag ggml_tensor::backend as deprecated (llama/7290)
slaren May 15, 2024
a32324a
Avoid unnecessarily disabling CUDA graphs (llama/7302)
agray3 May 15, 2024
d3f4ab6
ggml : use dynamic thread scheduling for matrix multiplication (llama…
kunnis May 15, 2024
88b9d3b
Add support for properly optimized Windows ARM64 builds with LLVM and…
max-krasnyansky May 16, 2024
f1c281a
rpc : add command line arg for specifying backend memory
rgerganov May 15, 2024
831cf54
ggml : rewrite silu and softmax for cpu (llama/7154)
jart May 17, 2024
b321ba3
ggml-quants, llama : removed excess checks (llama/7274)
GermanAizek May 17, 2024
d64e133
rpc : set SO_REUSEADDR for the server socket (llama/7320)
rgerganov May 17, 2024
4fea7a9
CUDA: faster large batch FA without tensor cores (llama/7314)
JohannesGaessler May 17, 2024
653af39
ggml : fix quants nans when all the group weights are very close to z…
slaren May 18, 2024
449de6a
Update and fix Vulkan soft_max and argsort implementations (llama/7237)
0cc4m May 18, 2024
280208a
cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263)
Engininja2 May 18, 2024
e00ace4
CUDA: deduplicate FlashAttention code (llama/7352)
JohannesGaessler May 18, 2024
e211897
android : use "ci-android" branch for CI (llama/7341)
ggerganov May 18, 2024
dfe6b64
Capture CUDA logging output (llama/7298)
fraxy-v May 18, 2024
0d54e78
cuda : clear error after buffer allocation failure (llama/7376)
slaren May 19, 2024
570d7fd
ggml: implement quantized KV cache for FA (llama/7372)
JohannesGaessler May 19, 2024
acd5935
ggml : fix another case of quants nans (llama/7387)
slaren May 19, 2024
9bbf65b
Vulkan Embedding Fix (llama/7360)
0cc4m May 19, 2024
7db2a18
Add provisions for windows support for BF16 code including CMake prov…
Srihari-mcw May 20, 2024
80e2b35
ggml : add loongarch lsx and lasx support (llama/6454)
junchao-loongson May 20, 2024
85bbb06
ggml-opencl, llama: using reserve() if count already known (llama/7272)
GermanAizek May 20, 2024
cc50ea0
Update SYCL upscale operation (llama/7321)
AidanBeltonS May 20, 2024
2668d57
rpc : track allocated buffers (llama/7411)
rgerganov May 20, 2024
ed7eb40
CUDA: deduplicate mmq code (llama/7397)
JohannesGaessler May 21, 2024
aa29372
CUDA: fix unused warning in mmq.cu (llama/7442)
JohannesGaessler May 21, 2024
d2aa1ce
metal : handle F16 inf values, fix FA partial offload (llama/7434)
ggerganov May 21, 2024
1ffabc8
llama : add phi3 128K model support (llama/7225)
liuwei-git May 21, 2024
eca5fb8
cuda : fix rope + add tests (llama/7452)
ggerganov May 22, 2024
4228fb7
CUDA: remove incorrect precision check (llama/7454)
JohannesGaessler May 22, 2024
61d5a1e
cuda : fix compile warning (llama/7454)
ggerganov May 22, 2024
b08c0b0
CUDA: fix FA out-of-bounds writes (llama/7465)
JohannesGaessler May 22, 2024
f366504
CUDA: fix FA out-of-bounds reads (llama/7479)
JohannesGaessler May 22, 2024
a8f67b9
Update vulkan rope implementation to support frequency factors (llama…
0cc4m May 23, 2024
c2be650
ggml : drop support for QK_K=64 (llama/7473)
ggerganov May 23, 2024
1470bad
ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463)
ggerganov May 23, 2024
22d4b17
ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0)
ggerganov May 23, 2024
024b58e
ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama…
msy-kato May 25, 2024
e7b39d8
ggml : restore ggml_rope_xpos_inplace (ggml/0)
ggerganov May 26, 2024
e934ba5
metal : disable FA kernel for HS=256 (llama/7556)
ggerganov May 27, 2024
0055948
metal : add GGML_OP_REPEAT kernels (llama/7557)
ggerganov May 27, 2024
9b0dbe8
Add freq factors (llama/7495)
AidanBeltonS May 27, 2024
b725bb2
Fix q_xxs using mul_mat_q (llama/7459)
AidanBeltonS May 27, 2024
b323cfc
Allow multiple copy function pointers for CUDA graph kernel param upd…
agray3 May 27, 2024
a133206
update HIP_UMA #7399 (llama/7414)
Djip007 May 27, 2024
d6d2508
ggml : generalize GGML_OP_CONCAT (llama/7563)
ggerganov May 28, 2024
023020c
fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436)
arthw May 28, 2024
42a9c95
rpc : resource management rework (llama/7562)
rgerganov May 28, 2024
7cc2ff0
vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE …
Adriankhl May 28, 2024
9ff003f
sycl : fix assert (llama/7563)
ggerganov May 28, 2024
eeb929a
Align GEMM dispatch (llama/7566)
airMeng May 28, 2024
f9df59a
ggml : fix typo in ggml.c (llama/7603)
zhouwg May 29, 2024
7e95420
examples : adapt to new ggml_concat (ggml/0)
ggerganov May 29, 2024
d53ab4b
ggml : use atomic_flag for critical section (llama/7598)
slaren May 29, 2024
78b74d5
llama-bench : add support for the RPC backend (llama/7435)
rgerganov May 29, 2024
f5de5d7
cuda : non-cont concat support (llama/7610)
ggerganov May 29, 2024
fa6b9ed
ggml : fix YARN + add tests + add asserts (llama/7617)
ggerganov May 29, 2024
7382fec
metal : add missing asserts (llama/7617)
ggerganov May 29, 2024
e3e1a98
metal : remove invalid asserts (llama/7617)
ggerganov May 29, 2024
55de6e0
ggml : fix loongarch build (O2 issue) (llama/7636)
junchao-loongson May 30, 2024
79088fe
faster avx512 exp implementation (llama/7551)
chriselrod May 30, 2024
b79eca7
ggml : fix loongson compile warnings (llama/7537)
ggerganov May 31, 2024
49c5ccb
CUDA: quantized KV support for FA vec (llama/7527)
JohannesGaessler Jun 1, 2024
5758ffa
CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)
JohannesGaessler Jun 1, 2024
bc6158d
Fix FlashAttention debug test, FP32 assert (llama/7684)
JohannesGaessler Jun 1, 2024
5f6620e
fix bug introduced in using calloc (llama/7701)
airlied Jun 2, 2024
f8b7a7f
kompute : implement op_getrows_f32 (llama/6403)
woachk Jun 3, 2024
9e95aa1
Vulkan Mixture of Experts (MoE) support (llama/7628)
0cc4m Jun 3, 2024
784733d
ggml : use OpenMP as a thread pool (llama/7606)
msy-kato Jun 3, 2024
0a6fd4e
llama : offload to RPC in addition to other backends (llama/7640)
rgerganov Jun 3, 2024
1b34416
ggml : prevent builds with -ffinite-math-only (llama/7726)
ggerganov Jun 4, 2024
69982c7
ggml : remove OpenCL (llama/7735)
ggerganov Jun 4, 2024
bf0ff58
Allow number of nodes in CUDA graph to change (llama/7738)
agray3 Jun 4, 2024
809d0f4
ggml : refactor rope norm/neox (llama/7634)
ggerganov Jun 5, 2024
048f479
CUDA: refactor mmq, dmmv, mmvq (llama/7716)
JohannesGaessler Jun 5, 2024
c5f01ea
fix softmax r2r result wrong issue (llama/7811)
pengxin99 Jun 7, 2024
e604adb
vulkan : reuse parent extra for views (llama/7806)
slaren Jun 7, 2024
bb7a50f
CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)
JohannesGaessler Jun 9, 2024
fa0b692
use the correct SYCL context for host USM allocations (llama/7777)
bashbaug Jun 10, 2024
b199187
CUDA: use tensor cores for MMQ (llama/7676)
JohannesGaessler Jun 10, 2024
28c0ccf
CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)
JohannesGaessler Jun 11, 2024
b30b2f4
Update Vulkan RoPE implementation (llama/7818)
0cc4m Jun 11, 2024
bfb2212
vulkan: select only one device for single gpu with multiple drivers (…
Adriankhl Jun 11, 2024
035d655
ggml : improve ggml_is_contiguous logic (llama/7856)
ggerganov Jun 12, 2024
3544c18
tests : add non-cont unary tests (llama/7857)
ggerganov Jun 12, 2024
e8f4fa0
CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)
JohannesGaessler Jun 12, 2024
ad6b8d5
move BLAS to a separate backend (llama/6210)
slaren Jun 13, 2024
08078b9
rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)
rgerganov Jun 13, 2024
f8ac7b1
metal : utilize max shared memory for mul_mat_id (llama/7935)
ggerganov Jun 14, 2024
8abc251
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
JohannesGaessler Jun 14, 2024
8efd6d6
remove global variables (llama/7710)
airMeng Jun 15, 2024
d2744cc
ggml : remove duplicate include of ggml-common.h (ggml/853)
danbev Jun 16, 2024
ce33d6f
ggml : fix and optimize ppc64le (ggml/849)
penghongbo Jun 16, 2024
92dc0b7
sync : ggml
ggerganov Jun 16, 2024
b891050
cmake : fix CUDA build (#0)
ggerganov Jun 16, 2024
16d44bd
talk-llama : sync llama.cpp
ggerganov Jun 16, 2024
c711647
cuda : enable CUDA graphs (#0)
ggerganov Jun 16, 2024
7252394
sycl : sync (#0)
ggerganov Jun 16, 2024
b51ff56
ggml : remove OpenCL (#0)
ggerganov Jun 16, 2024
f5b667d
cmake : fix sycl build (#0)
ggerganov Jun 16, 2024
Add freq factors (llama/7495)
AidanBeltonS authored and ggerganov committed Jun 16, 2024
commit 9b0dbe883131b17ee635ceb35a7d34f853c49db9
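This commit brings RoPE frequency-factor ("freq factors") support, added for Phi-3 128K in llama/7225, to the SYCL backend: rope_neox gains a has_freq_facs template flag and a freq_factors pointer, the base angle is divided by the per-dimension factor, and ggml_sycl_op_rope now reads dst->src[2] instead of asserting that it is null. As a reading aid before the diff, below is a minimal plain-C++ sketch of the per-element angle the kernel computes after this change; the helper name is invented for illustration, and the real device code uses dpct::pow and SYCL work-item indexing rather than this standalone function.

#include <cmath>
#include <cstdint>

// Hypothetical helper, not part of the patch: the theta_base that rope_neox
// evaluates per element once freq factors are wired in. `col` and `ic` stand
// in for the kernel's column / rotated-dimension indices.
static float rope_neox_theta_base(int32_t p, int col, int ic,
                                  float freq_scale, float theta_scale,
                                  const float * freq_factors /* may be nullptr */) {
    // A null freq_factors pointer degenerates to a divisor of 1.0f,
    // reproducing the pre-patch behaviour.
    const float freq_factor = freq_factors ? freq_factors[ic / 2] : 1.0f;
    return p * freq_scale * std::pow(theta_scale, col / 2.0f) / freq_factor;
}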
ggml-sycl.cpp — 94 changes: 57 additions & 37 deletions
@@ -8830,12 +8830,11 @@ static void rope(
     dst[i + 1] = x0*sin_theta + x1*cos_theta;
 }
 
-template<typename T, bool has_pos>
+template<typename T, bool has_pos, bool has_freq_facs>
 static void rope_neox(
     const T * x, T * dst, int ncols, int n_dims, const int32_t * pos, float freq_scale, int p_delta_rows,
-    float ext_factor, float attn_factor, rope_corr_dims corr_dims, float theta_scale, float inv_ndims
-    ,
-    const sycl::nd_item<3> &item_ct1) {
+    float ext_factor, float attn_factor, rope_corr_dims corr_dims, float theta_scale, float inv_ndims,
+    const float * freq_factors, const sycl::nd_item<3> &item_ct1) {
     const int col = 2 * (item_ct1.get_local_range(1) * item_ct1.get_group(1) +
                          item_ct1.get_local_id(1));
 
@@ -8863,8 +8862,10 @@ static void rope_neox(
     float cur_rot = inv_ndims * ic - ib;
 
     const int p = has_pos ? pos[i2] : 0;
+    const float freq_factor = has_freq_facs ? freq_factors[ic/2] : 1.0f;
+
     const float theta_base =
-        p * freq_scale * dpct::pow(theta_scale, col / 2.0f);
+        p * freq_scale * dpct::pow(theta_scale, col / 2.0f)/freq_factor;
 
     float cos_theta, sin_theta;
     rope_yarn(theta_base, freq_scale, corr_dims, cur_rot, ext_factor, attn_factor, &cos_theta, &sin_theta);
@@ -12413,7 +12414,7 @@ static void rope_neox_sycl(const T *x, T *dst, int ncols, int n_dims, int nrows,
                            const int32_t *pos, float freq_scale,
                            int p_delta_rows, float freq_base, float ext_factor,
                            float attn_factor, rope_corr_dims corr_dims,
-                           dpct::queue_ptr stream) {
+                           const float * freq_factors, dpct::queue_ptr stream) {
     GGML_ASSERT(ncols % 2 == 0);
     const sycl::range<3> block_dims(1, SYCL_ROPE_BLOCK_SIZE, 1);
     const int num_blocks_x = (ncols + 2*SYCL_ROPE_BLOCK_SIZE - 1) / (2*SYCL_ROPE_BLOCK_SIZE);
@@ -12423,38 +12424,48 @@ static void rope_neox_sycl(const T *x, T *dst, int ncols, int n_dims, int nrows,
     const float inv_ndims = -1.0f / n_dims;
 
     if (pos == nullptr) {
-        /*
-        DPCT1049:42: The work-group size passed to the SYCL kernel may exceed
-        the limit. To get the device limit, query
-        info::device::max_work_group_size. Adjust the work-group size if needed.
-        */
         dpct::has_capability_or_fail(stream->get_device(),
                                      {sycl::aspect::fp16});
 
-        stream->parallel_for(
-            sycl::nd_range<3>(block_nums * block_dims, block_dims),
-            [=](sycl::nd_item<3> item_ct1) {
-                rope_neox<T, false>(x, dst, ncols, n_dims, pos, freq_scale,
-                                    p_delta_rows, ext_factor, attn_factor,
-                                    corr_dims, theta_scale, inv_ndims,
-                                    item_ct1);
-            });
+        if (freq_factors == nullptr) {
+            stream->parallel_for(
+                sycl::nd_range<3>(block_nums * block_dims, block_dims),
+                [=](sycl::nd_item<3> item_ct1) {
+                    rope_neox<T, false, false>(x, dst, ncols, n_dims, pos, freq_scale,
+                                               p_delta_rows, ext_factor, attn_factor,
+                                               corr_dims, theta_scale, inv_ndims, freq_factors,
+                                               item_ct1);
+                });
+        } else {
+            stream->parallel_for(
+                sycl::nd_range<3>(block_nums * block_dims, block_dims),
+                [=](sycl::nd_item<3> item_ct1) {
+                    rope_neox<T, false, true>(x, dst, ncols, n_dims, pos, freq_scale,
+                                              p_delta_rows, ext_factor, attn_factor,
+                                              corr_dims, theta_scale, inv_ndims, freq_factors,
+                                              item_ct1);
+                });
+        }
     } else {
-        /*
-        DPCT1049:43: The work-group size passed to the SYCL kernel may exceed
-        the limit. To get the device limit, query
-        info::device::max_work_group_size. Adjust the work-group size if needed.
-        */
         dpct::has_capability_or_fail(stream->get_device(),
                                      {sycl::aspect::fp16});
 
-        stream->parallel_for(
-            sycl::nd_range<3>(block_nums * block_dims, block_dims),
-            [=](sycl::nd_item<3> item_ct1) {
-                rope_neox<T, true>(x, dst, ncols, n_dims, pos, freq_scale,
-                                   p_delta_rows, ext_factor, attn_factor,
-                                   corr_dims, theta_scale, inv_ndims, item_ct1);
-            });
+        if (freq_factors == nullptr) {
+            stream->parallel_for(
+                sycl::nd_range<3>(block_nums * block_dims, block_dims),
+                [=](sycl::nd_item<3> item_ct1) {
+                    rope_neox<T, true, false>(x, dst, ncols, n_dims, pos, freq_scale,
+                                              p_delta_rows, ext_factor, attn_factor,
+                                              corr_dims, theta_scale, inv_ndims, freq_factors, item_ct1);
+                });
+        } else {
+            stream->parallel_for(
+                sycl::nd_range<3>(block_nums * block_dims, block_dims),
+                [=](sycl::nd_item<3> item_ct1) {
+                    rope_neox<T, true, true>(x, dst, ncols, n_dims, pos, freq_scale,
+                                             p_delta_rows, ext_factor, attn_factor,
+                                             corr_dims, theta_scale, inv_ndims, freq_factors, item_ct1);
+                });
+        }
     }
 }
 
@@ -13986,9 +13997,7 @@ inline void ggml_sycl_op_rope(const ggml_tensor *src0, const ggml_tensor *src1,
                              ggml_tensor *dst, const float *src0_dd,
                              const float *src1_dd, float *dst_dd,
                              const dpct::queue_ptr &main_stream) {
-#pragma message("TODO: implement phi3 frequency factors support")
-#pragma message(" https://github.com/ggerganov/llama.cpp/pull/7225")
-    GGML_ASSERT(dst->src[2] == nullptr && "phi3 frequency factors not implemented yet");
+    const ggml_tensor * src2 = dst->src[2];
 
     GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16);
     GGML_ASSERT( dst->type == GGML_TYPE_F32 || dst->type == GGML_TYPE_F16);
@@ -14014,6 +14023,7 @@ inline void ggml_sycl_op_rope(const ggml_tensor *src0, const ggml_tensor *src1,
     memcpy(&beta_fast, (int32_t *) dst->op_params + 9, sizeof(float));
     memcpy(&beta_slow, (int32_t *) dst->op_params + 10, sizeof(float));
 
+    const float * freq_factors = nullptr;
     const int32_t * pos = nullptr;
     if ((mode & 1) == 0) {
         GGML_ASSERT(src1->type == GGML_TYPE_I32);
@@ -14024,6 +14034,16 @@ inline void ggml_sycl_op_rope(const ggml_tensor *src0, const ggml_tensor *src1,
     const bool is_neox = mode & 2;
     const bool is_glm = mode & 4;
 
+    if (is_neox) {
+        pos = (const int32_t *) src1_dd;
+
+        if (src2 != nullptr) {
+            freq_factors = (const float *) src2->data;
+        }
+    } else {
+        GGML_ASSERT(src2 == nullptr && "TODO: freq_factors not implemented for !is_neox");
+    }
+
     rope_corr_dims corr_dims;
     ggml_rope_yarn_corr_dims(n_dims, n_orig_ctx, freq_base, beta_fast, beta_slow, corr_dims.v);
 
@@ -14035,13 +14055,13 @@ inline void ggml_sycl_op_rope(const ggml_tensor *src0, const ggml_tensor *src1,
     if (src0->type == GGML_TYPE_F32) {
         rope_neox_sycl(
             (const float *)src0_dd, (float *)dst_dd, ne00, n_dims, nrows, pos, freq_scale, ne01, freq_base, ext_factor,
-            attn_factor, corr_dims, main_stream
+            attn_factor, corr_dims, freq_factors, main_stream
         );
     } else if (src0->type == GGML_TYPE_F16) {
         rope_neox_sycl((const sycl::half *)src0_dd, (sycl::half *)dst_dd,
                        ne00, n_dims, nrows, pos, freq_scale, ne01,
                        freq_base, ext_factor, attn_factor, corr_dims,
-                       main_stream);
+                       freq_factors, main_stream);
     } else {
         GGML_ASSERT(false);
     }