merge from upstream #48

Merged Dec 11, 2024, 83 commits

Changes from 1 commit

Commits
0f77aae
sycl : offload of get_rows set to 0 (#10432)
Alcpz Nov 29, 2024
4b3242b
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580)
FanShupei Nov 29, 2024
f0678c5
ggml : fix I8MM Q4_1 scaling factor conversion (#10562)
ggerganov Nov 29, 2024
a3a3048
cleanup UI link list (#10577)
slaren Nov 29, 2024
3a8e9af
imatrix : support combine-only (#10492)
robbiemu Nov 29, 2024
b782e5c
server : add more test cases (#10569)
ngxson Nov 29, 2024
7cc2d2c
ggml : move AMX to the CPU backend (#10570)
slaren Nov 29, 2024
0533e7f
vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536)
netrunnereve Nov 30, 2024
abadba0
readme : refresh (#10587)
ggerganov Nov 30, 2024
3e0ba0e
readme : remove old badge
ggerganov Nov 30, 2024
0c39f44
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_…
angt Nov 30, 2024
43957ef
build: update Makefile comments for C++ version change (#10598)
wangqin0 Dec 1, 2024
6acce39
readme : update the usage section with examples (#10596)
ggerganov Dec 1, 2024
86dc11c
server : bind to any port when specified (#10590)
alek3y Dec 1, 2024
3420909
ggml : automatic selection of best CPU backend (#10606)
slaren Dec 1, 2024
5c7a5aa
ci: add error handling for Python venv creation in run.sh (#10608)
wangqin0 Dec 1, 2024
5e1ed95
grammars : add English-only grammar (#10612)
ggerganov Dec 1, 2024
917786f
Add `mistral-v1`, `mistral-v3`, `mistral-v3-tekken` and `mistral-v7` …
jukofyork Dec 1, 2024
4cb003d
contrib : refresh (#10593)
ggerganov Dec 2, 2024
991f8aa
SYCL: Fix and switch to GGML_LOG system instead of fprintf (#10579)
qnixsynapse Dec 2, 2024
64ed209
server: Add "tokens per second" information in the backend (#10548)
lhpqaq Dec 2, 2024
8648c52
make : deprecate (#10514)
ggerganov Dec 2, 2024
642330a
llama : add enum for built-in chat templates (#10623)
ngxson Dec 2, 2024
70b98fa
server : fix default draft model parameters (#10586)
ggerganov Dec 3, 2024
844e2e1
github : minify link [no ci]
ggerganov Dec 3, 2024
515d4e5
github : minify link [no ci] (revert)
ggerganov Dec 3, 2024
0115df2
metal : small-batch mat-mul kernels (#10581)
ggerganov Dec 3, 2024
82bca22
readme : add option, update default value, fix formatting (#10271)
pothitos Dec 3, 2024
3b4f2e3
llama : add missing LLAMA_API for llama_chat_builtin_templates (#10636)
ngxson Dec 3, 2024
667d70d
metal : add `GGML_OP_CONV_TRANSPOSE_1D` kernels (ggml/1026)
PABannier Nov 28, 2024
efb6ae9
feat: add `GGML_UNARY_OP_ARGMAX` Metal kernel (ggml/1019)
PABannier Dec 2, 2024
e9e661b
CUDA: remove unnecessary warp reduce in FA (ggml/1032)
mahorozte Dec 3, 2024
c505471
sync : ggml
ggerganov Dec 3, 2024
1cd3df4
scripts : remove amx sync
ggerganov Dec 3, 2024
91c36c2
server : (web ui) Various improvements, now use vite as bundler (#10599)
ngxson Dec 3, 2024
cc98896
vulkan: optimize and reenable split_k (#10637)
jeffbolznv Dec 3, 2024
01e6d9b
clip : add sycl support (#10574)
piDack Dec 4, 2024
da6aac9
Add docs for creating a static build (#10268) (#10630)
mostlygeek Dec 4, 2024
cd2f37b
Avoid using __fp16 on ARM with old nvcc (#10616)
frankier Dec 4, 2024
98036d5
fix typo of README.md (#10605)
WrRan Dec 4, 2024
40c6d79
SYCL : Move to compile time oneMKL interface backend selection for NV…
s-Nick Dec 4, 2024
2759916
vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (…
jeffbolznv Dec 4, 2024
8d0cfd5
llama: Support MiniCPM-1B (with & w/o longrope) (#10559)
JFLFY2255 Dec 4, 2024
253b7fd
Fix HF repo commit to clone lora test models (#10649)
ltoniazzi Dec 4, 2024
2803540
ggml-cpu : fix HWCAP2_I8MM value (#10646)
slaren Dec 4, 2024
59f4db1
ggml : add predefined list of CPU backend variants to build (#10626)
slaren Dec 4, 2024
1da7b76
server : fix speculative decoding with context shift (#10641)
ggerganov Dec 4, 2024
f112d19
Update deprecation-warning.cpp (#10619)
aryantandon01 Dec 4, 2024
d405804
py : update outdated copy-paste instructions [no ci] (#10667)
danbev Dec 5, 2024
c2082d9
ggml : add `GGML_PAD_REFLECT_1D` operation (ggml/1034)
PABannier Dec 3, 2024
a8cbab2
ggml: add `GGML_SET` Metal kernel + i32 CPU kernel (ggml/1037)
PABannier Dec 4, 2024
0cd182e
sync : ggml
ggerganov Dec 5, 2024
6fe6247
llama : add Minerva 7B model support (#10673)
Riccorl Dec 5, 2024
c9c6e01
vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash a…
jeffbolznv Dec 5, 2024
7736837
fix(server) : not show alert when DONE is received (#10674)
pminev Dec 5, 2024
6c5bc06
server : (refactoring) do not rely on JSON internally (#10643)
ngxson Dec 6, 2024
f162d45
common : bring back --no-warmup to server (#10686)
ngxson Dec 6, 2024
c5ede38
convert : add custom attention mapping
ggerganov Dec 6, 2024
784a14a
convert : add support for Roberta embeddings (#10695)
Ssukriti Dec 7, 2024
86a1934
metal : Extend how Llama.cpp locates metal resources (#10676)
ormandi Dec 7, 2024
3df784b
Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processi…
0cc4m Dec 7, 2024
c2a16c0
server : fix free of spec context and batch (#10651)
ggerganov Dec 7, 2024
19d8762
ggml : refactor online repacking (#10446)
Djip007 Dec 7, 2024
ce4a7b8
server : various fixes (#10704)
ggerganov Dec 7, 2024
d9c3ba2
ggml : disable iq4_nl interleave size 8 (#10709)
ggerganov Dec 7, 2024
3573fa8
server : (refactor) no more json in server_task input (#10691)
ngxson Dec 7, 2024
62e84d9
llama : add 128k yarn context for Qwen (#10698)
robbiemu Dec 7, 2024
ecc93d0
vulkan: compile a test shader in cmake to check for coopmat2 support …
jeffbolznv Dec 8, 2024
43ed389
llama : use cmake for swift build (#10525)
slaren Dec 8, 2024
06d7014
Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (…
stduhpf Dec 8, 2024
e52522b
server : bring back info of final chunk in stream mode (#10722)
ngxson Dec 8, 2024
ce8784b
server : fix format_infill (#10724)
ngxson Dec 8, 2024
1a05004
cmake : simplify msvc charsets (#10672)
iboB Dec 9, 2024
3d98b4c
vulkan: fix compile warnings (#10731)
jeffbolznv Dec 9, 2024
c37fb4c
Changes to CMakePresets.json to add ninja clang target on windows (#1…
Srihari-mcw Dec 9, 2024
26a8406
CUDA: fix shared memory access condition for mmv (#10740)
JohannesGaessler Dec 9, 2024
a05e2af
vulkan: disable spirv-opt for coopmat shaders (#10763)
jeffbolznv Dec 10, 2024
a86ad84
server : add flag to disable the web-ui (#10762) (#10751)
eugeniosegala Dec 10, 2024
750cb3e
CUDA: rename macros to avoid conflicts with WinAPI (#10736)
aendk Dec 10, 2024
ae4b922
imatrix : Add imatrix to --no-context-shift (#10766)
bartowski1182 Dec 10, 2024
dafae66
vulkan: dynamic subgroup size for the remaining k quants (#10745)
netrunnereve Dec 10, 2024
b685daf
vulkan: request round-to-even for fp16 in im2col/rope_head (#10767)
jeffbolznv Dec 10, 2024
43041d2
ggml: load all backends from a user-provided search path (#10699)
giladgd Dec 11, 2024
vulkan: optimize and reenable split_k (ggml-org#10637)
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
jeffbolznv authored Dec 3, 2024

commit cc98896db858df7aa40d0e16a505883ef196a482
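For readers coming to this cold: split-k matrix multiplication divides the K (reduction) dimension of C = A * B across split_k independent dispatches, each of which writes a partial result, and a final reduce pass sums the partials. The sketch below is a plain CPU-side model of that idea, not the Vulkan implementation; the function name splitk_matmul and the row-major layout are assumptions made for illustration.

#include <algorithm>
#include <cstdint>
#include <vector>

// CPU-side model of split-k matmul: each "dispatch" accumulates a partial C over
// its slice of K, then a reduce pass (the role of mul_mat_split_k_reduce) sums them.
void splitk_matmul(const std::vector<float>& A,  // m x k, row-major
                   const std::vector<float>& B,  // k x n, row-major
                   std::vector<float>& C,        // m x n, row-major (preallocated)
                   uint32_t m, uint32_t n, uint32_t k, uint32_t split_k) {
    std::vector<float> partial(split_k * m * n, 0.0f);
    const uint32_t k_chunk = (k + split_k - 1) / split_k;

    for (uint32_t s = 0; s < split_k; ++s) {          // on the GPU these run in parallel
        const uint32_t k0 = s * k_chunk;
        const uint32_t k1 = std::min(k, k0 + k_chunk);
        for (uint32_t i = 0; i < m; ++i)
            for (uint32_t j = 0; j < n; ++j)
                for (uint32_t kk = k0; kk < k1; ++kk)
                    partial[s * m * n + i * n + j] += A[i * k + kk] * B[kk * n + j];
    }

    for (uint32_t e = 0; e < m * n; ++e) {            // the reduce pass
        float acc = 0.0f;
        for (uint32_t s = 0; s < split_k; ++s)
            acc += partial[s * m * n + e];
        C[e] = acc;
    }
}

Splitting only pays off when there are too few output tiles to occupy the GPU, which is what the heuristic later in this diff checks.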
51 changes: 40 additions & 11 deletions ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -165,6 +165,7 @@ struct vk_device_struct {
vk_queue transfer_queue;
bool single_queue;
uint32_t subgroup_size;
uint32_t shader_core_count;
bool uma;

size_t idx;
@@ -1498,7 +1499,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device->pipeline_get_rows_f32[GGML_TYPE_Q8_0], "get_rows_q8_0_f32", get_rows_q8_0_f32_len, get_rows_q8_0_f32_data, "main", 3, sizeof(vk_op_binary_push_constants), {1024, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_get_rows_f32[GGML_TYPE_IQ4_NL], "get_rows_iq4_nl_f32", get_rows_iq4_nl_f32_len, get_rows_iq4_nl_f32_data, "main", 3, sizeof(vk_op_binary_push_constants), {1024, 1, 1}, {}, 1);

ggml_vk_create_pipeline(device, device->pipeline_matmul_split_k_reduce, "split_k_reduce", split_k_reduce_len, split_k_reduce_data, "main", 2, 2 * sizeof(uint32_t), {256, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_matmul_split_k_reduce, "split_k_reduce", split_k_reduce_len, split_k_reduce_data, "main", 2, 2 * sizeof(uint32_t), {256 * 4, 1, 1}, {}, 1);

ggml_vk_create_pipeline(device, device->pipeline_mul_mat_vec_p021_f16_f32, "mul_mat_vec_p021_f16_f32", mul_mat_vec_p021_f16_f32_len, mul_mat_vec_p021_f16_f32_data, "main", 3, 6 * sizeof(uint32_t), {1, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_mul_mat_vec_nc_f16_f32, "mul_mat_vec_nc_f16_f32", mul_mat_vec_nc_f16_f32_len, mul_mat_vec_nc_f16_f32_data, "main", 3, 7 * sizeof(uint32_t), {1, 1, 1}, {}, 1);
@@ -1610,23 +1611,36 @@ static vk_device ggml_vk_get_device(size_t idx) {
const std::vector<vk::ExtensionProperties> ext_props = device->physical_device.enumerateDeviceExtensionProperties();

bool maintenance4_support = false;
bool sm_builtins = false;

// Check if maintenance4 is supported
for (const auto& properties : ext_props) {
if (strcmp("VK_KHR_maintenance4", properties.extensionName) == 0) {
maintenance4_support = true;
} else if (strcmp("VK_NV_shader_sm_builtins", properties.extensionName) == 0) {
sm_builtins = true;
}
}

vk::PhysicalDeviceProperties2 props2;
vk::PhysicalDeviceMaintenance3Properties props3;
vk::PhysicalDeviceMaintenance4Properties props4;
vk::PhysicalDeviceSubgroupProperties subgroup_props;
vk::PhysicalDeviceShaderSMBuiltinsPropertiesNV sm_props;
props2.pNext = &props3;
props3.pNext = &subgroup_props;

VkBaseOutStructure * last_struct = (VkBaseOutStructure *)&subgroup_props;

if (maintenance4_support) {
subgroup_props.pNext = &props4;
last_struct->pNext = (VkBaseOutStructure *)&props4;
last_struct = (VkBaseOutStructure *)&props4;
}
if (sm_builtins) {
last_struct->pNext = (VkBaseOutStructure *)&sm_props;
last_struct = (VkBaseOutStructure *)&sm_props;
}

device->physical_device.getProperties2(&props2);
device->properties = props2.properties;

@@ -1643,6 +1657,11 @@ static vk_device ggml_vk_get_device(size_t idx) {
device->vendor_id = device->properties.vendorID;
device->subgroup_size = subgroup_props.subgroupSize;
device->uma = device->properties.deviceType == vk::PhysicalDeviceType::eIntegratedGpu;
if (sm_builtins) {
device->shader_core_count = sm_props.shaderSMCount;
} else {
device->shader_core_count = 0;
}

bool fp16_storage = false;
bool fp16_compute = false;
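The two hunks above follow the standard Vulkan pattern of chaining optional property structs onto pNext before a single getProperties2 call, so VK_NV_shader_sm_builtins is only queried where the extension exists. A minimal, self-contained sketch of that pattern using Vulkan-Hpp (the function name query_shader_core_count is invented for illustration; error handling and instance/device setup are omitted):

#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <cstring>

// Conditionally chain the VK_NV_shader_sm_builtins properties struct onto pNext
// and read the SM count, mirroring the approach in the diff above.
uint32_t query_shader_core_count(vk::PhysicalDevice physical_device) {
    bool sm_builtins = false;
    for (const auto & props : physical_device.enumerateDeviceExtensionProperties()) {
        if (strcmp("VK_NV_shader_sm_builtins", props.extensionName) == 0) {
            sm_builtins = true;
        }
    }

    vk::PhysicalDeviceProperties2 props2;
    vk::PhysicalDeviceSubgroupProperties subgroup_props;
    vk::PhysicalDeviceShaderSMBuiltinsPropertiesNV sm_props;

    props2.pNext = &subgroup_props;
    VkBaseOutStructure * last_struct = (VkBaseOutStructure *)&subgroup_props;
    if (sm_builtins) {
        // only append the NV struct when the extension is present
        last_struct->pNext = (VkBaseOutStructure *)&sm_props;
        last_struct = (VkBaseOutStructure *)&sm_props;
    }

    physical_device.getProperties2(&props2);
    return sm_builtins ? sm_props.shaderSMCount : 0;  // 0 means "unknown", as in the diff
}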
@@ -2732,15 +2751,25 @@ static void ggml_vk_buffer_memset(vk_buffer& dst, size_t offset, uint32_t c, siz
dst->device->device.resetFences({ dst->device->fence });
}

static uint32_t ggml_vk_guess_split_k(int m, int n, int k) {
static uint32_t ggml_vk_guess_split_k(ggml_backend_vk_context * ctx, int m, int n, int k, const vk_pipeline& pipeline) {
VK_LOG_DEBUG("ggml_vk_guess_split_k(" << m << ", " << n << ", " << k << ")");
// if (k > 128 && (m < 128 || n < 128) && m > 2 && n > 2) {
// return 4;
// }

return 1;
uint32_t split_k = 1;
if (ctx->device->shader_core_count != 0 && m >= (int)pipeline->wg_denoms[0] && n >= (int)pipeline->wg_denoms[1]) {
// If k is 'large' and the SMs will fill less than halfway, use split_k.
uint32_t m_tiles = CEIL_DIV(m, pipeline->wg_denoms[0]);
uint32_t n_tiles = CEIL_DIV(n, pipeline->wg_denoms[1]);
if (k >= 2048 && m_tiles * n_tiles < ctx->device->shader_core_count / 2) {
split_k = ctx->device->shader_core_count / (m_tiles * n_tiles);
// Clamp to 2 or 4
split_k = std::min(split_k, 4u);
if (split_k == 3) {
split_k = 2;
}
}
}

GGML_UNUSED(m); GGML_UNUSED(n); GGML_UNUSED(k);
return split_k;
}

static vk_pipeline ggml_vk_guess_matmul_pipeline_amd(ggml_backend_vk_context * ctx, vk_matmul_pipeline& mmp, int m, int n, bool aligned) {
@@ -2964,10 +2993,10 @@ static void ggml_vk_mul_mat_q_f16(ggml_backend_vk_context * ctx, vk_context& sub
const uint32_t kpad = ggml_vk_align_size(ne10, ggml_vk_guess_matmul_pipeline_align(ctx, mmp, ne01, ne11));
const bool aligned = ne10 == kpad && ne01 > 8 && ne11 > 8;

const uint32_t split_k = ggml_vk_guess_split_k(ne01, ne11, ne10);

vk_pipeline pipeline = ggml_vk_guess_matmul_pipeline(ctx, mmp, ne01, ne11, aligned);

const uint32_t split_k = ggml_vk_guess_split_k(ctx, ne01, ne11, ne10, pipeline);

const uint64_t qx_sz = ggml_type_size(src0->type) * x_ne / ggml_blck_size(src0->type);
const uint64_t qy_sz = ggml_type_size(src1->type) * y_ne / ggml_blck_size(src1->type);
const uint64_t x_sz = !qx_needs_dequant ? qx_sz : sizeof(ggml_fp16_t) * x_ne;
@@ -2993,7 +3022,7 @@ static void ggml_vk_mul_mat_q_f16(ggml_backend_vk_context * ctx, vk_context& sub
if (dryrun) {
const uint64_t x_sz_upd = x_sz * ne02 * ne03;
const uint64_t y_sz_upd = y_sz * ne12 * ne13;
const uint64_t split_k_size = split_k > 1 ? d_sz * ne12 * ne13 * 4 : 0;
const uint64_t split_k_size = split_k > 1 ? d_sz * ne12 * ne13 * split_k : 0;
if (
(qx_needs_dequant && x_sz_upd > ctx->device->max_memory_allocation_size) ||
(qy_needs_dequant && y_sz_upd > ctx->device->max_memory_allocation_size) ||
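Taking the new ggml_vk_guess_split_k heuristic above in isolation: with shader_core_count = 132 and a 128x128 workgroup tile (illustrative numbers), an m = 128, n = 128, k = 4096 multiply produces a single output tile, far below half the SM count, so split_k is clamped to 4; a 4096x4096 multiply already yields 1024 tiles and keeps split_k = 1. The standalone sketch below mirrors that selection logic; guess_split_k, ceil_div and the wg_denom_* parameters are names made up for the example, not the ggml API.

#include <algorithm>
#include <cstdint>

static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Standalone model of the split_k choice from the diff above.
// wg_denom_m / wg_denom_n stand in for pipeline->wg_denoms[0] / [1].
static uint32_t guess_split_k(uint32_t m, uint32_t n, uint32_t k,
                              uint32_t wg_denom_m, uint32_t wg_denom_n,
                              uint32_t shader_core_count) {
    uint32_t split_k = 1;
    if (shader_core_count != 0 && m >= wg_denom_m && n >= wg_denom_n) {
        const uint32_t m_tiles = ceil_div(m, wg_denom_m);
        const uint32_t n_tiles = ceil_div(n, wg_denom_n);
        // split only when k is large and the output tiles fill less than half the SMs
        if (k >= 2048 && m_tiles * n_tiles < shader_core_count / 2) {
            split_k = shader_core_count / (m_tiles * n_tiles);
            split_k = std::min(split_k, 4u);  // clamp to 2 or 4
            if (split_k == 3) {
                split_k = 2;
            }
        }
    }
    return split_k;
}

// guess_split_k( 128,  128, 4096, 128, 128, 132) -> 4
// guess_split_k(4096, 4096, 4096, 128, 128, 132) -> 1 (plenty of tiles already)

Note also that the temporary buffer reserved in the dryrun path now scales with the chosen split_k (d_sz * ne12 * ne13 * split_k) instead of a fixed factor of 4.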
31 changes: 25 additions & 6 deletions ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_split_k_reduce.comp
@@ -5,25 +5,44 @@
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

layout (binding = 0) readonly buffer A {float data_a[];};
layout (binding = 0) readonly buffer A4 {vec4 data_a4[];};
layout (binding = 1) writeonly buffer D {float data_d[];};
layout (binding = 1) writeonly buffer D4 {vec4 data_d4[];};

layout (push_constant) uniform parameter {
uint ne;
uint k_num;
} p;

void main() {
const uint idx = gl_GlobalInvocationID.x;
// Each invocation handles four consecutive components
const uint idx = gl_GlobalInvocationID.x * 4;

if (idx >= p.ne) {
return;
}

float result = 0.0f;
// Check if all four components are in bounds and aligned,
// then use vector loads
if (idx + 3 < p.ne && (p.ne % 4) == 0) {
vec4 result = vec4(0.0f);

[[unroll]] for (uint i = 0; i < p.k_num; i++) {
result += data_a[i * p.ne + idx];
}
[[unroll]] for (uint i = 0; i < p.k_num; i++) {
result += data_a4[(i * p.ne + idx) / 4];
}

data_d4[idx / 4] = result;
} else {
[[unroll]] for (uint j = 0; j < 4; ++j) {
if (idx + j < p.ne) {
float result = 0.0f;

data_d[idx] = result;
[[unroll]] for (uint i = 0; i < p.k_num; i++) {
result += data_a[i * p.ne + idx + j];
}

data_d[idx + j] = result;
}
}
}
}
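As a plain-C++ reading of the updated shader: each invocation now covers four consecutive output components; when all four are in range and ne is a multiple of 4 it accumulates the k_num partial results via vector loads (data_a4/data_d4 in the GLSL), otherwise it falls back to the scalar loop for the tail. The host-side model below only illustrates that control flow; split_k_reduce_invocation is an invented name and the real work is done by the compute shader.

#include <cstdint>
#include <vector>

// data_a holds k_num partial results of ne floats each; data_d receives the final sums.
// Models one shader invocation (gl_GlobalInvocationID.x == invocation_id).
void split_k_reduce_invocation(const std::vector<float>& data_a, std::vector<float>& data_d,
                               uint32_t ne, uint32_t k_num, uint32_t invocation_id) {
    const uint32_t idx = invocation_id * 4;  // four consecutive components per invocation
    if (idx >= ne) {
        return;
    }
    if (idx + 3 < ne && (ne % 4) == 0) {
        // fast path: the GLSL version does these four adds with one vec4 load per pass
        float result[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (uint32_t i = 0; i < k_num; ++i) {
            for (uint32_t c = 0; c < 4; ++c) {
                result[c] += data_a[i * ne + idx + c];
            }
        }
        for (uint32_t c = 0; c < 4; ++c) {
            data_d[idx + c] = result[c];
        }
    } else {
        // scalar tail for a final, possibly partial, group of four
        for (uint32_t j = 0; j < 4; ++j) {
            if (idx + j < ne) {
                float result = 0.0f;
                for (uint32_t i = 0; i < k_num; ++i) {
                    result += data_a[i * ne + idx + j];
                }
                data_d[idx + j] = result;
            }
        }
    }
}

This also explains the dispatch change in ggml-vulkan.cpp: the split_k_reduce pipeline's element count per workgroup goes from 256 to 256 * 4 because each of the 256 invocations now produces four outputs.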