sync : ggml #2237

ggerganov · 2024-06-16T10:11:55Z

No description provided.

* initial commit with CPU implementation of upscale to shape and test, cuda implementation next
* experimental commit to see if dst shape is correct
* test version
* test
* removed unnecessary params
* refactor
* fixed tests
* ggml : metal impl + cleanup + sycl dev warnings
* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior
* metal : fix upscale op to support nb00 + style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive-update counter from incrementing unnecessarily for tokens in which CUDA graphs are disabled due to batch size > 1.
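The counter behaviour described above can be illustrated with a small simulation. This is only a sketch in Python, not the ggml CUDA code: the names `GraphState`, `step`, `run`, and the threshold `MAX_CONSECUTIVE_UPDATES` are hypothetical, and the reset-to-zero policy is an assumption made for the illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold for this sketch; not taken from the PR text.
MAX_CONSECUTIVE_UPDATES = 4


@dataclass
class GraphState:
    consecutive_updates: int = 0
    disabled_too_many_updates: bool = False


def step(state, batch_size, update_required, fixed):
    """Process one evaluation and return True if a graph would be used."""
    # Graphs are only used for single-token batches in this sketch.
    use_graph = (not state.disabled_too_many_updates) and batch_size == 1

    # Before the fix, every evaluation that needed an update was counted.
    # After the fix, only evaluations that actually use a graph are counted.
    counted = use_graph if fixed else True
    if counted and update_required:
        state.consecutive_updates += 1
    else:
        state.consecutive_updates = 0

    if state.consecutive_updates >= MAX_CONSECUTIVE_UPDATES:
        state.disabled_too_many_updates = True
    return use_graph


def run(fixed):
    """Simulate a long prompt (8 large batches) followed by 5 decode steps."""
    state = GraphState()
    used = []
    for _ in range(8):
        used.append(step(state, batch_size=512, update_required=True, fixed=fixed))
    for _ in range(5):
        used.append(step(state, batch_size=1, update_required=False, fixed=fixed))
    return used


if __name__ == "__main__":
    print("before fix:", run(fixed=False))
    print("after fix: ", run(fixed=True))
```

With the old counting rule, the prompt batches alone push the counter past the threshold, so graphs stay off for the single-token steps that follow. With the new rule the prompt batches are not counted and the decode steps still use graphs.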

…/6915)

* Just reordering some structs.
* Adding in the calls to mm_pause
* Passing around the state
* Renaming and moving a bunch of variables around.
* Extracting the logic to its own function.
* Moving some variable definitions into the chunk function.
* Moving some variables around
* moving src1_cont inside
* Moving row_size
* adding the current_chunk
* Reorg the code.
* Formatting to match the orig patch
* starting to set up the chunking variables
* Starting the buildup of the loop
* The yield shouldn't be necessary.
* adding the looping structure based on the chunk configuration.
* Add in the re-chunking code.
* Making it much more likely to rechunk.
* disable resizing if numa is enabled.
* Updating comments with what we've learned.
* Fix formatting
* Couple more formatting fixes.
* More style fixes.
* Fix Warnings
* Going with unused because there's conditional logic that needs it.
* Update ggml.c
* Update ggml.c

---------

… MSVC (llama/7191)

* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS
* build: add CMake Presets and toolchain files for Windows ARM64
* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings
* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM
* matmul-int8: fixed typos in q8_0_q8_0 matmuls

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* matmul-int8: remove unnecessary casts in q8_0_q8_0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

This change upstreams llamafile's vectorized expf() functions. This lets us compute softmax and silu more accurately than the short[65536] lookup table that GGML previously used to make this operation go faster. We can support aarch64 and sse2+ with a worst-case rounding error of 2 ulp. It makes `make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf` go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA, and 2.1x faster on AVX512.
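To make the accuracy argument concrete, here is a rough Python sketch comparing a directly computed exp() with an emulated 65536-entry half-precision table. It assumes the old table was indexed by the fp16 bit pattern of the input and stored fp16 results. That is an assumption for illustration, and the helper names (`to_fp16`, `exp_via_fp16_table`) are hypothetical. It does not reproduce the vectorized expf() itself.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def exp_via_fp16_table(x: float) -> float:
    # Emulates a table lookup keyed by fp16 input with fp16 output:
    # both the argument and the result are quantized to half precision.
    return to_fp16(math.exp(to_fp16(x)))


def softmax(xs, exp_fn):
    m = max(xs)  # subtract the maximum for numerical stability
    es = [exp_fn(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def silu(x, exp_fn):
    return x / (1.0 + exp_fn(-x))


logits = [0.1234567, -1.7654321, 2.3456789, 0.0001234]
exact = softmax(logits, math.exp)
approx = softmax(logits, exp_via_fp16_table)
max_err = max(abs(a - b) for a, b in zip(exact, approx))

if __name__ == "__main__":
    print("max softmax error with emulated fp16 table:", max_err)
    print("silu(1.0) direct:", silu(1.0, math.exp))
    print("silu(1.0) table: ", silu(1.0, exp_via_fp16_table))
```

The table version loses precision twice, once when the input is rounded to fp16 and again when the result is stored as fp16, which is the kind of error a directly computed expf() avoids.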

* logging: output capture in cuda module
* fix compile error
* fix: vsnprintf terminates with 0, string use not correct
* post review
* Update llama.cpp

  Co-authored-by: slaren <slarengh@gmail.com>
* Update llama.cpp

  Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* Fix empty Vulkan host buffers

  Add fp32 fp16 matmul shader

  Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers

* add loongarch lsx and lasx optimize code
* Add loongarch compilation support to makefile
* revert stb_image.h
* opt bytes_from_nibbles_32 and sum_i16_pairs_float
* fix undeclared
* format code
* update
* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>

* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from gguf converter
* fix flint warnings on convert-hf-to-gguf.py
* set to the short freq factor when context size is smaller than trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
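The frequency-factor items above can be sketched roughly as follows. This is a Python illustration, not the ggml_rope_ext implementation. The function names are hypothetical, and two things are assumptions: that the per-dimension rotation angle is divided by its factor, and that a context size equal to the trained size selects the short factors.

```python
def select_freq_factors(n_ctx, n_ctx_orig, short_factors, long_factors):
    """Pick the factor set from the context size alone.

    Assumption: the short factors apply whenever the requested context
    does not exceed the trained context size (including equality).
    """
    return long_factors if n_ctx > n_ctx_orig else short_factors


def rope_angles(pos, n_dims, freq_base, freq_factors=None):
    """Return one rotation angle per pair of dimensions for a position."""
    angles = []
    for i in range(n_dims // 2):
        theta = pos * freq_base ** (-2.0 * i / n_dims)
        # Assumption: each angle is divided by its frequency factor;
        # a missing factor list behaves like all-ones.
        ff = freq_factors[i] if freq_factors is not None else 1.0
        angles.append(theta / ff)
    return angles


if __name__ == "__main__":
    short = [1.0, 1.0]
    long_ = [1.0, 2.0]
    factors = select_freq_factors(8192, 4096, short, long_)
    print(rope_angles(pos=3, n_dims=4, freq_base=10000.0, freq_factors=factors))
```

Dividing by a factor greater than 1 slows the rotation for that dimension pair, which is how a long-context factor set stretches positions beyond the trained range.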

* ggml : unify rope norm/neox (CPU)
* ggml : fix compile warning
* ggml : remove GLM rope mode ggml-ci
* metal : better rope implementation ggml-ci
* cuda : better rope implementation ggml-ci
* naming : n_orig_ctx -> n_ctx_orig ggml-ci
* dev : add reminders to update backends ggml-ci
* vulkan : fix ggml_rope_ext() usage
* cuda : fix array size + indents ggml-ci

* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception

  Minor fixes
* Fix segfault when running out of VRAM

  Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type

  This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related

balisujohn and others added 30 commits June 16, 2024 12:42

Add missing " (llama/7303)

215bcb3

ggml : tag ggml_tensor::backend as deprecated (llama/7290)

1c52d7f

rpc : add command line arg for specifying backend memory

f1c281a

ref: #7293

ggml-quants, llama : removed excess checks (llama/7274)

b321ba3

rpc : set SO_REUSEADDR for the server socket (llama/7320)

d64e133

ref: #7293

CUDA: faster large batch FA without tensor cores (llama/7314)

4fea7a9

ggml : fix quants nans when all the group weights are very close to zero (llama/7313)

653af39

Update and fix Vulkan soft_max and argsort implementations (llama/7237)

449de6a

* Update and fix Vulkan softmax implementation
* Update and fix Vulkan argsort implementation

cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263)

280208a

CUDA: deduplicate FlashAttention code (llama/7352)

e00ace4

android : use "ci-android" branch for CI (llama/7341)

e211897

* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir

cuda : clear error after buffer allocation failure (llama/7376)

0d54e78

ggml: implement quantized KV cache for FA (llama/7372)

570d7fd

ggml : fix another case of quants nans (llama/7387)

acd5935

Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (llama/7258)

7db2a18

ggml-opencl, llama: using reserve() if count already known (llama/7272)

85bbb06

Update SYCL upscale operation (llama/7321)

cc50ea0

* Update SYCL upscale operation
* Formatting
* Remove messages

rpc : track allocated buffers (llama/7411)

2668d57

* rpc : track allocated buffers

  ref: #7407
* rpc : pack rpc_tensor tightly

CUDA: deduplicate mmq code (llama/7397)

ed7eb40

CUDA: fix unused warning in mmq.cu (llama/7442)

aa29372

metal : handle F16 inf values, fix FA partial offload (llama/7434)

d2aa1ce

ggml-ci

agray3 and others added 28 commits June 16, 2024 12:42

Allow number of nodes in CUDA graph to change (llama/7738)

bf0ff58

Previously the code would have failed to cope in the case that the number of nodes changes in an existing CUDA graph. This fixes the issue by removing an unnecessary conditional.

CUDA: refactor mmq, dmmv, mmvq (llama/7716)

048f479

* CUDA: refactor mmq, dmmv, mmvq
* fix out-of-bounds write
* struct for qk, qr, qi
* fix cmake build
* mmq_type_traits

fix softmax r2r result wrong issue (llama/7811)

c5f01ea

vulkan : reuse parent extra for views (llama/7806)

e604adb

* vulkan : reuse parent extra for views
* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <picard12@live.de>

CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)

bb7a50f

use the correct SYCL context for host USM allocations (llama/7777)

fa0b692

Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>

CUDA: use tensor cores for MMQ (llama/7676)

b199187

* CUDA: int8 tensor cores for MMQ (legacy quants)
* fix out-of-bounds writes
* __builtin_assume -> GGML_CUDA_ASSUME
* fix writeback returning too early

CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)

28c0ccf

vulkan: select only one device for single gpu with multiple drivers (llama/7582)

bfb2212

ggml : improve ggml_is_contiguous logic (llama/7856)

035d655

* ggml : improve ggml_is_contiguous logic ggml-ci
* ggml : support more contiguous cases ggml-ci

tests : add non-cont unary tests (llama/7857)

3544c18

* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op" ggml-ci

CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)

e8f4fa0

rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)

08078b9

metal : utilize max shared memory for mul_mat_id (llama/7935)

f8ac7b1

CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)

8abc251

* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision related changes

ggml : remove duplicate include of ggml-common.h (ggml/853)

d2744cc

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

ggml : fix and optimize ppc64le (ggml/849)

ce33d6f

* fix compile issues introduced by loongarch_asx
* restore quant changes to merge
* fix compile issues introduced by loongarch_asx
* further optimize by using vec_msum & vec_sum4s on ppc64le

sync : ggml

92dc0b7

ggml-ci

cmake : fix CUDA build (#0)

b891050

talk-llama : sync llama.cpp

16d44bd

cuda : enable CUDA graphs (#0)

c711647

sycl : sync (#0)

7252394

ggml : remove OpenCL (#0)

b51ff56

cmake : fix sycl build (#0)

f5b667d

ggerganov merged commit 30841fa into master Jun 16, 2024
94 checks passed

ggerganov deleted the sync branch June 16, 2024 15:19
