merge from upstream #54

l3utterfly · 2025-02-24T15:21:05Z

No description provided.

* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX
* Optimize mul_sum_i8_pairs_float for LoongArch ASX
* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX

This commit adds completion for `--chat-template-file`, enabling only `.jinja` files to be displayed as completions.

Example usage:

```console
$ ./build/bin/llama-cli --chat-template-file models/templates/<TAB>
models/templates/CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja
models/templates/CohereForAI-c4ai-command-r-plus-tool_use.jinja
models/templates/deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja
models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
models/templates/fireworks-ai-llama-3-firefunction-v2.jinja
models/templates/google-gemma-2-2b-it.jinja
models/templates/llama-cpp-deepseek-r1.jinja
models/templates/meetkai-functionary-medium-v3.1.jinja
models/templates/meetkai-functionary-medium-v3.2.jinja
models/templates/meta-llama-Llama-3.1-8B-Instruct.jinja
models/templates/meta-llama-Llama-3.2-3B-Instruct.jinja
models/templates/meta-llama-Llama-3.3-70B-Instruct.jinja
models/templates/microsoft-Phi-3.5-mini-instruct.jinja
models/templates/mistralai-Mistral-Nemo-Instruct-2407.jinja
models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja
models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
models/templates/Qwen-Qwen2.5-7B-Instruct.jinja
```

This is not limited to the models/templates directory; it works anywhere in the filesystem. The above is just an example.
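The filtering rule described above can be sketched in a few lines. This is only an illustration of the idea (offer directories and `.jinja` files, hide everything else), not the completion script shipped by the commit; the helper name `complete_jinja` is hypothetical.

```python
from pathlib import Path


def complete_jinja(prefix: str) -> list[str]:
    """Return completion candidates for a partially typed path.

    Hypothetical helper: directories are kept so the user can descend
    into them; regular files are offered only if they end in ".jinja".
    """
    typed = Path(prefix)
    # A trailing slash means "list this directory"; otherwise complete
    # the last path component inside its parent directory.
    if prefix.endswith("/"):
        directory, stem = typed, ""
    else:
        directory, stem = typed.parent, typed.name

    candidates: list[str] = []
    if not directory.is_dir():
        return candidates

    for entry in sorted(directory.iterdir()):
        if not entry.name.startswith(stem):
            continue
        if entry.is_dir():
            candidates.append(str(entry) + "/")
        elif entry.suffix == ".jinja":
            candidates.append(str(entry))
    return candidates
```

Calling `complete_jinja("models/templates/")` in a checkout would list only the `.jinja` templates and any subdirectories of that folder.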

* vulkan: support memset_tensor
* vulkan: support GGML_OP_SUM
* vulkan: implement GGML_OP_ARGMAX
* vulkan: implement GGML_OP_SUB
* vulkan: implement GGML_OP_COUNT_EQUAL
* vulkan: implement GGML_OP_OPT_STEP_ADAMW
* vulkan: fix check_results RWKV_WKV6 crash and memory leaks
* vulkan: implement GGML_OP_REPEAT_BACK
* tests: remove invalid test-backend-ops REPEAT_BACK tests
* vulkan: fix COUNT_EQUAL memset using a fillBuffer command

This commit fixes an issue in the llama.cpp project where the command for testing the llama-server object contained a duplicated file extension.

The original command was:

    ./tests.sh unit/test_chat_completion.py.py -v -x

It has been corrected to:

    ./tests.sh unit/test_chat_completion.py -v -x

This change ensures that the test script correctly locates and executes the intended test file, preventing test failures due to an incorrect file name.

* server : add TEI API format for /rerank endpoint
* Apply suggestions from code review

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix
* also gitignore examples/server/*.gz.hpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
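For readers unfamiliar with the two payload shapes, here is a rough sketch of how they might map onto each other. The field names are assumptions taken from the public TEI and Jina-style rerank APIs (`query`/`texts` with a flat list of `{index, score}` items versus `query`/`documents` with a `results` list carrying `relevance_score`); they are not copied from this commit, and both helper names are hypothetical.

```python
def to_tei_request(query: str, documents: list[str]) -> dict:
    """Build a TEI-style rerank request body (assumed field names)."""
    return {"query": query, "texts": documents}


def from_tei_response(items: list[dict]) -> dict:
    """Convert an assumed TEI-style response into a results-style body.

    The input is expected to look like [{"index": 0, "score": 0.9}, ...].
    The output is sorted by descending relevance.
    """
    results = [
        {"index": item["index"], "relevance_score": item["score"]}
        for item in items
    ]
    results.sort(key=lambda r: r["relevance_score"], reverse=True)
    return {"results": results}
```

A client could build the request with `to_tei_request("what is a panda?", ["hi", "it is a bear"])` and normalise whatever comes back with `from_tei_response`.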

…1900)

* tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type
* addressed clang-tidy lints in [test-]chat.*
* rm minja deps from util & common & move it to common/minja/
* add name & tool_call_id to common_chat_msg
* add common_chat_tool
* added json <-> tools, msgs conversions to chat.h
* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)
* fix deepseek r1 slow test (no longer <think> opening w/ new template)
* allow empty tools w/ auto + grammar
* fix & test server grammar & json_schema params w/ & w/o --jinja

…n iframe) (#11940)

* Webui: Enable communication with parent html (if webui is in iframe):
  - Listens for "setText" command from parent with "text" and "context" fields. "text" is set in inputMsg; "context" is used as hidden context on the following requests to the llama.cpp server
  - On pressing the Escape button, sends command "escapePressed" to the parent

  Example handling from the parent html side:
  - Send command "setText" from parent html to webui in iframe:

        const iframe = document.getElementById('askAiIframe');
        if (iframe) {
            iframe.contentWindow.postMessage({ command: 'setText', text: text, context: context }, '*');
        }

  - Listen for Escape key from webui on parent html:

        // Listen for escape key event in the iframe
        window.addEventListener('keydown', (event) => {
            if (event.key === 'Escape') {
                // Process case when Escape is pressed inside webui
            }
        });

* Move the extraContext from storage to app.context.
* Fix formatting.
* add Message.extra
* format + build
* MessageExtraContext
* build
* fix display
* rm console.log

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

This commit adds a preset for llama.vim to use the default Qwen 2.5 Coder models.

The motivation for this change is to make it easier to start a server suitable to be used with the llama.vim plugin. For example, the server can be started with a command like the following:

```console
$ llama-server --fim-qwen-1.5b-default
```

Refs: #10932

This commit updates the help view in the llama.swiftui example to use a NavigationView and a Done button to dismiss the help view.

The motivation for this is that without this change there is no way to dismiss the help view.

* llava: export function `clip_build_img_from_pixels` to build image from pixels decoded by other libraries instead of stb_image.h for better performance
* Apply suggestions from code review

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* ggml: add s390x ARCH_FLAGS for compilation
* ggml: add SIMD for s390x using vector intrinsics

  SIMD is activated for:
  * ggml_vec_dot_f32
  * ggml_vec_dot_f16
  * ggml_vec_mad_f32
  * ggml_vec_mad_f16
  * ggml_vec_mad_f32_unroll
  * ggml_vec_scale_f32
  * ggml_vec_scale_f16

  SIMD is NOT activated for:
  * ggml_vec_dot_f16_unroll (pending bugfix)
* ggml: fix missing escape character in GGML_F32x4_REDUCE
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR
* ggml: fix s390x GGML_F32x4_REDUCE
* ggml: full SIMD activation for F32,F16 s390x
* ggml: add option to disable s390x VXE/VXE2
* ggml: change vecintrin.h include to ggml-cpu-impl
  * add __VXE__ and __VXE2__ macros
* cmake: add s390x target detection for VX/VXE/VXE2
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h
* ggml: s390x Q8_0 SIMD
* ggml: correct documentation for Q8_0
* ggml: s390x reduce code complexity Q8_0
* ggml: s390x bugfix typo Q8_0
* ggml: s390x SIMD activated for Q4_1
* ggml: s390x inline vec_reve
* ggml: s390x SIMD activation for Q4_0
* ggml: add VXE backend feature
* ggml: remove test.py
* ggml: s390x SIMD activation for quantize_row_q8_0
* ggml: s390x SIMD activation for quantize_row_q8_1
* ggml: s390x SIMD activation for iq4_xs
* ggml: bugfix iq4_xs
* ggml: s390x SIMD activation for iq4_nl
* ggml: add float, double, and long vector data type
* ggml: clean up iq4_xs SIMD
* ggml: fix improper use of restrict keyword
* ggml: update warning message for ggml_vec_tbl
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs
* ggml: switch to restrict for iq4_nl
* ggml: slight dot product speed improvement for q4_1_q8_1
* ggml: s390x SIMD activation for q6_K
* ggml: add missing `_t` to ggml_int8x16x4_t
* ggml: fix missing `_t` for ggml_vec_xl_s8x4
* ggml: fix more missing `_t`
* ggml: add unroll and prefetch to Q8_0

  increase of 3.86% for prompt processing and 32.22% for token generation
* ggml: patch Q8_0 to use proper vector sizes
* ggml: optimise Q8_0 dot prod compute kernel further
* ggml: add unroll and prefetch to Q4_1
* ggml: refactor Q6_K variable naming for readability
* ggml: fix Q6_K typos
* ggml: s390x SIMD activation for Q5_K
* ggml: fix wrong char*x16_t naming
* ggml: Q5_K y0 wrong signness
* ggml: fix Q5_K invalid uchar type
* ggml: fix Q5_K invalid uchar type
* ggml: s390x SIMD activation for Q4_K
* ggml: fix Q4_K invalid vector intrinsics
* ggml: simplify ggml_padd_s16 compute kernel
* ggml: correct ggml-cpu vxe wording
* ggml: change ggml_aligned_malloc alignment to 256

  256 is the cache line size for s390x platforms
* ggml: resolve pr merge via cherry-pick 225bbbf
* ggml : fix LoongArch compile error with 128-bit SIMD (#11701)
* ggml: resolve pr merge via cherry-pick 4571953
* ggml: cmake remove fork when determining s390x machine type

  thank you @ericcurtin

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>

Use consolidated open function call from File class. Change read_all to to_string(). Remove exclusive locking: the intent of that lock is to avoid multiple processes writing to the same file, which is not an issue for readers, although we may want to consider adding a shared lock. Remove passing nullptr as reference, since references are never supposed to be null. clang-format the code for consistent styling.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
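The distinction between the exclusive lock that was removed and the shared lock floated as a possible follow-up can be illustrated with POSIX advisory locks. This is a generic sketch using Python's `fcntl.flock` on a Unix-like system, not the C++ File class touched by the commit; both function names are hypothetical.

```python
import fcntl


def read_with_shared_lock(path: str) -> str:
    """Read a file while holding a shared (reader) lock."""
    with open(path, "r", encoding="utf-8") as f:
        # LOCK_SH can be held by many readers at once, but it blocks
        # while another process holds LOCK_EX on the same file.
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def append_with_exclusive_lock(path: str, text: str) -> None:
    """Append to a file while holding an exclusive (writer) lock."""
    with open(path, "a", encoding="utf-8") as f:
        # LOCK_EX admits a single holder, so concurrent writers
        # cannot interleave their output.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(text)
            f.flush()  # push buffered data out before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Readers that take no lock at all, as in the commit, never block; the shared lock only matters if readers must not observe a half-written file.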

* opt performance by reorder for Intel GPU
* detect hw type and save opt feature, and print opt feature
* correct name
* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI pass
* add env variable GGML_SYCL_DISABLE_OPT for debug
* use syclex::architecture to replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT
* add performance data
* mv getrows functions to separated files
* fix global variables

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>

MQ-mengqing and others added 30 commits February 14, 2025 10:54

docker : drop to CUDA 12.4 (#11869)

dbc2ec5

* docker : drop to CUDA 12.4
* docker : update readme [no ci]

cuda : add ampere to the list of default architectures (#11870)

94b87f8

opencl: Fix rope and softmax (#11833)

300907b

* opencl: fix `ROPE`
* opencl: fix `SOFT_MAX`
* Add fp16 variant
* opencl: enforce subgroup size for `soft_max`

llguidance build fixes for Windows (#11664)

89daa25

* setup windows linking for llguidance; thanks @phil-scott-78
* add build instructions for windows and update script link
* change VS Community link from DE to EN
* whitespace fix

vulkan: initial support for IQ1_S and IQ1_M quantizations (#11528)

fc1b0d0

* vulkan: initial support for IQ1_S and IQ1_M quantizations
* vulkan: define MMV kernels for IQ1 quantizations
* devops: increase timeout of Vulkan tests again
* vulkan: simplify ifdef for init_iq_shmem

server: fix type promotion typo causing crashes w/ --jinja w/o tools (#11880)

f355229

repo : update links to new url (#11886)

68ff663

* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci

readme : add notice about new package registry (#11890)

c2cd24f

* readme : add notice about new package registry
* cont : fix whitespace

metal : optimize dequant q6_K kernel (#11892)

2288510

examples: fix typo in imatrix/README.md (#11884)

fc10c38

* simple typo fixed
* Update examples/imatrix/README.md

---------

Co-authored-by: Tobias Bergmann <tobias.bergmann@gmx.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

scripts: fix compare-llama-bench commit hash logic (#11891)

6dde178

metal : fix the crash caused by the lack of residency set support on Intel Macs. (#11904)

c2ea16f

vulkan: support multi/vision rope, and noncontiguous rope (#11902)

bf42a23

ci : fix (again) arm64 build fails (#11895)

818a340

* docker : attempt fixing arm64 build on ci
* qemu v7.0.0-28

common : Fix a typo in help (#11899)

fe163d5

This patch fixes a typo in command help.

prefx -> prefix

Signed-off-by: Masanari Iida <standby24x7@gmail.com>

server : bump httplib to 0.19.0 (#11908)

0f2bbe6

server : fix divide-by-zero in metrics reporting (#11915)

c4d29ba

update release requirements (#11897)

f7b1116

CUDA: use async data loading for FlashAttention (#11894)

73e2ed3

* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>

scripts: corrected encoding when getting chat template (#11866) (#11907)

5137da7

Signed-off-by: MoonRide303 <moonride303@gmail.com>

llama : fix indentation in llama-grammar [no ci] (#11943)

9626d93

This commit adjusts the indentation for the functions `parse_sequence` and `parse_rule` in src/llama-grammar.cpp.

The motivation is consistency and improved readability.

speculative : update default params (#11954)

abd4d0b

* speculative : update default params
* speculative : do not discard the last drafted token

ggerganov and others added 15 commits February 21, 2025 18:33

llama : skip loading unused tensors (#12004)

51f311e

* llama : assign unknown/unused tensors to host buffer type

ggml-ci

* llama : skip unused tensors

ggml-ci

cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000)

d709084

server : disable Nagle's algorithm (#12020)

cf756d6
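Disabling Nagle's algorithm normally means setting the `TCP_NODELAY` socket option, so that small writes (such as individual streamed tokens) are sent immediately instead of being coalesced. The sketch below shows the option on a plain Python socket; it is a generic illustration and makes no claim about how the server or httplib applies it. The function name is hypothetical.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 sends each write without waiting to batch it
    # with later small writes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The trade-off is more, smaller packets on the wire in exchange for lower latency per write.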

ci : Build on Github-hosted arm64 runners (#12009)

335eb04

CUDA: optimize FA for GQA + large batches (#12014)

5fa07c2

ci : fix arm upload artifacts (#12024)

f3e6485

* ci : fix arm upload artifacts
* cont : fix archive name to use matrix

CUDA: add option to compile without FlashAttention (#12025)

a28e0d5

run: allow to customize prompt by env var LLAMA_PROMPT_PREFIX (#12041)

7ad0779

Signed-off-by: Florent Benoit <fbenoit@redhat.com>
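Reading an override of this kind usually follows the pattern below: use the environment variable when it is set and fall back to a built-in default otherwise. Only the variable name `LLAMA_PROMPT_PREFIX` comes from the commit title; the default value `"> "` and the function name are assumptions for illustration.

```python
import os

# Assumed default, for illustration only.
DEFAULT_PREFIX = "> "


def prompt_prefix(env=None) -> str:
    """Return the prompt prefix, honouring LLAMA_PROMPT_PREFIX if set."""
    env = os.environ if env is None else env
    return env.get("LLAMA_PROMPT_PREFIX", DEFAULT_PREFIX)
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the real environment.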

SYCL: Fix GGML_SYCL_DEBUG macro (#11995)

8303e8b

gguf_convert_endian.py: implement byteswapping for q4_k and q6_k (#11349)

651adf4
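Byteswapping a K-quant block only needs to touch its 16-bit fields, because the packed quants and scales are single bytes. The sketch below assumes the usual ggml layouts with `QK_K = 256`: a Q4_K block is `d` (fp16), `dmin` (fp16), 12 scale bytes and 128 quant bytes; a Q6_K block is 128 + 64 quant bytes, 16 scale bytes and a trailing `d` (fp16). It is a pure-Python illustration, not the code from gguf_convert_endian.py, and the function names are hypothetical.

```python
QK_K = 256
# Assumed block sizes: d + dmin + scales + qs, and ql + qh + scales + d.
Q4_K_BLOCK = 2 + 2 + 12 + QK_K // 2                    # 144 bytes
Q6_K_BLOCK = QK_K // 2 + QK_K // 4 + QK_K // 16 + 2    # 210 bytes


def _swap16(buf: bytearray, offset: int) -> None:
    """Swap the two bytes of a 16-bit field in place."""
    buf[offset], buf[offset + 1] = buf[offset + 1], buf[offset]


def byteswap_q4_k(data: bytes) -> bytes:
    """Swap d and dmin in every Q4_K block; other bytes are unchanged."""
    if len(data) % Q4_K_BLOCK:
        raise ValueError("length is not a multiple of the Q4_K block size")
    out = bytearray(data)
    for start in range(0, len(out), Q4_K_BLOCK):
        _swap16(out, start)       # d
        _swap16(out, start + 2)   # dmin
    return bytes(out)


def byteswap_q6_k(data: bytes) -> bytes:
    """Swap the trailing d in every Q6_K block; other bytes are unchanged."""
    if len(data) % Q6_K_BLOCK:
        raise ValueError("length is not a multiple of the Q6_K block size")
    out = bytearray(data)
    for start in range(0, len(out), Q6_K_BLOCK):
        _swap16(out, start + Q6_K_BLOCK - 2)  # d
    return bytes(out)
```

Applying either function twice returns the original bytes, which is a convenient sanity check for any endianness conversion.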

l3utterfly closed this Feb 24, 2025

github-actions bot added the documentation, SYCL, Nvidia GPU, Vulkan, testing, examples, devops, python, android, server, ggml, Apple Metal, script, and nix labels on Feb 24, 2025
