[pull] master from ggerganov:master #171
Signed-off-by: thxCode <thxcode0824@gmail.com>
* Add initial ggml cmake package
* Add build numbers to ggml find-package
* Expand variables with GGML_ prefix
* Guard against adding to cache variable twice
* Add git to msys2 workflow
* Handle ggml-cpu-* variants
* Link ggml/ggml-base libraries to their targets
* Replace main-cmake-pkg with simple-cmake-pkg
* Interface features require c_std_90
* Fix typo
* Removed unnecessary bracket from status message
* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…11422) Signed-off-by: rare-magma <rare-magma@posteo.eu>
* metal : use residency sets ggml-ci
* metal : restore commandBufferWithUnretainedReferences calls [no ci]
* metal : release descriptors ggml-ci
* metal : check env GGML_METAL_NO_RESIDENCY ggml-ci
* metal : fix build + clean-up ggml-ci
* ci : do not fail-fast for docker
* build arm64/amd64 separately
* fix pip
* no fast fail
* vulkan: try jammy
This fixes a segmentation fault when running tests and no Metal devices are available (for example, when not linked with the Core Graphics framework, or otherwise unavailable).
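A minimal sketch of the guard pattern in generic C++ (the names below are illustrative, not the actual Metal backend code): check that device enumeration returned something before indexing the list.

```cpp
#include <cstdio>
#include <vector>

struct device { int id; };

// Stand-in for backend device enumeration; returns an empty list when no
// Metal devices are available.
static std::vector<device> enumerate_devices() { return {}; }

int main() {
    const std::vector<device> devices = enumerate_devices();
    if (devices.empty()) {
        fprintf(stderr, "no Metal devices available, skipping\n");
        return 0; // previously, indexing devices[0] here would segfault
    }
    printf("using device %d\n", devices[0].id);
    return 0;
}
```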
* impl::load: change the bpe_ranks map to an unordered map, reducing impl::load time by 30%
* llama_model_loader::init_mapping: replace new llama_mmap with std::make_unique<llama_mmap> for cleaner code and to halve the running time of init_mappings
* Update src/llama-vocab.cpp

---------

Co-authored-by: lexasub <empty@empty.ru>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
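A hedged sketch of both changes with simplified types (the real key/value types in llama-vocab.cpp may differ): moving from std::map to std::unordered_map trades O(log n) lookups for O(1) average lookups, and pair keys need a hash functor; std::make_unique replaces the raw new-expression.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct llama_mmap { /* stand-in for the real mapping type */ };

// Pair keys need a hash functor before they can be used in an unordered_map.
struct pair_hash {
    std::size_t operator()(const std::pair<std::string, std::string> & p) const {
        return std::hash<std::string>{}(p.first) ^ (std::hash<std::string>{}(p.second) << 1);
    }
};

int main() {
    // Before: std::map pays O(log n) per rank lookup while loading merges;
    // an unordered map gives O(1) average lookups.
    std::unordered_map<std::pair<std::string, std::string>, int, pair_hash> bpe_ranks;
    bpe_ranks.emplace(std::make_pair("a", "b"), 0);

    // Before: std::unique_ptr<llama_mmap> m(new llama_mmap());
    auto mapping = std::make_unique<llama_mmap>();
    (void) mapping;
    return 0;
}
```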
The value provided by `minor` doesn't include the stepping for AMD; parse the value returned by `gcnArchName` instead to retrieve an accurate ID.
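A hedged sketch of the idea: `hipDeviceProp_t::gcnArchName` holds a string such as `"gfx906:sramecc+:xnack-"`, and parsing its numeric part (as hex, since names like gfx90a contain letters) yields an ID that includes the stepping. The helper name and details are illustrative, not the exact llama.cpp implementation.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Parse "gfx906:sramecc+:xnack-" -> 0x906; the final digit is the stepping
// that hipDeviceProp_t::minor does not report.
static int parse_gcn_arch(const char * name) {
    const char * p = strstr(name, "gfx");
    if (p == nullptr) {
        return 0;
    }
    // Hex parse so names like "gfx90a" keep their letter suffix.
    return (int) strtol(p + 3, nullptr, 16);
}

int main() {
    printf("0x%x\n", parse_gcn_arch("gfx906:sramecc+:xnack-")); // 0x906
    printf("0x%x\n", parse_gcn_arch("gfx90a:sramecc+:xnack-")); // 0x90a
    return 0;
}
```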
https://huggingface.co/docs/hub/en/ollama Signed-off-by: Eric Curtin <ecurtin@redhat.com>
The HTTP client in llama-run only prints an error when the download of a resource fails. If the model name is missing from the CLI parameter list, this causes the application to crash. To prevent this, a check for the required model parameter has been added, and errors from resource downloads are now propagated to the caller.

Signed-off-by: Michael Engel <mengel@redhat.com>
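A minimal sketch of the pattern, assuming illustrative names rather than llama-run's actual API: validate the required argument up front and return a status code instead of only printing.

```cpp
#include <cstdio>
#include <string>

// Illustrative stand-in for the HTTP download; returns non-zero on failure
// so the caller can react instead of discovering the problem later.
static int download_resource(const std::string & url) {
    if (url.empty()) {
        fprintf(stderr, "error: no model specified\n");
        return 1;
    }
    // ... perform the HTTP request and return its status ...
    return 0;
}

int main(int argc, char ** argv) {
    const std::string model = argc > 1 ? argv[1] : "";
    return download_resource(model); // propagate failure as the exit code
}
```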
Implemented ggml_sycl_op_soft_max() F16 src1 (mask) support, for which a pragma deprecation warning was added in #5021. To do this, it had to be decoupled from ggml_sycl_op_flatten, which always considered src1 to be of fp32 type (many OP functions depend on this).

* SYCL: SOFTMAX F16 mask support and other fixes
* test-backend-ops: Add F16 mask test cases
Signed-off-by: rare-magma <rare-magma@posteo.eu>
Signed-off-by: rare-magma <rare-magma@posteo.eu>
As pulling protocols to llama-run Signed-off-by: Eric Curtin <ecurtin@redhat.com>
…le instantiation bug (#11080)

This disables the workaround on rocBLAS fixed versions (>= 4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all Tensile objects.
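A hedged sketch of how such a version guard can look; whether llama.cpp gates it exactly this way, and the precise header paths, are assumptions.

```cpp
// Hedged sketch - header paths and macro names vary across rocBLAS installs.
#include <rocblas/rocblas.h>         // rocblas_initialize()
#include <rocblas/rocblas-version.h> // ROCBLAS_VERSION_MAJOR

static void maybe_apply_tensile_workaround() {
#if ROCBLAS_VERSION_MAJOR < 4
    // Affected versions: eagerly load all Tensile objects up front, at the
    // cost of startup time and extra VRAM.
    rocblas_initialize();
#endif
    // rocBLAS >= 4.0.0: the instantiation bug is fixed, so the workaround
    // and its runtime/VRAM cost are skipped.
}
```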
Loops with bounds not known at compile time cannot be unrolled. When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM can't unroll the loops here.
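The pattern in question, sketched in simplified form (not the actual kernel code): the column count arrives as a template parameter, with 0 meaning "unknown, use the runtime value", and only the non-zero case gives the compiler a constexpr trip count it can unroll.

```cpp
// ncols_template == 0 means the column count is only known at runtime, so
// the loop bound is not constexpr and the unroll pragma cannot take effect.
template <int ncols_template>
float row_sum(const float * x, const int ncols_param) {
    const int ncols = ncols_template == 0 ? ncols_param : ncols_template;

    float sum = 0.0f;
#pragma unroll // CUDA/clang-style hint; honored only for a constant bound
    for (int i = 0; i < ncols; ++i) {
        sum += x[i];
    }
    return sum;
}

// row_sum<256>(x, 256) -> fully unrollable; row_sum<0>(x, n) -> plain loop.
```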
* ci : fix build CPU arm64
* failed, trying ubuntu 22
* vulkan: ubuntu 24
* vulkan : jammy --> noble
The test_completion_stream_with_openai_library() function actually runs with stream=False by default, while test_completion_with_openai_library() runs with stream=True.
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
This commit enables the `--no-warmup` option for llama-embeddings. The motivation for this change is to allow the user to disable the warmup when running the program.
…(ggml/1065)

Some threads kept looping and failed to terminate properly after an abort during CPU execution.

Co-authored-by: issi <issi@gmail.com>
* Add option to not print stack on abort

Add an option/envvar to disable stack printing on abort. Also link some unit tests with Threads to fix link errors on ubuntu/g++11.

* Update ggml/src/ggml.c

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
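A hedged sketch of the mechanism; the actual option/envvar name added to ggml is not spelled out in the message, so GGML_NO_BACKTRACE below is an assumption.

```cpp
#include <cstdio>
#include <cstdlib>

// Check an opt-out environment variable before printing the stack on abort.
// GGML_NO_BACKTRACE is an assumed name, used here for illustration only.
static void print_backtrace_on_abort(void) {
    const char * opt_out = getenv("GGML_NO_BACKTRACE");
    if (opt_out != nullptr && opt_out[0] != '\0') {
        return; // user disabled stack printing
    }
    fprintf(stderr, "(stack trace would be printed here)\n");
}

int main() {
    print_backtrace_on_abort();
    return 0;
}
```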
…11436)

* vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

* vulkan: fix pipeline creation logging
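In Vulkan-Hpp, pipeline creation failures surface as exceptions (vk::SystemError); a sketch of catching them and reporting which pipeline failed, with illustrative names standing in for the real calls:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Stand-in for the actual pipeline creation call, which can fail for
// reasons like VK_ERROR_OUT_OF_DEVICE_MEMORY.
static void create_pipeline(const std::string & name) {
    // ... vkCreateComputePipelines would go here ...
    throw std::runtime_error("failed to create pipeline " + name); // simulated
}

int main() {
    try {
        create_pipeline("matmul_f16");
    } catch (const std::exception & e) {
        // Report the failing pipeline by name instead of crashing silently.
        std::cerr << "ggml_vulkan: " << e.what() << "\n";
        return 1;
    }
    return 0;
}
```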
* server : update auto gen files comments

This commit updates the 'auto generated files' comments in server.cpp and removes `deps.sh` from the comment. The motivation for this change is that `deps.sh` was removed in commit 91c36c2 ("server : (web ui) Various improvements, now use vite as bundler (#10599)").

* squash! server : update auto gen files comments [no ci]

Move comments about file generation to README.md.

* squash! server : update auto gen files comments [no ci]

Remove the comments in server.cpp that mention that information can be found in the README.md file.
* vulkan: initial support for IQ3_S
* vulkan: initial support for IQ3_XXS
* vulkan: initial support for IQ2_XXS
* vulkan: initial support for IQ2_XS
* vulkan: optimize Q3_K by removing branches
* vulkan: implement dequantize variants for coopmat2
* vulkan: initial support for IQ2_S
* vulkan: vertically realign code
* port failing dequant callbacks from mul_mm
* Fix array length mismatches
* vulkan: avoid using workgroup size before it is referenced
* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
…ja functionality (#11489)

* add /apply-template endpoint to server
* remove unnecessary line
* add /apply-template documentation
* return only "prompt" field in /apply-template
* use suggested idea instead of my overly verbose way
This commit updates some of the JSON snippets in the README.md file and removes the `json` language tag from the code blocks. The motivation for this change is that invalid JSON in a code snippet gets highlighted in red, which can make it somewhat difficult to read and a little distracting.
This commit replaces the two usages of `std::bind` with lambdas for the `callback_new_task` and `callback_update_slots` callback functions. The motivation for this change is consistency with the rest of the code in server.cpp (lambdas are used for all other callbacks/handlers). Lambdas are also arguably more readable, and they are recommended over `std::bind` in modern C++. Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
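A self-contained comparison of the two forms (the struct and member names below are illustrative, not the actual server.cpp signatures):

```cpp
#include <cstdio>
#include <functional>

struct server_queue {
    std::function<void(int)> callback_new_task;
};

struct server_context {
    void process_new_task(int id) { printf("processing task %d\n", id); }
};

int main() {
    server_context ctx;
    server_queue   queue;

    // Before: std::bind ties the member function to the instance via a
    // placeholder, which hides what is captured.
    queue.callback_new_task = std::bind(&server_context::process_new_task, &ctx, std::placeholders::_1);

    // After: an equivalent lambda - the captured state is explicit and the
    // body reads like a normal call.
    queue.callback_new_task = [&ctx](int id) { ctx.process_new_task(id); };

    queue.callback_new_task(42);
    return 0;
}
```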
…istral, Firefunction, DeepSeek) w/ lazy grammars (#9639)

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.1)
Can you help keep this open source service alive? 💖 Please sponsor : )