[pull] master from ggerganov:master #164

pull · 2025-01-02T10:12:03Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.1)

* slot.can_batch_with
* lora per request
* test: force disable cache prompt
* move can_batch_with check
* fix condition
* add slow test with llama 8b
* update docs
* move lora change task to queue
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* lora_base
* remove redundant check

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server/bench:
  - support openAI streaming standard output with [DONE]\n\n
  - export k6 raw results in csv
  - fix too many tcp idle connection in tcp_wait
  - add metric time to emit first token
* server/bench:
  - fix when prometheus not started
  - wait for server to be ready before starting bench
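The streaming item above refers to the OpenAI-style server-sent-event convention, in which each chunk arrives as a `data: <json>` line and the stream is terminated by `data: [DONE]` followed by a blank line. The following is only a sketch of how a client might count chunks and detect that terminator. `StreamSummary` and `summarize_stream` are hypothetical names for illustration and are not part of server/bench.

```cpp
#include <sstream>
#include <string>

// Hypothetical summary of one streamed completion (illustration only).
struct StreamSummary {
    int  chunks = 0;      // number of data events seen before the terminator
    bool done   = false;  // true once "data: [DONE]" was observed
};

// Scan a raw SSE response body line by line.
StreamSummary summarize_stream(const std::string & body) {
    StreamSummary out;
    std::istringstream in(body);
    std::string line;
    const std::string prefix = "data: ";
    while (std::getline(in, line)) {
        // Tolerate CRLF line endings.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        // Skip blank separator lines and anything that is not a data event.
        if (line.compare(0, prefix.size(), prefix) != 0) {
            continue;
        }
        const std::string payload = line.substr(prefix.size());
        if (payload == "[DONE]") {
            out.done = true;
            break;
        }
        out.chunks++;
    }
    return out;
}
```

A benchmark client could record a timestamp when `chunks` first becomes 1 to approximate the "time to emit first token" metric mentioned in the same list.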

* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF ggml-ci
* llama : arch (cont) ggml-ci
* llama : chat ggml-ci
* llama : model ggml-ci
* llama : hparams ggml-ci
* llama : adapter ggml-ci
* examples : fix ggml-ci
* rebase ggml-ci
* minor
* llama : kv cache ggml-ci
* llama : impl ggml-ci
* llama : batch ggml-ci
* cont ggml-ci
* llama : context ggml-ci
* minor
* llama : context (cont) ggml-ci
* llama : model loader ggml-ci
* common : update lora ggml-ci
* llama : quant ggml-ci
* llama : quant (cont) ggml-ci
* minor [no ci]

…ls (#11053)

* Disable KV cache shifting automatically for unsupported models instead of exiting directly

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update common/common.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
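The behaviour described in this commit message (fall back instead of exiting) can be sketched roughly as below. This is an illustration under assumptions: `Params`, `ModelCaps`, and `resolve_ctx_shift` are hypothetical stand-ins, not the actual types or functions in `common/common.cpp`.

```cpp
#include <cstdio>

// Hypothetical stand-ins for the runtime parameters and the model's capabilities.
struct Params {
    bool ctx_shift = true;      // user asked for KV cache (context) shifting
};

struct ModelCaps {
    bool kv_can_shift = false;  // whether this model supports shifting the KV cache
};

// Instead of aborting when shifting is unsupported, warn and turn the feature off.
// Returns true if the setting had to be changed.
bool resolve_ctx_shift(Params & params, const ModelCaps & caps) {
    if (params.ctx_shift && !caps.kv_can_shift) {
        std::fprintf(stderr, "warning: KV cache shifting is not supported by this model, disabling it\n");
        params.ctx_shift = false;
        return true;
    }
    return false;
}
```

The design choice is that an unsupported optional feature degrades to a warning, so the program can still run with the model.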

This commit attempts to improve the log message for the inputs of the splits in the `sched_print_assignments` function. Currently, a colon is displayed at the end of the line even when there are no inputs. That can be confusing when reading the output, because the lines below could be interpreted as inputs when they are in fact nodes. With this change the colon is only printed if there actually are inputs.
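The idea can be sketched as follows. The header format and the helper name `format_split_header` are assumptions made for illustration; the real `sched_print_assignments` writes to the log and its exact layout is not shown here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: build the header line for one split.
// The colon and the input list are appended only when there are inputs.
std::string format_split_header(int split_id,
                                const std::string & backend,
                                const std::vector<std::string> & inputs) {
    std::string line = "## SPLIT #" + std::to_string(split_id) + ": " + backend +
                       " # " + std::to_string(inputs.size()) + " inputs";
    if (!inputs.empty()) {
        line += ":";
        for (const auto & name : inputs) {
            line += " [" + name + "]";
        }
    }
    return line;
}
```

With no inputs the line ends after the count, so the node lines that follow are not mistaken for an input list.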

…#11047)

* Added init tensor calling code
* Added get_alloc_size forwarding
* Cleaned up and improved type/error handling.
* fix: remove trailing whitespaces.
* Cleanup and use GGML error logging functions.
* Handle potentially dangerous edge cases.
* Apply suggestions from code review

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type
* vocab : add DeepSeek V3 pre-tokenizer regexes
* unicode : handle ACCENT_MARK and SYMBOL categories in regex
* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

This commit renames the `batch` parameter to `ubatch` in the `llama_kv_cache_find_slot`, `llm_build_inp_embd`, and `llm_build_mamba` functions. The motivation for this is that this should have been done as part of Commit 19d900a ("llama : rename batch to ubatch (#9950)") but for some reason I missed these functions in that commit and only noticed them now (sorry).

* github : cmd line to bug report
* codeowners : (@ngxson) only watch dockerfile
* Apply suggestions from code review [no ci]

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* rm cmd in log output [no ci]
* rm 2 [no ci]
* no need backticks [no ci]

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

readme : add llama-swap to infrastructure section (#11032)

a45433b

* list llama-swap under tools in README
* readme: add llama-swap to Infrastructure

pull bot added the ⤵️ pull label Jan 2, 2025

github-actions bot added examples python server labels Jan 2, 2025

phymbert and others added 2 commits January 2, 2025 18:06

github-actions bot added the devops label Jan 3, 2025

metal : avoid uint (#11019)

e7da954

github-actions bot added ggml Apple Metal labels Jan 3, 2025

MollySophia and others added 2 commits January 3, 2025 14:13

fix: Vulkan shader gen binary path (#11037)

c31fc8b

github-actions bot added the Vulkan label Jan 4, 2025

danbev and others added 3 commits January 4, 2025 16:09

ggml : do not install metal source when embed library (ggml/1054)

5e3b08d

sync : ggml

78c6785

github-actions bot added the script label Jan 4, 2025

dranger003 and others added 5 commits January 4, 2025 16:33

llama : add support for the cohere2 model architecture (#10900)

46be942

Vulkan: Add device-specific blacklist for coopmat for the AMD proprie…

b56f079

…tary driver (#11074)

* Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver
* Add (TM) to AMD name check

CUDA: add BF16 support (#11093)

46e3556

* CUDA: add BF16 support

github-actions bot added the Nvidia GPU label Jan 6, 2025

ggerganov added 5 commits January 6, 2025 10:52

llama : use _impl suffix instead of _internal (#11060)

5047dd3

ggml-ci

llama : use LLAMA_TOKEN_NULL (#11062)

727368c

ggml-ci

mmap : fix fileno macro clash (#11076)

ae2f606

* mmap : fix fileno macro clash ggml-ci
* cont ggml-ci

tokenize : escape the prompt (#11058)

3e6e7a6

* tokenize : escape the prompt
* tokenize : update help

llama : update llama_model API names (#11063)

47182dd

* llama : deprecate llama_free_model, add llama_model_free ggml-ci
* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file` ggml-ci
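A rename of this kind is often introduced by adding the new names and keeping the old ones as thin deprecated wrappers, so existing callers keep compiling while getting a warning. The sketch below shows that general pattern with toy names (`toy_model`, `toy_model_load_from_file`, and so on are hypothetical). It does not reproduce `llama.h`, and the commit message above does not say whether the old load function is kept as a wrapper.

```cpp
#include <string>

// Toy stand-in for an opaque model handle (illustration only).
struct toy_model {
    std::string path;
};

// New-style names: the object ("model") comes first, then the verb.
toy_model * toy_model_load_from_file(const std::string & path) {
    return new toy_model{path};
}

void toy_model_free(toy_model * model) {
    delete model;
}

// Old-style names kept as deprecated wrappers that forward to the new ones.
[[deprecated("use toy_model_load_from_file instead")]]
inline toy_model * toy_load_model_from_file(const std::string & path) {
    return toy_model_load_from_file(path);
}

[[deprecated("use toy_model_free instead")]]
inline void toy_free_model(toy_model * model) {
    toy_model_free(model);
}
```

Callers migrate by swapping the function names; the arguments and ownership rules stay the same in this sketch.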

github-actions bot added the testing label Jan 6, 2025

danbev and others added 7 commits January 6, 2025 11:28

llama : prevent system info string accumulation across calls (#11101)

96a1dc2

llama : remove check flash_attn with lora (#11104)

09186fa

server : fix extra BOS in infill endpoint (#11106)

e6e7c75

* server : fix extra BOS in infill endpoint ggml-ci
* server : update infill tests

llama : remove unused headers (#11109)

ecebbd2

ggml-ci

llama-run : fix context size (#11094)

dc7cef9

Set `n_ctx` equal to `n_batch` in `Opt` class. Now context size is a more reasonable 2048.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
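As a rough illustration of the change described above, the following toy `Opt` struct derives the context size from the batch size. The field layout, the `finalize` method, and the assumption that the batch default is 2048 are all hypothetical; the real `Opt` class in llama-run is not shown here.

```cpp
// Toy version of an options holder (illustration only, not the llama-run class).
struct Opt {
    int n_batch = 2048;  // assumed default, matching the 2048 mentioned in the commit message
    int n_ctx   = 0;     // filled in by finalize()

    // Keep the context size in step with the batch size.
    void finalize() {
        n_ctx = n_batch;
    }
};
```

If a user overrides the batch size, calling `finalize()` afterwards keeps both values consistent.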

teleprint-me closed this Jan 6, 2025
