Update to latest #1

Andreybest · 2023-11-22T14:56:54Z

No description provided.

…-org#3776)

* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels


…#3793)

* Try cwd for ggml-metal if bundle lookup fails

  When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`, `server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]` returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of passing `null` as a path.

  Follows up on ggml-org#1782

* Update ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


* ggml : factor all quantization code in ggml-quants

  ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool

  ggml-ci
* quantize : --pure option for disabling k-quant mixtures

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>


…#3843)

* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear

  Use llama_kv_cache_clear for cache clearing.

  Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality.
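To illustrate the semantics described above, here is a toy Python model of a KV cache, not the actual C implementation. The `Cell` class, the helper names, and the convention that a negative `seq_id` matches any sequence and negative `p0`/`p1` mean an open-ended range are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One toy cache slot: a position plus the sequences that use it."""
    pos: int = -1                      # -1 marks a free cell
    seq_ids: set = field(default_factory=set)


def seq_rm(cells, seq_id, p0, p1):
    """Remove entries with p0 <= pos < p1 for one sequence, or for any if seq_id < 0."""
    # Assumed convention: negative bounds mean "from the start" / "to the end".
    if p0 < 0:
        p0 = 0
    if p1 < 0:
        p1 = float("inf")
    for cell in cells:
        if not (p0 <= cell.pos < p1):
            continue
        if seq_id < 0:
            cell.seq_ids.clear()       # match any sequence
        else:
            cell.seq_ids.discard(seq_id)
        if not cell.seq_ids:
            cell.pos = -1              # no owner left, so the cell becomes free


def clear(cells):
    """Drop everything, regardless of sequence or position."""
    for cell in cells:
        cell.pos = -1
        cell.seq_ids.clear()
```

With this shape, deleting by position for every sequence is `seq_rm(cells, -1, p0, p1)`, which is the kind of call the commit says replaces position-based uses of the old token-removal function.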

* ggml : move FP16 <-> FP32 stuff to ggml-impl.h

  ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_

  ggml-ci
* ggml-impl : move extern "C" to start of file

@kalomaze

…gml-org#3841)

* Introduce the new Min-P sampler by @kalomaze

  The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.

* Min-P enabled and set to 0.05 default

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
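The Min-P rule described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code merged in the PR: the function name `min_p_filter` and the `min_keep` guard are assumptions, and only the default `p = 0.05` comes from the commit message.

```python
import math


def min_p_filter(logits, p=0.05, min_keep=1):
    """Keep tokens whose probability is at least p times the top probability."""
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The cutoff scales with the most likely token's probability.
    threshold = p * max(probs)

    # Token ids sorted by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [i for i in order if probs[i] >= threshold]
    if len(kept) < min_keep:           # hypothetical safety net for this sketch
        kept = order[:min_keep]

    # Renormalize the surviving probabilities so they sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}
```

Because the cutoff is relative to the top token, a confident distribution prunes aggressively while a flat one keeps many candidates, which is the quality/variety balance the commit message refers to.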

* llama : factor out ggml-alloc from graph build functions

  ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)

  ggml-ci
* llama : offload rest of the models

  ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment

  ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (ggml-org#3848)
* llama : add llm_build_norm helper function

  ggml-ci
* llama : add llm_build_ffn helper function (ggml-org#3849)

  ggml-ci
* llama : add llm_build_k_shift helper

  ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper

  ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper

  ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names

  ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper


* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add

  When I tried CUDA offloading during finetuning following the readme, I got an assert here. This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora
* Add 'finetune.sh', which currently fails when using GPU

  "error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

* impl --log-new, --log-append
* Update common/log.h

  Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Update common/log.h

  Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Apply suggestions from code review

  Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

---------

Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>

* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


ggml-org#4074)

- introduces help entry for the argument
- cuts '--gpu-layers' form in order to simplify usage and documentation.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>


* build: support ppc64le build for make and CMake
* build: keep __POWER9_VECTOR__ ifdef and extend with __powerpc64__

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

…gml-org#4124)

* ggml-cuda.cu: Clean up warnings when compiling with clang
* ggml-cuda.cu: Move static items into anonymous namespace
* ggml-cuda.cu: Fix use of namespace start macro
* Revert "ggml-cuda.cu: Fix use of namespace start macro"

  This reverts commit 26c1149.
* Revert "ggml-cuda.cu: Move static items into anonymous namespace"

  This reverts commit e29757e.


Disabled rules:

* E203 Whitespace before ':' - disabled because we often use 'C' Style where values are aligned
* E211 Whitespace before '(' - disabled because we often use 'C' Style where values are aligned
* E221 Multiple spaces before operator - disabled because we often use 'C' Style where values are aligned
* E225 Missing whitespace around operator - disabled because it's broken so often it seems like a standard
* E231 Missing whitespace after ',', ';', or ':' - disabled because we often use 'C' Style where values are aligned
* E241 Multiple spaces after ',' - disabled because we often use 'C' Style where values are aligned
* E251 Unexpected spaces around keyword / parameter equals - disabled because it's broken so often it seems like a standard
* E261 At least two spaces before inline comment - disabled because it's broken so often it seems like a standard
* E266 Too many leading '#' for block comment - sometimes used as "section" separator
* E501 Line too long - disabled because it's broken so often it seems like a standard
* E701 Multiple statements on one line (colon) - broken only in convert.py when defining abstract methods (we can use `# noqa` instead)
* E704 Multiple statements on one line - broken only in convert.py when defining abstract methods (we can use `# noqa` instead)


ggerganov and others added 30 commits October 26, 2023 22:54

server : do not release slot on image input (ggml-org#3798)

34b2a5e

simple : fix batch handling (ggml-org#3803)

c8d6a1f

llama : correctly report GGUFv3 format (ggml-org#3818)

6d459cb

speculative : ensure draft and target model vocab matches (ggml-org#3812)

41aee4d

* speculative: Ensure draft and target model vocab matches
* Tolerate small differences when checking dft vs tgt vocab
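A rough Python sketch of what a tolerant vocabulary check between the target (tgt) and draft (dft) models could look like. The function name, the size tolerance of 100, and the choice to skip the first few token ids are assumptions for illustration; the commit message only says that small differences are tolerated.

```python
def vocabs_compatible(tgt_vocab, dft_vocab, max_size_diff=100, check_start_id=5):
    """Return True if two token lists are close enough for speculative decoding."""
    # Tolerate a small difference in vocabulary size (threshold is hypothetical).
    if abs(len(tgt_vocab) - len(dft_vocab)) > max_size_diff:
        return False

    # Compare token text over the shared id range, skipping the first few ids,
    # which this sketch assumes are special tokens that may legitimately differ.
    shared = min(len(tgt_vocab), len(dft_vocab))
    for token_id in range(check_start_id, shared):
        if tgt_vocab[token_id] != dft_vocab[token_id]:
            return False
    return True
```

The point of such a check is that draft tokens are only meaningful to the target model if the same id maps to the same text in both vocabularies.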

starcoder : add GPU offloading (ggml-org#3827)

fdee152

* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU

  ggml-ci

common : print that one line of the syntax help *also* to standard output (ggml-org#3823)

1774611

llama : add option for greedy sampling with probs (ggml-org#3813)

ee1a0ec

* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
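The temperature convention in the last bullet can be sketched as follows. This is a minimal Python illustration with a hypothetical `pick_token` helper, not the sampling code from the repository; it covers only the two greedy cases named in the commit message.

```python
import math


def pick_token(logits, temp):
    """Greedy pick; return (token_id, probs), where probs is None when temp == 0.0."""
    if temp > 0.0:
        raise NotImplementedError("stochastic sampling is outside this sketch")

    # Greedy choice: the index of the largest logit.
    best = max(range(len(logits)), key=lambda i: logits[i])
    if temp == 0.0:
        return best, None              # plain greedy, no probabilities computed

    # temp < 0.0: still greedy, but also compute softmax probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return best, [e / total for e in exps]
```

Both branches select the same token; the negative-temperature branch simply pays for a softmax so callers can inspect the probabilities.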

llama : allow quantizing k-quants to fall back when tensor size incompatible (ggml-org#3747)

bd6d9e2

* Allow quantizing k-quants to fall back when tensor size incompatible
* quantizing: Add warning when tensors were incompatible with k-quants

  Clean up k-quants state passing a bit

convert : ignore tokens if their IDs are within [0, vocab_size) (ggml-org#3831)

8a2f2fe
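As a small illustration of the rule in the commit title, the Python sketch below drops added tokens whose IDs already fall inside the base vocabulary range. The function name and the dict-of-token-to-id input shape are assumptions; the actual convert script may represent added tokens differently.

```python
def filter_added_tokens(added_tokens, vocab_size):
    """Keep only added tokens whose id lies outside the half-open range [0, vocab_size)."""
    return {
        token: token_id
        for token, token_id in added_tokens.items()
        if not (0 <= token_id < vocab_size)
    }
```

For example, with `vocab_size = 32000`, an entry such as `{"<s>": 1}` is ignored because ID 1 is already covered by the base vocabulary, while `{"<pad>": 32000}` is kept.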

issues : change label from bug to bug-unconfirmed (ggml-org#3748)

ba231e8

flake : update flake.lock for newer transformers version + provide extra dev shell (ggml-org#3797)

ff3bad8

* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts)

llama : fix kv shift bug (ggml-org#3835)

71a09da

ggml-ci

make : remove unnecessary dependency on build-info.h (ggml-org#3842)

2046eb4

flake.nix: fix for rocm 5.7 (ggml-org#3853)

07178c9

server : re-enable completion and embedded at the same time (ggml-org#3876)

ca190bc

scripts : add server-llm.sh (ggml-org#3868)

f0e2093

* scripts : add deploy-server.sh
* scripts : rename to server-llm.sh
* scripts : working curl pipe

ggml : fix UNUSED macro (ggml-org#3762)

9a3b4f6

sampling : null grammar field after reset (ggml-org#3885)

e75dfdd

llm : add llm_build_context (ggml-org#3881)

5033796

* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API

  ggml-ci
* llm : cleanup + comments

common : minor (ggml-org#3715)

ff8f9a8

slaren and others added 27 commits November 17, 2023 17:17

llama : add functions to get the model's metadata (ggml-org#4013)

e85bb1a

* llama : add functions to get the model's metadata
* format -> std::to_string
* better documentation

py : remove superfluous import statements (ggml-org#4076)

f7d5e97

Signed-off-by: Jiri Podivin <jpodivin@gmail.com> Co-authored-by: Jiri Podivin <jpodivin@redhat.com>

llava : fix compilation warning that fread return value is not used (ggml-org#4069)

c7cce12

common : improve yaml log escaping (ggml-org#4080)

9e87ef6

* logging: improve escaping in yaml output
* logging: include review feedback

py : Falcon HF compatibility (ggml-org#4104)

11173c9

Falcon HF compatibility

convert : use 'model' value if it exists. This allows karpathy/tinyllamas to load (ggml-org#4089)

2ab0707

Co-authored-by: Don Mahurin <@>

examples : add tokenize (ggml-org#4039)

2fa02b4

tokenize : fix trailing whitespace

5ad387e

llama : increase max nodes (ggml-org#4115)

bbecf3f

scripts : Remove missed baichuan convert script (ggml-org#4127)

0b5c3b0

tokenize example: Respect normal add BOS token behavior (ggml-org#4126)

28a2e6e

Allow building with Makefile

gguf-py : export chat templates (ggml-org#4125)

e937066

* gguf-py : export chat templates
* llama.cpp : escape new lines in gguf kv info prints
* gguf-py : bump version
* gguf-py : check chat_template type
* gguf-py : initialize chat_template

gitignore : tokenize

35985ac

common : comma should be semicolon (ggml-org#4137)

262005a

server : relay error messages (ggml-org#4131)

936c79b

finetune : add --n-gpu-layers flag info to --help (ggml-org#4128)

05e8301

Revert "finetune : add --n-gpu-layers flag info to --help (ggml-org#4128)"

dae06c0

This reverts commit 05e8301.

speculative : fix prompt tokenization in speculative example (ggml-org#4025)

40a34fe

* Support special tokens and not adding BOS to prompt in speculative
* Adapt to new should_add_bos function
* Ensure tgt and dft have same add_bos setting

main : Add ChatML functionality to main example (ggml-org#4046)

881800d

Co-authored-by: Sebastian Cramond <sebby37@users.noreply.github.com>

readme : update ROCm Windows instructions (ggml-org#4122)

dfc7cd4

* Update README.md
* Update README.md

  Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

finetune - update readme to mention llama support only (ggml-org#4148)

0b871f1

stablelm : simplify + speedup generation (ggml-org#4153)

8e672ef

Merge remote-tracking branch 'ggreganov/master' into update-to-latest

d1252f8

Andreybest requested a review from olexiyb November 22, 2023 14:56

olexiyb merged commit 0aff05d into sanctum Nov 23, 2023

olexiyb deleted the update-to-latest branch November 23, 2023 07:25
