[pull] master from ggerganov:master #165

Closed · wants to merge 72 commits

Commits (72)
c0d6f79
SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 (#1…
qnixsynapse Jan 7, 2025
a4dd490
rpc : code cleanup (#11107)
rgerganov Jan 7, 2025
a3d50bc
ggml-backend : only offload from host buffers (#11120)
slaren Jan 7, 2025
017cc5f
ggml-backend : only offload from host buffers (fix) (#11124)
slaren Jan 7, 2025
53ff6b9
GGUF: C++ refactor, backend support, misc fixes (#11030)
JohannesGaessler Jan 7, 2025
bec2183
fix: Vulkan shader gen binary path when Cross-compiling (#11096)
ag2s20150909 Jan 8, 2025
02f0430
Disable GL_KHR_cooperative_matrix Vulkan extension if not available. …
mbaudier Jan 8, 2025
0d52a69
ci : fix cmake option (#11125)
ggerganov Jan 8, 2025
8cef75c
llamafile : ppc64le MMA INT8 implementation (#10912)
amritahs-ibm Jan 8, 2025
a3c1232
arg : option to exclude arguments from specific examples (#11136)
ggerganov Jan 8, 2025
80ccf5d
ci : pin dependency to specific version (#11137)
ngxson Jan 8, 2025
c792dcf
ggml : allow loading backend with env variable (ggml/1059)
rgerganov Jan 5, 2025
99a3755
sync : ggml
ggerganov Jan 8, 2025
c07d437
llama : avoid hardcoded QK_K (#11061)
ggerganov Jan 8, 2025
4d2b3d8
lora : improve compat with `mergekit-extract-lora` (#11131)
ngxson Jan 8, 2025
f7cd133
ci : use actions from ggml-org (#11140)
ngxson Jan 8, 2025
1bf839b
Enhance user input handling for llama-run (#11138)
ericcurtin Jan 8, 2025
8a1d9c2
gguf-py : move scripts directory (#11116)
VJHack Jan 8, 2025
8d59d91
fix: add missing msg in static_assert (#11143)
hydai Jan 8, 2025
d9feae1
llama-chat : add phi 4 template (#11148)
ngxson Jan 9, 2025
be0e950
media : remove old img [no ci]
ggerganov Jan 9, 2025
f8feb4b
model: Add support for PhiMoE arch (#11003)
phymbert Jan 9, 2025
8eceb88
server : add tooltips to settings and themes btn (#11154)
danbev Jan 9, 2025
1204f97
doc: add cuda guide for fedora (#11135)
teihome Jan 9, 2025
c6860cc
SYCL: Refactor ggml_sycl_compute_forward (#11121)
qnixsynapse Jan 10, 2025
ee7136c
llama: add support for QRWKV6 model architecture (#11001)
MollySophia Jan 10, 2025
c3f9d25
Vulkan: Fix float16 use on devices without float16 support + fix subg…
0cc4m Jan 10, 2025
ff3fcab
convert : add --print-supported-models option (#11172)
danbev Jan 10, 2025
ba8a1f9
examples : add README.md to tts example [no ci] (#11155)
danbev Jan 10, 2025
2739a71
convert : sort print supported models [no ci] (#11179)
danbev Jan 11, 2025
c05e8c9
gguf-py: fixed local detection of gguf package (#11180)
VJHack Jan 11, 2025
afa8a9e
llama : add `llama_vocab`, functions -> methods, naming (#11110)
ggerganov Jan 12, 2025
08f10f6
llama : remove notion of CLS token (#11064)
ggerganov Jan 12, 2025
9a48399
llama : fix chat template gguf key (#11201)
ngxson Jan 12, 2025
924518e
Reset color before we exit (#11205)
ericcurtin Jan 12, 2025
1244cdc
ggml : do not define GGML_USE_CUDA when building with GGML_BACKEND_DL…
rgerganov Jan 13, 2025
8f70fc3
llama : remove 'd' from bad special token log (#11212)
danbev Jan 13, 2025
7426a26
contrib : add naming guidelines (#11177)
ggerganov Jan 13, 2025
00b4c3d
common : support tag-based --hf-repo like on ollama (#11195)
ngxson Jan 13, 2025
ca001f6
contrib : add naming guidelines (cont) (#11177)
ggerganov Jan 13, 2025
437e05f
server : (UI) Support for RTL text as models input or output (#11208)
ebraminio Jan 13, 2025
a29f087
contrib : add naming guidelines (cont) (#11177)
ggerganov Jan 13, 2025
39509fb
cuda : CUDA Graph Compute Function Refactor (precursor for performanc…
aendk Jan 13, 2025
84a4481
cli : auto activate conversation mode if chat template is available (…
ngxson Jan 13, 2025
504af20
server : (UI) Improve messages bubble shape in RTL (#11220)
ebraminio Jan 13, 2025
d00a80e
scripts : sync opencl
ggerganov Jan 14, 2025
48e1ae0
scripts : sync gguf
ggerganov Jan 14, 2025
a4f3f5d
scripts : sync gguf (cont)
ggerganov Jan 14, 2025
44d1e79
sync : ggml
ggerganov Jan 14, 2025
091592d
Refactor test-chat-template.cpp (#11224)
ochafik Jan 14, 2025
c5bf0d1
server : Improve code snippets direction between RTL text (#11221)
ebraminio Jan 14, 2025
bbf3e55
vocab : add dummy tokens for "no_vocab" type (#11231)
ggerganov Jan 14, 2025
b4d92a5
ci : add -no-cnv for tests (#11238)
ngxson Jan 14, 2025
f446c2c
SYCL: Add gated linear attention kernel (#11175)
qnixsynapse Jan 15, 2025
0ccd7f3
examples : add embd_to_audio to tts-outetts.py [no ci] (#11235)
danbev Jan 15, 2025
432df2d
RoPE: fix back, CUDA support for back + noncont. (#11240)
JohannesGaessler Jan 15, 2025
1d85043
fix: ggml: fix vulkan-shaders-gen build (#10448)
sparkleholic Jan 15, 2025
f11cfdf
ci : use -no-cnv in gguf-split tests (#11254)
ggerganov Jan 15, 2025
adc5dd9
vulkan: scale caching for k quants + misc fixes (#11081)
netrunnereve Jan 15, 2025
c67cc98
ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (#11227)
fj-y-saito Jan 16, 2025
681149c
llama : add `llama_model_load_from_splits` (#11255)
ngxson Jan 16, 2025
9c8dcef
CUDA: backwards pass for misc. ops, add tests (#11257)
JohannesGaessler Jan 16, 2025
4dbc8b9
llama : add internlm3 support (#11233)
RunningLeon Jan 16, 2025
206bc53
vulkan: optimize coopmat2 q2_k dequant function (#11130)
jeffbolznv Jan 16, 2025
466300f
vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (#11206)
jeffbolznv Jan 16, 2025
bd38dde
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (#11…
jeffbolznv Jan 16, 2025
7a689c4
README : added kalavai to infrastructure list (#11216)
musoles Jan 17, 2025
960ec65
llama : fix deprecation message: vocabable -> vocab (#11269)
dwrensha Jan 17, 2025
a133566
vocab : fix double-eos check (#11273)
ggerganov Jan 17, 2025
667d728
rpc : early register backend devices (#11262)
rgerganov Jan 17, 2025
3edfa7d
llama.android: add field formatChat to control whether to parse speci…
codezjx Jan 17, 2025
44e18ef
vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281)
jeffbolznv Jan 18, 2025
4 changes: 2 additions & 2 deletions .github/workflows/build.yml
@@ -665,7 +665,7 @@ jobs:
- build: 'llvm-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
- build: 'msvc-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-msvc.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DBUILD_SHARED_LIBS=O'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-msvc.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
- build: 'llvm-arm64-opencl-adreno'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON'

@@ -1237,7 +1237,7 @@ jobs:

- name: Create release
id: create_release
uses: anzz1/action-create-release@v1
uses: ggml-org/action-create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
3 changes: 1 addition & 2 deletions .github/workflows/docker.yml
@@ -97,10 +97,9 @@ jobs:
GITHUB_BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
GITHUB_REPOSITORY_OWNER: '${{ github.repository_owner }}'

# https://github.com/jlumbroso/free-disk-space/tree/54081f138730dfa15788a46383842cd2f914a1be#example
- name: Free Disk Space (Ubuntu)
if: ${{ matrix.config.free_disk_space == true }}
uses: jlumbroso/free-disk-space@main
uses: ggml-org/free-disk-space@v1.3.1
with:
# this might remove tools that are actually needed,
# if set to "true" but frees about 6 GB
4 changes: 3 additions & 1 deletion .github/workflows/editorconfig.yml
@@ -23,5 +23,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: editorconfig-checker/action-editorconfig-checker@main
- uses: editorconfig-checker/action-editorconfig-checker@v2
with:
version: v3.0.3
- run: editorconfig-checker
1 change: 1 addition & 0 deletions .gitignore
@@ -18,6 +18,7 @@
*.metallib
*.o
*.so
*.swp
*.tmp

# IDE / OS
6 changes: 6 additions & 0 deletions CODEOWNERS
@@ -3,3 +3,9 @@
/ci/ @ggerganov
/.devops/*.Dockerfile @ngxson
/examples/server/ @ngxson
/ggml/src/ggml-cuda/fattn* @JohannesGaessler
/ggml/src/ggml-cuda/mmq.* @JohannesGaessler
/ggml/src/ggml-cuda/mmv.* @JohannesGaessler
/ggml/src/ggml-cuda/mmvq.* @JohannesGaessler
/ggml/src/ggml-opt.cpp @JohannesGaessler
/ggml/src/gguf.cpp @JohannesGaessler
102 changes: 96 additions & 6 deletions CONTRIBUTING.md
@@ -1,10 +1,10 @@
# Pull requests (for contributors)

- Test your changes:
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If your PR becomes stale, don't hesitate to ping the maintainers in the comments

@@ -20,14 +20,104 @@
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy-looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Vertical alignment makes things more readable and easier to batch edit
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
- Naming usually optimizes for common prefix (see https://github.com/ggerganov/ggml/pull/302#discussion_r1243240963)
- Use sized integer types such as `int32_t` in the public API; `size_t` may also be appropriate for allocation sizes or byte offsets
- Declare structs with `struct foo {}` instead of `typedef struct foo {} foo`
- In C++ code omit optional `struct` and `enum` keyword whenever they are not necessary
```cpp
// OK
llama_context * ctx;
const llama_rope_type rope_type;

// not OK
struct llama_context * ctx;
const enum llama_rope_type rope_type;
```

_(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline.)_
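
As an illustrative aside (not part of this PR's diff), a minimal sketch of the sized-integer and `struct foo {}` guidelines above; the `llama_example_*` names are hypothetical:

```cpp
// Hypothetical declarations, for illustration only.
#include <stdint.h>
#include <stddef.h>

// OK: sized integer types in the public API; size_t for a byte count
int32_t llama_example_n_items(void);
size_t  llama_example_size_bytes(void);

// OK: declare the struct directly
struct llama_example_params {
    int32_t n_items;
};

// not OK: typedef struct llama_example_params { ... } llama_example_params;
```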

- Try to follow the existing patterns in the code (indentation, spaces, etc.). In case of doubt use `clang-format` to format the added code
- For anything not covered in the current guidelines, refer to the [C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines)
- Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
- Matrix multiplication is unconventional: [`C = ggml_mul_mat(ctx, A, B)`](https://github.com/ggerganov/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means $C^T = A B^T \Leftrightarrow C = B A^T.$

![matmul](media/matmul.png)
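
As another illustrative aside (not part of this PR's diff), a minimal C++ sketch of the row-major and `ggml_mul_mat` shape conventions described above, assuming the public `ggml.h` API:

```cpp
// Minimal sketch, for illustration only; error handling omitted.
#include "ggml.h"
#include <assert.h>

int main() {
    ggml_init_params params {};
    params.mem_size   = 16 * 1024 * 1024; // small scratch buffer for this example
    params.mem_buffer = nullptr;
    params.no_alloc   = false;

    ggml_context * ctx = ggml_init(params);

    // Row-major layout: ne[0] = number of columns, ne[1] = number of rows.
    // A: 4 columns x 3 rows, B: 4 columns x 2 rows.
    ggml_tensor * A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    ggml_tensor * B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);

    // ggml_mul_mat requires A->ne[0] == B->ne[0] and returns a tensor with
    // ne[0] == A->ne[1] and ne[1] == B->ne[1] (here: 3 columns x 2 rows),
    // i.e. mathematically C = B * A^T, matching the convention above.
    ggml_tensor * C = ggml_mul_mat(ctx, A, B);
    assert(C->ne[0] == 3 && C->ne[1] == 2);

    ggml_free(ctx);
    return 0;
}
```

Actually computing `C` would additionally require building and evaluating a `ggml_cgraph`; only the shape bookkeeping is shown here.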

# Naming guidelines

- Use `snake_case` for function, variable and type names
- Naming usually optimizes for longest common prefix (see https://github.com/ggerganov/ggml/pull/302#discussion_r1243240963)

```cpp
// not OK
int small_number;
int big_number;

// OK
int number_small;
int number_big;
```

- Enum values are always in upper case and prefixed with the enum name

```cpp
enum llama_vocab_type {
LLAMA_VOCAB_TYPE_NONE = 0,
LLAMA_VOCAB_TYPE_SPM = 1,
LLAMA_VOCAB_TYPE_BPE = 2,
LLAMA_VOCAB_TYPE_WPM = 3,
LLAMA_VOCAB_TYPE_UGM = 4,
LLAMA_VOCAB_TYPE_RWKV = 5,
};
```

- The general naming pattern is `<class>_<method>`, with `<method>` being `<action>_<noun>`

```cpp
llama_model_init(); // class: "llama_model", method: "init"
llama_sampler_chain_remove(); // class: "llama_sampler_chain", method: "remove"
llama_sampler_get_seed(); // class: "llama_sampler", method: "get_seed"
llama_set_embeddings(); // class: "llama_context", method: "set_embeddings"
llama_n_threads(); // class: "llama_context", method: "n_threads"
llama_adapter_lora_free(); // class: "llama_adapter_lora", method: "free"
```

- The `get` `<action>` can be omitted
- The `<noun>` can be omitted if not necessary
- The `_context` suffix of the `<class>` is optional. Use it to disambiguate symbols when needed
- Use `init`/`free` for constructor/destructor `<action>`

- Use the `_t` suffix when a type is supposed to be opaque to the user - it's not relevant to them if it is a struct or anything else

```cpp
typedef struct llama_context * llama_context_t;

enum llama_pooling_type llama_pooling_type(const llama_context_t ctx);
```

_(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline)_

- C/C++ filenames are all lowercase with dashes. Headers use the `.h` extension. Source files use the `.c` or `.cpp` extension
- Python filenames are all lowercase with underscores

- _(TODO: abbreviations usage)_

# Preprocessor directives

- _(TODO: add guidelines with examples and apply them to the codebase)_

```cpp
#ifdef FOO
#endif // FOO
```

# Documentation

- Documentation is a community effort
- When you need to look into the source code to figure out how to use an API consider adding a short summary to the header file for future reference
- When you notice incorrect or outdated documentation, please update it

# Resources

The Github issues, PRs and discussions contain a lot of information that can be useful to get familiar with the codebase. For convenience, some of the more important information is referenced from Github projects:
41 changes: 24 additions & 17 deletions README.md
@@ -69,6 +69,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [PhiMoE](https://github.com/ggerganov/llama.cpp/pull/11003)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
@@ -98,6 +99,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)

#### Multimodal
@@ -202,6 +204,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
- [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
- [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale

</details>

@@ -243,6 +246,8 @@ The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](htt
- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)

You can either manually download the GGUF file or directly use any `llama.cpp`-compatible model from Hugging Face with this CLI argument: `-hf <user>/<model>[:quant]`

After downloading a model, use the CLI tools to run it locally - see below.

`llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.
@@ -261,21 +266,12 @@ To learn more about model quantization, [read this documentation](examples/quant
#### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.

- <details open>
<summary>Run simple text completion</summary>

```bash
llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128

# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
```

</details>

- <details>
<summary>Run in conversation mode</summary>

Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`

```bash
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv
llama-cli -m model.gguf

# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
@@ -287,17 +283,28 @@ To learn more about model quantization, [read this documentation](examples/quant
</details>

- <details>
<summary>Run with custom chat template</summary>
<summary>Run in conversation mode with custom chat template</summary>

```bash
# use the "chatml" template
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
# use the "chatml" template (use -h to see the list of supported templates)
llama-cli -m model.gguf -cnv --chat-template chatml

# use a custom template
llama-cli -m model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
```

[Supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
</details>

- <details>
<summary>Run simple text completion</summary>

To disable conversation mode explicitly, use `-no-cnv`

```bash
llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv

# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
```

</details>
