llamalib - compiled python llama.cpp wrappers

Note: cyllama, the cython wrapper of llama.cpp and one of the three python wrappers in this project, has been spun off to its own repo, where more focused development will continue at a more frequent pace.

Development in this project has been frozen at llama.cpp tag b3982, and updates to this repo will be made much less frequently, if at all. The reason for the freeze is that keeping three different wrappers up to date with llama.cpp's frequent changes took too much time, so it was decided to focus on the cython wrapper.


This project provides three different python wrappers of @ggerganov's llama.cpp, which is likely the most active open-source compiled LLM inference engine. The wrapping frameworks used are cython, pybind11, and nanobind; all three wrappers are compiled and, in this project, statically linked against llama.cpp.

Development goals are to:

  • Stay up-to-date with bleeding-edge llama.cpp.

  • Produce a minimal, performant, compiled, thin python wrapper around the core llama-cli feature-set of llama.cpp.

  • Integrate and wrap llava-cli features.

  • Integrate and wrap features from related projects such as whisper.cpp and stable-diffusion.cpp.

  • Learn about the internals of this popular C++/C LLM inference engine along the way.

Given that there is a fairly mature, well-maintained, and performant ctypes-based wrapper provided by @abetlen's llama-cpp-python project, and that LLM inference is GPU-driven rather than CPU-driven, this may all seem quite redundant. Nonetheless, we anticipate some benefits to using compiled wrappers:

  • Packaging benefits with respect to self-contained statically compiled extension modules.

  • There may be some performance improvements in using compiled wrappers rather than ctypes.

  • It may be possible to incorporate external optimizations more readily into compiled wrappers.

  • It provides a basis for integration with other code written in the same wrapping framework (cython, pybind11, or nanobind).

  • It may be useful if one wants to de-couple the python frontend from the wrapper backend of an existing framework: for example, one could replace the ctypes wrapper in llama-cpp-python with one of the compiled python wrappers and contribute it back as a PR.

  • This is the most efficient way, for me at least, to learn about the underlying technologies.

Status

Development is done only on macOS to keep things simple. The following table provides an overview of the current wrapping/dev status:

| status                     | pbllama  | nbllama  | cyllama |
| -------------------------- | -------- | -------- | ------- |
| wrapper-type               | pybind11 | nanobind | cython  |
| wrap llama.h               | 1        | 1        | 1       |
| wrap high-level simple-cli | 1        | 1        | 1       |
| wrap low-level simple-cli  | 1        | 1        | 1       |
| wrap low-level llama-cli   | 0        | 0        | 0       |

The initial milestone for each wrapper type was to create a high-level wrapper of llama.cpp's simple.cpp example, followed by a low-level one. The high-level wrapper C++ code is placed in the llamalib.h single-header library, and wrapping is complete for all three frameworks. The final objective is to fully wrap the functionality of llama-cli for all three wrapper types.

Nonetheless, not all wrapping efforts proceed at an equal pace: the cython wrapper is typically the most advanced of the three, and it has been spun off to its own repo where development will continue.
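
For contrast with the low-level example further below, high-level usage is intended to be a one- or two-call affair. The sketch below is purely illustrative: the function name simple_prompt and its parameters are assumptions, not the actual cyllama API.

import cyllama as cy

# hypothetical high-level call: the name `simple_prompt` and its
# parameters are placeholders for illustration only
response = cy.simple_prompt(
    model="models/Llama-3.2-1B-Instruct-Q8_0.gguf",
    prompt="When did the universe begin?",
    n_predict=32,
)
print(response)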

The following is a relatively low-level example of the cython wrapper at work:

from pathlib import Path

import cyllama as cy


# set path to model
MODEL = str(Path.cwd() / "models" / "Llama-3.2-1B-Instruct-Q8_0.gguf")

# configure params & prompt
params = cy.CommonParams()
params.model = MODEL
params.prompt = "When did the universe begin?"
params.n_predict = 32
params.n_ctx = 2048
params.cpuparams.n_threads = 4

# total length of the sequence including the prompt
n_predict: int = params.n_predict

# init LLM
cy.llama_backend_init()
cy.llama_numa_init(params.numa)

# initialize the model
model_params = cy.common_model_params_to_llama(params)
model = cy.LlamaModel(path_model=params.model, params=model_params)

# initialize the context
ctx_params = cy.common_context_params_to_llama(params)
ctx = cy.LlamaContext(model=model, params=ctx_params)

# build sampler chain
sparams = cy.llama_sampler_chain_default_params()
sparams.no_perf = False
smplr = cy.LlamaSampler(sparams)
smplr.add_greedy()

# tokenize the prompt
tokens_list: list[int] = cy.common_tokenize(ctx, params.prompt, True)
n_ctx: int = ctx.n_ctx()
n_kv_req: int = len(tokens_list) + (n_predict - len(tokens_list))

print("n_predict = %d, n_ctx = %d, n_kv_req = %d" % (n_predict, n_ctx, n_kv_req))

if n_kv_req > n_ctx:
    raise SystemExit(
        "error: n_kv_req > n_ctx, the required KV cache size is not big enough\n"
        "either reduce n_predict or increase n_ctx."
    )

# print the prompt token-by-token
print()
prompt = ""
for i in tokens_list:
    prompt += cy.common_token_to_piece(ctx, i)
print(prompt)

# create a llama_batch with size 512
# we use this object to submit token data for decoding
batch = cy.LlamaBatch(n_tokens=512, embd=0, n_seq_max=1)

# evaluate the initial prompt
for i, token in enumerate(tokens_list):
    cy.common_batch_add(batch, token, i, [0], False)

# llama_decode will output logits only for the last token of the prompt
batch.set_last_logits_to_true()

ctx.decode(batch)

# main loop

n_cur: int = batch.n_tokens
n_decode: int = 0

t_main_start: int = cy.ggml_time_us()

result: str = ""

while n_cur <= n_predict:

    # sample the next token
    new_token_id = smplr.sample(ctx, batch.n_tokens - 1)
    smplr.accept(new_token_id)

    # is it an end of generation?
    if model.token_is_eog(new_token_id) or n_cur == n_predict:
        print()
        break

    result += cy.common_token_to_piece(ctx, new_token_id)

    # prepare the next batch
    cy.common_batch_clear(batch)

    # push this new token for next evaluation
    cy.common_batch_add(batch, new_token_id, n_cur, [0], True)

    n_decode += 1
    n_cur += 1

    # evaluate the current batch with the transformer model
    ctx.decode(batch)


print(result)
print()

t_main_end: int = cy.ggml_time_us()

print(
    "decoded %d tokens in %.2f s, speed: %.2f t/s"
    % (
        n_decode,
        (t_main_end - t_main_start) / 1000000.0,
        n_decode / ((t_main_end - t_main_start) / 1000000.0),
    )
)
print()

# cleanup
cy.llama_backend_free()

It goes without saying that any help / collaboration / contributions to accelerate the above would be welcome!

Setup

To build llamalib, you will need:

  1. A recent version of python3 (tested on python 3.12)

  2. cmake, which can be installed on macOS using homebrew with brew install cmake

  3. The python wrapping libraries (cython, pybind11, and nanobind), if you don't already have them. All python dependencies can be installed via pip install -r requirements.txt (feel free to use a virtualenv if you like).

With the above dependencies installed, to download and build the llamalib system, just type the following:

git clone https://github.com/shakfu/llamalib.git
cd llamalib
make

This will:

  1. Download and build llama.cpp
  2. Install it into bin, include, and lib in the cloned llamalib folder
  3. Build cyllama (cython wrapper)
  4. Build pbllama (pybind11 wrapper)
  5. Build nbllama (nanobind wrapper)
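
Once make completes, a quick way to confirm that the three extension modules built correctly is a short import smoke test (a minimal sketch; it assumes the built modules are importable from your current python environment):

# smoke test: try importing each of the three wrapper modules
for name in ("cyllama", "pbllama", "nbllama"):
    try:
        __import__(name)
        print(f"{name}: import ok")
    except ImportError as exc:
        print(f"{name}: not importable ({exc})")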

Testing

As a first step, you should download a smallish LLM in the .gguf format from huggingface. This document provides some examples of models which are known to work on a 16GB M1 Macbook Air.

A good model to start with is Llama-3.2-1B-Instruct-Q6_K.gguf. After downloading it, place the model in the llamalib/models folder and run:

bin/llama-simple -c 512 -n 512 -m models/Llama-3.2-1B-Instruct-Q6_K.gguf \
	-p "Is mathematics discovered or invented?"

Next, you will need pytest installed to run the tests:

pytest
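
For illustration, a minimal standalone test along these lines could also be dropped into the tests directory (a sketch, not one of the repo's own tests; it reuses the cyllama calls from the example above and assumes the Q6_K model has been downloaded into models/):

# test_model_load.py -- illustrative only; reuses the cyllama API shown
# in the example above and expects a downloaded model in models/
from pathlib import Path

import pytest
import cyllama as cy

MODEL = Path("models") / "Llama-3.2-1B-Instruct-Q6_K.gguf"


@pytest.mark.skipif(not MODEL.exists(), reason="model not downloaded")
def test_model_loads():
    cy.llama_backend_init()
    params = cy.CommonParams()
    params.model = str(MODEL)
    model_params = cy.common_model_params_to_llama(params)
    model = cy.LlamaModel(path_model=params.model, params=model_params)
    assert model is not None
    cy.llama_backend_free()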

If all tests pass, feel free to cd into the tests directory and run some of the examples directly:

cd tests && python3 cy_simple.py

TODO