Support for multi-modal models #813
Comments
That PR has been merged as of about an hour ago now.
@abetlen any thoughts on this?
I'm working on a strategy for supporting this by adding base64 image support to llama.cpp - this will mean the image can just be passed within the prompt. cf https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal
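For illustration only, here is a rough sketch of what "passing the image within the prompt" could look like from the Python side. The `<img src="base64,...">` tag and the `build_multimodal_prompt` helper are hypothetical; the actual syntax would be decided by the llama.cpp change.

```python
# Rough illustration of the "base64 image passed within the prompt" idea.
# The <img src="base64,..."> tag is a hypothetical convention, not the final syntax.
import base64

def build_multimodal_prompt(image_path: str, question: str) -> str:
    # Read the image and encode it as base64 so it can travel inside the prompt text.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return (
        'USER: <img src="base64,' + image_b64 + '">\n'
        + question + "\nASSISTANT:"
    )

prompt = build_multimodal_prompt("cat.png", "What is in this image?")
```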
I have the latest version, but importing the new symbols fails:

`from llama_cpp import (Llama, clip_model_load, llava_image_embed_make_with_filename, llava_image_embed_make_with_bytes,`

`ImportError: cannot import name 'clip_model_load' from 'llama_cpp'`
@marscod v0.2.11 doesn't have the latest merged changes yet. Try …, and then install with the latest merged changes: …
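Once a newer build is installed, a quick sanity check (a minimal sketch that only probes the symbols from the traceback above) would be:

```python
# Minimal sanity check: confirm the llava-related symbols mentioned in the
# traceback above are exported by the installed llama_cpp build.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)

for name in (
    "clip_model_load",
    "llava_image_embed_make_with_filename",
    "llava_image_embed_make_with_bytes",
):
    print(name, "available:", hasattr(llama_cpp, name))
```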
It hasn't been merged yet. #821
I followed the directions to install at that commit.
I talked to @ggerganov: what is missing is a Python API to run the CLIP model. There are plans for CLIP support natively in llama.cpp as a new architecture, in which case projects like llama-cpp-python can simply use the existing API.
@rlancemartin is that …
There was an attempt to implement a LLaVA API as part of the … Not sure if this approach (with the second …) is the way to go. Other than this, we don't have a specific issue for tracking this yet. I think @monatis would be the best person to say what it would take to support the CLIP arch straight into llama.cpp.
If we decide to implement the CLIP arch straight into llama.cpp, the required work depends on whether we want to support the full architecture or just an image encoder for multimodal models. In any case, I need to revisit the conversion script and finalize the key-value pairs and tensor names to be future-proof. For multimodal only: …

For the full architecture, additionally: …

All of these can take some time. I initially started to implement this in lmm.cpp, but I know from my clip.cpp experience that it takes time to actively support / manage a project solo. I also have other ideas to combine all embedding models (CLIP, BERT, E5 etc.) in a …

Conclusion: my suggestion is to go with @ggerganov's suggestion to build a …
@ggerganov I like that idea; as long as I can link to another shared library it's all good. This approach would also be a good path to support things like finetuning directly from the Python bindings without bloating the …
@abetlen ggerganov/llama.cpp#3613 is ready for testing -- does it cover your use case?
Just published v0.2.15 with multimodal support. Thank you so much @damian0815 and @monatis!
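For anyone landing here, a minimal usage sketch of the new multimodal support. The model and projector filenames below are placeholders, and the exact parameters should be double-checked against the README.

```python
# Minimal sketch of multimodal chat with llama-cpp-python >= 0.2.15.
# Assumes a LLaVA GGUF model plus its CLIP/mmproj projector file; paths are placeholders.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,       # larger context to leave room for the image embeddings
    logits_all=True,  # required by the llava chat handler
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant that describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        },
    ],
)
print(response["choices"][0]["message"]["content"])
```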
It seems llama.cpp now supports offloading the CLIP model to the GPU; will this be supported in llama-cpp-python?
I see llama.cpp is working on multi-modal models like LLaVA:
ggerganov/llama.cpp#3436

Model is here: …

Testing: …

Appears to add some new params: …

It would be awesome if we can support this in llama-cpp-python.