Support for multi-modal models #813

Closed
rlancemartin opened this issue Oct 11, 2023 · 15 comments

Comments

rlancemartin commented Oct 11, 2023

I see llama.cpp is working on multi-modal models like LLaVA:
ggerganov/llama.cpp#3436

Model is here:

2ab9be51b7dc737136b38093316a4d3577d1fb96281f1589adac7841f5b81c43  ../models/ggml-model-q5_k.gguf
b7c8ff0f58fca47d28ba92c4443adf8653f3349282cb8d9e6911f22d9b3814fe  ../models/mmproj-model-f16.gguf

Testing:

$ mkdir build && cd build && cmake ..
$ cmake --build .
$ ./bin/llava -m ../models/ggml-model-q5_k.gguf --mmproj ../models/mmproj-model-f16.gguf --image ~/Desktop/Papers/figure-3-1.jpg

Appears to add some new params:

--mmproj MMPROJ_FILE  path to a multimodal projector file for LLaVA. see examples/llava/README.md
--image IMAGE_FILE    path to an image file. use with multimodal models

It would be awesome if we could support this in llama-cpp-python.

rlancemartin changed the title from "Support for multi-modal modals" to "Support for multi-modal models" on Oct 12, 2023
Josh-XT (Contributor) commented Oct 12, 2023

That PR was merged about an hour ago.

rlancemartin (Author) commented:

@abetlen any thoughts on this?

damian0815 (Contributor) commented:

I'm working on a strategy for supporting this by adding base64 image support to llama.cpp - this would mean the image can just be passed within the prompt. cf. https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal
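
For illustration, a rough Python sketch of the base64 idea (the [img] tag syntax here is hypothetical; the actual prompt format llama.cpp would parse is not decided yet):

import base64

# Read the image from disk and encode it as base64 text.
with open("figure-3-1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# The encoded image is embedded directly in the prompt string, so no extra
# CLI flag or API parameter is needed to pass the file separately.
prompt = f"USER: [img]{image_b64}[/img]\nDescribe this figure.\nASSISTANT:"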

marscod commented Oct 15, 2023

I have the latest version 0.2.11, with llava built from llama.cpp, and I'm getting this error:

from llama_cpp import (Llama, clip_model_load, llava_image_embed_make_with_filename, llava_image_embed_make_with_bytes,
ImportError: cannot import name 'clip_model_load' from 'llama_cpp'

sagar-kris commented Oct 20, 2023

@marscod v0.2.11 doesn't have the latest merged changes yet, try

pip uninstall llama-cpp-python -y

and then to install with the latest merged changes

pip install -U --no-cache-dir llama-cpp-python@git+https://github.com/abetlen/llama-cpp-python.git@ef03d77b59718f7d422f24b15c653c2a57b087f3

Josh-XT (Contributor) commented Oct 20, 2023

> @marscod v0.2.11 doesn't have the latest merged changes yet, try
>
> pip uninstall llama-cpp-python -y
>
> and then to install with the latest merged changes
>
> pip install -U --no-cache-dir llama-cpp-python@git+https://github.com/abetlen/llama-cpp-python.git@ef03d77b59718f7d422f24b15c653c2a57b087f3

It hasn't been merged yet. #821

z3ugma commented Oct 30, 2023

I followed the directions to install at that commit (ef03d77b5). How do I specify the llava instance and pass the mmproj file as an argument when using it in Python code?

from llama_cpp import Llama
llm = Llama(model_path="/Users/fred/llama.cpp/models/ggml-model-q4_k.gguf.1", mmproj=)
   from llama_cpp import (
ImportError: cannot import name 'clip_model_load' from 'llama_cpp' (venv/lib/python3.11/site-packages/llama_cpp/__init__.py)

rlancemartin (Author) commented:

I talked to @ggerganov: what is missing is a Python API to run the CLIP model. There are plans to support CLIP natively in llama.cpp as a new architecture, in which case projects like llama-cpp-python can simply use the existing API.

z3ugma commented Oct 30, 2023

@rlancemartin is that CLIP support natively in llama.cpp tracked under a specific issue or PR in the llama.cpp repo? cc @ggerganov

ggerganov commented:

There was an attempt to implement a LLaVA API as part of the llama.cpp library here: ggerganov/llama.cpp#3613
But I don't really like the proposal, so I suggested temporarily building a second library as part of the llava example until we support CLIP natively in llama.cpp (ggerganov/llama.cpp#3613 (comment)).

Not sure if this approach (with the second llava lib) would work for llama-cpp-python though. On the llama.cpp side, I think it would be rather easy to implement - just move some files around using the linked PR as a starting point and add build steps for the llava (or llava.cpp) lib. And then I guess llama-cpp-python would have to bind to both the llama.cpp and llava libs.

Other than this, we don't have a specific issue for tracking this yet. I think @monatis would be the best person to say what it would take to support the CLIP arch directly in llama.cpp, as they implemented the current clip.cpp code.
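
To make the two-library binding idea concrete, a minimal ctypes sketch (the library filenames and load flags here are assumptions; the actual build outputs and symbol declarations may differ):

import ctypes

# Load the core llama shared library first so its symbols are visible
# to the second library.
libllama = ctypes.CDLL("libllama.so", mode=ctypes.RTLD_GLOBAL)

# Load the separate llava/clip helper library built from examples/llava.
libllava = ctypes.CDLL("libllava.so", mode=ctypes.RTLD_GLOBAL)

# Function signatures would then be declared against the llava header, e.g.:
# libllava.clip_model_load.restype = ctypes.c_void_p
# libllava.clip_model_load.argtypes = [ctypes.c_char_p, ctypes.c_int]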

monatis commented Nov 1, 2023

If we decide to implement the CLIP arch directly in llama.cpp, the required work depends on whether we want to support the full architecture or just an image encoder for multimodal models. In any case, I need to revisit the conversion script and finalize the key-value pairs and tensor names to be future-proof.

For multimodal only:

  1. Add image loading / preprocessing code. This will require linking stb-image.h into the llama lib -- not sure if we want this.
  2. Update the current inference code to reuse the existing functionality in the llama lib, e.g., model loading, offloading, etc.

For the full architecture, additionally:

  3. Implement CLIP tokenization, which is slightly different from the existing tokenization in llama.

This could all take some time. I initially started implementing it in lmm.cpp, but I know from my clip.cpp experience that it takes time to actively support and manage a project solo. I also have other ideas to combine all embedding models (CLIP, BERT, E5, etc.) into an embeddings library and then implement retrieval-augmented generation directly in C++ with this new lib plus llama. But (1) I'm not sure where the best place to implement it is (as an example in llama.cpp, a downstream project, etc.), and (2) there's the maintenance burden.

Conclusion: My suggestion is to go with @ggerganov's suggestion to build a llava lib as part of the examples in llama.cpp as a temporary solution, but gradually (and hopefully quickly) move towards the full embeddings / multimodal / RAG experience, as I can see that the community is more interested in such use cases than I initially thought. However, if this option does not work for your case, I'm also willing to implement it in a way that you can link to. Community interests are what matter.

abetlen (Owner) commented Nov 1, 2023

@ggerganov I like that idea, as long as I can link to another shared library it's all good.

This approach would also be a good path to supporting things like finetuning directly from the Python bindings without bloating the llama.h API surface.

monatis commented Nov 6, 2023

@abetlen ggerganov/llama.cpp#3613 is ready for testing -- does it cover your use case?

abetlen pinned this issue on Nov 6, 2023
abetlen (Owner) commented Nov 8, 2023

Just published v0.2.15 with multimodal support.

Thank you so much @damian0815 and @monatis!
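
For anyone landing here later, a minimal usage sketch with the new release (model paths are placeholders; this follows the LLaVA chat-handler API added for multimodal support, so check the project README for the exact current interface):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The mmproj (CLIP projector) file goes to the chat handler,
# the main GGUF model to Llama.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="ggml-model-q5_k.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # a larger context leaves room for the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/figure-3-1.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])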

abc2cba commented Jan 3, 2024

It seems llama.cpp now supports offloading the CLIP model to the GPU. Will this be supported in llama-cpp-python?
