Support for multi-modal models #813
Comments
That PR has been merged as of about an hour ago now.
@abetlen any thoughts on this?
I'm working on a strategy for supporting this by adding base64 image support to llama.cpp - this will mean the image can just be passed within the prompt. cf https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal
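For illustration only, here is a rough sketch of what "passing the image within the prompt" could look like from the Python side. The `<img src="base64,...">` tag and the `build_multimodal_prompt` helper are hypothetical; the actual syntax would be decided by the llama.cpp change.

```python
# Rough illustration of the "base64 image passed within the prompt" idea.
# The <img src="base64,..."> tag is a hypothetical convention, not the final syntax.
import base64

def build_multimodal_prompt(image_path: str, question: str) -> str:
    # Read the image and encode it as base64 so it can travel inside the prompt text.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return (
        'USER: <img src="base64,' + image_b64 + '">\n'
        + question + "\nASSISTANT:"
    )

prompt = build_multimodal_prompt("cat.png", "What is in this image?")
```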
I have the latest version, but importing the new symbols fails:

`from llama_cpp import (Llama, clip_model_load, llava_image_embed_make_with_filename, llava_image_embed_make_with_bytes,`

`ImportError: cannot import name 'clip_model_load' from 'llama_cpp'`
@marscod v0.2.11 doesn't have the latest merged changes yet. Try …, and then install with the latest merged changes: …
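Once a newer build is installed, a quick sanity check (a minimal sketch that only probes the symbols from the traceback above) would be:

```python
# Minimal sanity check: confirm the llava-related symbols mentioned in the
# traceback above are exported by the installed llama_cpp build.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)

for name in (
    "clip_model_load",
    "llava_image_embed_make_with_filename",
    "llava_image_embed_make_with_bytes",
):
    print(name, "available:", hasattr(llama_cpp, name))
```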
It hasn't been merged yet. #821
I followed the directions to install at that commit.
I talked to @ggerganov: what is missing is a Python API to run the CLIP model. There are plans for CLIP support natively in llama.cpp as a new architecture, in which case projects like llama-cpp-python can simply use the existing API.
@rlancemartin is that …
There was an attempt to implement a LLaVA API as part of the … Not sure if this approach (with the second …) is the way to go. Other than this, we don't have a specific issue for tracking this yet. I think @monatis would be the best person to say what it would take to support the CLIP arch straight into llama.cpp.
If we decide to implement the CLIP arch straight into llama.cpp, the required work depends on whether we want to support the full architecture or just an image encoder for multimodal models. In any case, I need to revisit the conversion script and finalize the key-value pairs and tensor names to be future-proof. For multimodal only: …

For the full architecture, additionally: …

All of these can take some time. I initially started to implement this in lmm.cpp, but I know from my clip.cpp experience that it takes time to actively support / manage a project solo. I also have other ideas to combine all embedding models (CLIP, BERT, E5 etc.) in a …

Conclusion: my suggestion is to go with @ggerganov's suggestion to build a …
@ggerganov I like that idea; as long as I can link to another shared library it's all good. This approach would also be a good path to support things like finetuning directly from the Python bindings without bloating the …
@abetlen ggerganov/llama.cpp#3613 is ready for testing -- does it cover your use case?
Just published v0.2.15 with multimodal support. Thank you so much @damian0815 and @monatis!
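For anyone landing here, a minimal usage sketch of the new multimodal support. The model and projector filenames below are placeholders, and the exact parameters should be double-checked against the README.

```python
# Minimal sketch of multimodal chat with llama-cpp-python >= 0.2.15.
# Assumes a LLaVA GGUF model plus its CLIP/mmproj projector file; paths are placeholders.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,       # larger context to leave room for the image embeddings
    logits_all=True,  # required by the llava chat handler
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant that describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        },
    ],
)
print(response["choices"][0]["message"]["content"])
```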
It seems llama.cpp now supports offloading the CLIP model to the GPU; will this be supported in llama-cpp-python?
I see llama.cpp is working on multi-modal models like LLaVA:
ggerganov/llama.cpp#3436

Model is here: …

Testing: …

Appears to add some new params: …

It would be awesome if we can support this in llama-cpp-python.