support Minicpm-omni in image understanding #11289
Conversation
Thank you for the feedback and for using llama.cpp. Note that the existing vision code in the examples is merely an example and it has many deficiencies. The long-term goal is to add multimodal capabilities to the core library. With that said, any additional insights from your work with it are welcome.
Thanks @tc-mb for the implementation. I'll have a look once I get the minicpm-v conversion to work correctly (I'm merging it with my own minicpm-v 2.6 work). Btw, just want to note that in the future the projector type would be better represented as an enum, something like:

enum clip_projector_type {
CLIP_PROJECTOR_TYPE_UNKNOWN,
CLIP_PROJECTOR_TYPE_MLP,
CLIP_PROJECTOR_TYPE_LDPV2,
CLIP_PROJECTOR_TYPE_MINICPMV_2_5,
CLIP_PROJECTOR_TYPE_MINICPMV_2_6,
CLIP_PROJECTOR_TYPE_MINICPMO_2_6, // to be added
};
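For illustration, here is a minimal sketch of how such an enum could be resolved from the projector-type string stored in the model metadata. It assumes the enum above; the helper name and the exact string values are placeholders of mine, not existing clip.cpp API:

#include <string>

// Hypothetical helper: map a projector-type string (e.g. read from GGUF
// metadata) to the proposed enum. The string values below are placeholders.
static clip_projector_type clip_projector_type_from_string(const std::string & name) {
    if (name == "mlp")          return CLIP_PROJECTOR_TYPE_MLP;
    if (name == "ldpv2")        return CLIP_PROJECTOR_TYPE_LDPV2;
    if (name == "minicpmv-2.5") return CLIP_PROJECTOR_TYPE_MINICPMV_2_5;
    if (name == "minicpmv-2.6") return CLIP_PROJECTOR_TYPE_MINICPMV_2_6;
    if (name == "minicpmo-2.6") return CLIP_PROJECTOR_TYPE_MINICPMO_2_6;
    return CLIP_PROJECTOR_TYPE_UNKNOWN;
}

The loader could then switch on the enum value rather than comparing raw version numbers.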
@ggerganov Thank you for your reply.
Cool, thanks.
The first step towards this is to make a proper refactoring of the existing vision code in the examples.
@tc-mb I'm having some trouble with my implementation of minicpm-v 2.6 #11292. Currently, I'm able to convert the model to GGUF and encode one single image patch. However, because the image is sliced into multiple patches, I'm not sure how the slices should be assembled with the required special tokens.

Do you have a communication channel so I can reach out to you directly? (For example, email / slack / etc.) Thank you.
Yes, this is indeed a bit more complicated. After slicing the image, we need to add some special tokens for segmentation. I will help you fix this.
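For context, a rough sketch of what such a segmentation layout could look like: the overview embedding wrapped in image markers, followed by each slice wrapped in slice markers, row by row. The token strings (<image>, </image>, <slice>, </slice>) and the exact layout are illustrative assumptions and may not match the PR's actual implementation:

#include <string>
#include <vector>

// Illustrative only: describe the interleaving of special-token text and
// image embeddings for a sliced image. embed_id == -1 means "plain text";
// otherwise it is the index of the embedding to evaluate at that position.
struct image_segment {
    std::string text;
    int         embed_id;
};

static std::vector<image_segment> layout_sliced_image(int n_rows, int n_cols) {
    std::vector<image_segment> segs;
    segs.push_back({"<image>",  -1});
    segs.push_back({"",          0});           // overview embedding first
    segs.push_back({"</image>", -1});
    int id = 1;
    for (int r = 0; r < n_rows; r++) {
        for (int c = 0; c < n_cols; c++) {
            segs.push_back({"<slice>",  -1});
            segs.push_back({"",        id++});  // slice embeddings, row-major
            segs.push_back({"</slice>", -1});
        }
        segs.push_back({"\n", -1});             // row separator
    }
    return segs;
}

In an actual integration, each segment would either be tokenized and evaluated as text or routed to the image-embedding evaluation path, depending on whether embed_id is set.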
* init
* add readme
* update readme
* no use make
* update readme
* update fix code
* fix editorconfig-checker
* no change convert py
* use clip_image_u8_free
Hello, we are the team behind the MiniCPM-V series, and we have also adapted all our previous models to the llama.cpp framework.
This week we launched MiniCPM-o 2.6. We describe the model as "A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone", but it is not easy for everyone to actually get the full omni experience on their own end-side devices. We hope to adapt MiniCPM-omni to an efficient inference framework such as llama.cpp, and we also hope the llama.cpp team will be interested in merging MiniCPM-omni support. We believe this will also help to further expand the influence of llama.cpp.
In the past few months, we have modified many parts of llama.cpp and used it as an inference framework so that our model can run on an iPad; there is a video example. I am still organizing our code and will submit PRs to add these capabilities in the near future. This PR only adds support for MiniCPM-omni's image understanding capabilities.