
support Minicpm-omni in image understanding #11289

Merged
merged 9 commits into ggerganov:master from the minicpm-omni branch on Jan 22, 2025

Conversation

tc-mb
Contributor

@tc-mb tc-mb commented Jan 18, 2025

Hello, we are the team behind the MiniCPM-V series, and we have previously adapted all of our models to the llama.cpp framework.

This week we launched MiniCPM-o 2.6, which we describe as "A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone". However, it is not easy for everyone to actually get the full omni experience on their own end-side devices. We would like to adapt minicpm-omni to an efficient inference framework such as llama.cpp, and we hope the llama.cpp team will be interested in merging minicpm-omni support. We believe this will also help further expand the influence of llama.cpp.

Over the past few months, we have modified many parts of llama.cpp and used it as our inference framework so that the model can run on an iPad; there is a video example. I am still organizing that code and will submit a PR adapting those features in the near future. This PR only adds support for MiniCPM-omni's image understanding capabilities.

@github-actions github-actions bot added the examples and python (python script changes) labels on Jan 18, 2025
@ggerganov
Owner

Thank you for the feedback and for using llama.cpp for the Minicpm-omni models. Your work is very useful as an initial PoC and as a reference for how to implement multimodal support in the project.

Note that the existing vision code in the examples is merely an example and it has many deficiencies. The long-term goal is to add multimodal capabilities to the core libllama and provide a more robust and efficient implementation for vision and audio models. This hasn't been a primary focus of the project, mainly because it requires engineering resources to design the software architecture properly to support it. As the developer community around the project grows, I'm hoping that we will find the necessary resources and be able to implement multimodal support natively in the project. There are already ongoing efforts in this direction: #11292.

With that said, any additional insights from your work with llama.cpp are welcome. Please note that the code in the existing vision examples is far from optimal and most of the proposed changes are merged without extensive reviews and without plans to maintain them.

@ngxson
Collaborator

ngxson commented Jan 20, 2025

Thanks @tc-mb for the implementation. I'll have a look once I've got the minicpm-v conversion working correctly (I'm merging it into convert_hf_to_gguf.py).

Btw, just want to note that in the future, has_minicpmv_projector will be replaced by clip_projector_type:

enum clip_projector_type {
    CLIP_PROJECTOR_TYPE_UNKNOWN,
    CLIP_PROJECTOR_TYPE_MLP,
    CLIP_PROJECTOR_TYPE_LDPV2,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_5,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_6,
    CLIP_PROJECTOR_TYPE_MINICPMO_2_6, // to be added
};
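
As an illustration of what that change could look like at call sites, here is a minimal sketch assuming a small helper maps the new enum back to the minicpmv_version numbers used by the current example code; the helper name and the exact version mapping are assumptions, not part of this PR:

// Sketch only: hypothetical helper, assuming minicpmv_version numbering of
// 2 = MiniCPM-V 2.5, 3 = MiniCPM-V 2.6, 4 = MiniCPM-o 2.6.
static int clip_minicpmv_version_from_projector(enum clip_projector_type type) {
    switch (type) {
        case CLIP_PROJECTOR_TYPE_MINICPMV_2_5: return 2;
        case CLIP_PROJECTOR_TYPE_MINICPMV_2_6: return 3;
        case CLIP_PROJECTOR_TYPE_MINICPMO_2_6: return 4; // to be added
        default:                               return 0; // not a MiniCPM-V projector
    }
}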

@tc-mb
Contributor Author

tc-mb commented Jan 21, 2025

@ggerganov Thank you for your reply.
As you said, the current examples are not enough to support further modifications, and I know the minicpmv code already merged into llama.cpp is not particularly elegant.
The omni code I have prepared covers not only vision but also audio and speech, and the inference pipeline is split into ordinary inference and streaming inference. This is too much to submit in a single PR, so in the next PR I plan to create an omni folder to support the basic omni features.
Just as I followed your suggestions before and re-integrated the minicpm-vision code, I am very willing to hear your suggestions on how to structure the omni architecture. Of course, we can discuss that after the omni-capability PR is submitted, and it will be no problem to revise it then.
In this PR, however, I reused the previous code with as few modifications as possible so that the community can run the image understanding capability of minicpm-o 2.6. I suggest treating this PR as a simple version upgrade of minicpm-vision, so it can be merged without too much burden.
The new vision API integration you mentioned sounds great and reflects llama.cpp's vitality; I plan to help with the minicpm-vision part.

@tc-mb
Contributor Author

tc-mb commented Jan 21, 2025

Thanks @tc-mb for the implementation. I'll have a look once I've got the minicpm-v conversion working correctly (I'm merging it into convert_hf_to_gguf.py).

Btw, just want to note that in the future, has_minicpmv_projector will be replaced by clip_projector_type:

enum clip_projector_type {
    CLIP_PROJECTOR_TYPE_UNKNOWN,
    CLIP_PROJECTOR_TYPE_MLP,
    CLIP_PROJECTOR_TYPE_LDPV2,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_5,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_6,
    CLIP_PROJECTOR_TYPE_MINICPMO_2_6, // to be added
};

Cool, thanks.
I was busy with minicpm-omni a while ago, so I didn't notice that you've been making a lot of changes. I'll study your new refactoring this week and see whether I can contribute something to that work. Since I don't keep up with llama.cpp all the time, feel free to @ me directly when you need help. ^_^

@ggerganov
Owner

I am very willing to hear your suggestions on how to structure the omni architecture.

The first step towards this is a proper refactoring of libllama so that we can introduce new logic in llama_context more easily. The llama_batch API should also be improved to become more generic and multimodal-oriented. Details are still being discussed and we don't have a detailed roadmap yet, but there are already some ongoing discussions to follow.
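
For illustration only, here is a purely hypothetical sketch of what a more generic, multimodal-oriented batch entry could look like; none of these names exist in llama.h, and the real design is still under discussion:

// Hypothetical sketch -- not an actual or planned libllama API.
// Idea: a batch entry carries either a text token or a pre-computed embedding
// from another modality, so one decode path can consume mixed-modality input.
enum mm_input_type {
    MM_INPUT_TEXT_TOKEN,
    MM_INPUT_IMAGE_EMBD,
    MM_INPUT_AUDIO_EMBD,
};

struct mm_batch_item {
    enum mm_input_type type;
    int32_t            token;  // valid when type == MM_INPUT_TEXT_TOKEN
    const float      * embd;   // valid for embedding inputs, length n_embd
    int32_t            pos;    // position in the sequence
    int32_t            seq_id; // sequence this entry belongs to
};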

@ggerganov ggerganov merged commit 3e3357f into ggerganov:master Jan 22, 2025
47 checks passed
@ngxson
Collaborator

ngxson commented Jan 22, 2025

@tc-mb I'm having some trouble with my implementation of minicpm-v 2.6 in #11292

Currently, I'm able to convert the model to GGUF and encode a single image patch. However, because the <slices> part isn't working yet, the model behaves quite strangely. Before proceeding further, I would like to ask you about a few points to debug it better.

Do you have a communication channel so I can reach out to you directly? (For example, email / Slack / etc.) Thank you.

@tc-mb
Contributor Author

tc-mb commented Jan 23, 2025

@tc-mb I'm having some trouble with my implementation of minicpm-v 2.6 in #11292

Currently, I'm able to convert the model to GGUF and encode a single image patch. However, because the <slices> part isn't working yet, the model behaves quite strangely. Before proceeding further, I would like to ask you about a few points to debug it better.

Do you have a communication channel so I can reach out to you directly? (For example, email / Slack / etc.) Thank you.

Yes, this part is indeed a bit more complicated: after slicing the image, we need to add some special tokens to delimit the slices. I will help you fix this.
My email address is caitianchi@modelbest.cn; you can email me at any time, but due to the time difference I may reply a little later.
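
As a rough illustration of the special tokens mentioned above, the sketch below assembles a prompt layout in which the overview image and each slice are wrapped in marker tokens and slice rows are separated by newlines; the token strings, the layout, and the function itself are assumptions to be checked against the MiniCPM reference implementation, not code from this PR:

#include <string>

// Sketch only: "<image>", "</image>", "<slice>", "</slice>" and the row layout
// are assumed; the spaces stand in for the per-patch embedding placeholders.
static std::string build_sliced_image_prompt(int n_rows, int n_cols, int n_patch_tokens) {
    const std::string patches(n_patch_tokens, ' ');         // placeholder for patch embeddings
    std::string prompt = "<image>" + patches + "</image>";  // overview image first
    if (n_rows > 0 && n_cols > 0) {
        prompt += "<slice>";
        for (int r = 0; r < n_rows; ++r) {
            for (int c = 0; c < n_cols; ++c) {
                prompt += "<image>" + patches + "</image>"; // one wrapped slice per grid cell
            }
            prompt += "\n";                                  // assumed row separator
        }
        prompt += "</slice>";
    }
    return prompt;
}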

@tc-mb tc-mb deleted the minicpm-omni branch January 23, 2025 08:36
@tc-mb tc-mb restored the minicpm-omni branch January 23, 2025 08:36
anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025
* init

* add readme

* update readme

* no use make

* update readme

* update fix code

* fix editorconfig-checker

* no change convert py

* use clip_image_u8_free