
support Minicpm-omni in image understanding #11289

Merged
merged 9 commits into ggerganov:master from the minicpm-omni branch on Jan 22, 2025

Conversation

tc-mb
Contributor

@tc-mb tc-mb commented Jan 18, 2025

Hello, we are the team behind the MiniCPM-V series, and we have previously adapted all of our models to the llama.cpp framework.

This week we launched MiniCPM-o 2.6, which we describe as "A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone". However, it is not easy for everyone to actually get the full omni experience on their own end-side devices. We would like to adapt minicpm-omni to an efficient inference framework such as llama.cpp, and we hope the llama.cpp team will be interested in merging minicpm-omni support. We believe this will also help further expand the influence of llama.cpp.

Over the past few months, we have modified many parts of llama.cpp and used it as our inference framework so that the model can run on an iPad; there is a video example. I am still organizing that code and will submit a PR adapting those features in the near future. This PR only adds support for MiniCPM-omni's image understanding capabilities.

@github-actions github-actions bot added the examples and python (python script changes) labels on Jan 18, 2025
@ggerganov
Owner

Thank you for the feedback and for using llama.cpp for the Minicpm-omni models. Your work is very useful as an initial PoC and as a reference for how to implement multimodal support in the project.

Note that the existing vision code in the examples is merely an example and it has many deficiencies. The long-term goal is to add multimodal capabilities to the core libllama and provide a more robust and efficient implementation for vision and audio models. This hasn't been a primary focus of the project, mainly because it requires engineering resources to design the software architecture properly to support it. As the developer community around the project grows, I'm hoping that we will find the necessary resources and be able to implement multimodal support natively in the project. There are already ongoing efforts in this direction: #11292.

With that said, any additional insights from your work with llama.cpp are welcome. Please note that the code in the existing vision examples is far from optimal and most of the proposed changes are merged without extensive reviews and without plans to maintain them.

@ngxson
Collaborator

ngxson commented Jan 20, 2025

Thanks @tc-mb for the implementation. I'll have a look once I've got the minicpm-v conversion working correctly (I'm merging it into convert_hf_to_gguf.py).

Btw, just want to note that in the future, has_minicpmv_projector will be replaced by clip_projector_type:

enum clip_projector_type {
    CLIP_PROJECTOR_TYPE_UNKNOWN,
    CLIP_PROJECTOR_TYPE_MLP,
    CLIP_PROJECTOR_TYPE_LDPV2,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_5,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_6,
    CLIP_PROJECTOR_TYPE_MINICPMO_2_6, // to be added
};
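
As an illustration of what that change could look like at call sites, here is a minimal sketch assuming a small helper maps the new enum back to the minicpmv_version numbers used by the current example code; the helper name and the exact version mapping are assumptions, not part of this PR:

// Sketch only: hypothetical helper, assuming minicpmv_version numbering of
// 2 = MiniCPM-V 2.5, 3 = MiniCPM-V 2.6, 4 = MiniCPM-o 2.6.
static int clip_minicpmv_version_from_projector(enum clip_projector_type type) {
    switch (type) {
        case CLIP_PROJECTOR_TYPE_MINICPMV_2_5: return 2;
        case CLIP_PROJECTOR_TYPE_MINICPMV_2_6: return 3;
        case CLIP_PROJECTOR_TYPE_MINICPMO_2_6: return 4; // to be added
        default:                               return 0; // not a MiniCPM-V projector
    }
}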

@tc-mb
Contributor Author

tc-mb commented Jan 21, 2025

@ggerganov Thank you for your reply.
As you said, the current examples are not enough to support further modifications, and I know the minicpmv code already merged into llama.cpp is not particularly elegant.
The omni code I have prepared covers not only vision but also audio and speech, and the inference pipeline is split into ordinary inference and streaming inference. This is too much to submit in a single PR, so in the next PR I plan to create an omni folder to support the basic omni features.
Just as I followed your suggestions before and re-integrated the minicpm-vision code, I am very willing to hear your suggestions on how to structure the omni architecture. Of course, we can discuss that after the omni-capability PR is submitted, and it will be no problem to revise it then.
In this PR, however, I reused the previous code with as few modifications as possible so that the community can run the image understanding capability of minicpm-o 2.6. I suggest treating this PR as a simple version upgrade of minicpm-vision, so it can be merged without too much burden.
The new vision API integration you mentioned sounds great and reflects llama.cpp's vitality; I plan to help with the minicpm-vision part.

@tc-mb
Contributor Author

tc-mb commented Jan 21, 2025

Thanks @tc-mb for the implementation. I'll have a look once I've got the minicpm-v conversion working correctly (I'm merging it into convert_hf_to_gguf.py).

Btw, just want to note that in the future, has_minicpmv_projector will be replaced by clip_projector_type:

enum clip_projector_type {
    CLIP_PROJECTOR_TYPE_UNKNOWN,
    CLIP_PROJECTOR_TYPE_MLP,
    CLIP_PROJECTOR_TYPE_LDPV2,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_5,
    CLIP_PROJECTOR_TYPE_MINICPMV_2_6,
    CLIP_PROJECTOR_TYPE_MINICPMO_2_6, // to be added
};

Cool, thanks.
I was busy with minicpm-omni a while ago, so I didn't notice that you've been making a lot of changes. I'll study your new refactoring this week and see whether I can contribute something to that work. Since I don't keep up with llama.cpp all the time, feel free to @ me directly when you need help. ^_^

@ggerganov
Owner

I am very willing to hear your suggestions on how to structure the omni architecture.

The first step towards this is a proper refactoring of libllama so that we can introduce new logic in llama_context more easily. The llama_batch API should also be improved to become more generic and multimodal-oriented. Details are still being discussed and we don't have a detailed roadmap yet, but there are already some ongoing discussions to follow.
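
For illustration only, here is a purely hypothetical sketch of what a more generic, multimodal-oriented batch entry could look like; none of these names exist in llama.h, and the real design is still under discussion:

// Hypothetical sketch -- not an actual or planned libllama API.
// Idea: a batch entry carries either a text token or a pre-computed embedding
// from another modality, so one decode path can consume mixed-modality input.
enum mm_input_type {
    MM_INPUT_TEXT_TOKEN,
    MM_INPUT_IMAGE_EMBD,
    MM_INPUT_AUDIO_EMBD,
};

struct mm_batch_item {
    enum mm_input_type type;
    int32_t            token;  // valid when type == MM_INPUT_TEXT_TOKEN
    const float      * embd;   // valid for embedding inputs, length n_embd
    int32_t            pos;    // position in the sequence
    int32_t            seq_id; // sequence this entry belongs to
};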

@ggerganov ggerganov merged commit 3e3357f into ggerganov:master Jan 22, 2025
47 checks passed
@ngxson
Collaborator

ngxson commented Jan 22, 2025

@tc-mb I'm having some trouble with my implementation of minicpm-v 2.6 in #11292

Currently, I'm able to convert the model to GGUF and encode a single image patch. However, because the <slices> part isn't working yet, the model behaves quite strangely. Before proceeding further, I would like to ask you about a few points to debug it better.

Do you have a communication channel so I can reach out to you directly? (For example, email / Slack / etc.) Thank you.

@tc-mb
Contributor Author

tc-mb commented Jan 23, 2025

@tc-mb I'm having some trouble with my implementation of minicpm-v 2.6 in #11292

Currently, I'm able to convert the model to GGUF and encode a single image patch. However, because the <slices> part isn't working yet, the model behaves quite strangely. Before proceeding further, I would like to ask you about a few points to debug it better.

Do you have a communication channel so I can reach out to you directly? (For example, email / Slack / etc.) Thank you.

Yes, this part is indeed a bit more complicated: after slicing the image, we need to add some special tokens to delimit the slices. I will help you fix this.
My email address is caitianchi@modelbest.cn; you can email me at any time, but due to the time difference I may reply a little later.
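
As a rough illustration of the special tokens mentioned above, the sketch below assembles a prompt layout in which the overview image and each slice are wrapped in marker tokens and slice rows are separated by newlines; the token strings, the layout, and the function itself are assumptions to be checked against the MiniCPM reference implementation, not code from this PR:

#include <string>

// Sketch only: "<image>", "</image>", "<slice>", "</slice>" and the row layout
// are assumed; the spaces stand in for the per-patch embedding placeholders.
static std::string build_sliced_image_prompt(int n_rows, int n_cols, int n_patch_tokens) {
    const std::string patches(n_patch_tokens, ' ');         // placeholder for patch embeddings
    std::string prompt = "<image>" + patches + "</image>";  // overview image first
    if (n_rows > 0 && n_cols > 0) {
        prompt += "<slice>";
        for (int r = 0; r < n_rows; ++r) {
            for (int c = 0; c < n_cols; ++c) {
                prompt += "<image>" + patches + "</image>"; // one wrapped slice per grid cell
            }
            prompt += "\n";                                  // assumed row separator
        }
        prompt += "</slice>";
    }
    return prompt;
}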

@tc-mb tc-mb deleted the minicpm-omni branch January 23, 2025 08:36
@tc-mb tc-mb restored the minicpm-omni branch January 23, 2025 08:36
anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025
* init

* add readme

* update readme

* no use make

* update readme

* update fix code

* fix editorconfig-checker

* no change convert py

* use clip_image_u8_free