llama : second attempt to refactor vision API #11292
Conversation
Hi @ggerganov @slaren, I would like to ask for an early review from you before proceeding further. What will be interesting to discuss here is the usage of the new API, as demoed in the newly added example.
I'm already able to make llava and mobilevlm work with it.
Things that are different from the initial discussion in #8010:
And things that are still messy and will need more work:
I would love to hear your opinions about this. Thank you!
if (ctx.ctx_ggml) {
    ggml_free(ctx.ctx_ggml);
}
ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
ctx.ctx_ggml = ggml_init(params);
ctx.output = ggml_dup_tensor(ctx.ctx_ggml, output_node);
ggml_backend_alloc_ctx_tensors_from_buft(ctx.ctx_ggml, ctx.model->buft);
ggml_backend_tensor_copy(output_node, ctx.output);
@slaren Not sure if there is a better way, but I'm using a hacky solution here. Without a dedicated context (and ggml_backend_tensor_copy), the underlying buffer is reallocated before the next llama_decode, rendering the data unusable.
If the vision part uses the same scheduler as the llama_context, that's unavoidable. You could pre-allocate the tensor in a different buffer to avoid the copy, but that's an optimization that can be done later.
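A minimal sketch of that pre-allocation idea (my assumption, not code from this PR, and it presumes the output shape is known at init time): allocate a persistent output tensor once, and let the graph copy into it as its final node, so no ggml_backend_tensor_copy is needed afterwards.
// at context init (once); ctx.ctx_out / ctx.buf_out / n_out are hypothetical fields
ggml_init_params ip = {
    /*.mem_size   =*/ ggml_tensor_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
ctx.ctx_out = ggml_init(ip);
ctx.output  = ggml_new_tensor_2d(ctx.ctx_out, GGML_TYPE_F32, n_embd, n_out);
ctx.buf_out = ggml_backend_alloc_ctx_tensors_from_buft(ctx.ctx_out, ctx.model->buft);

// when building the graph: write the result straight into the persistent tensor,
// so the scheduler's own buffers can be reused freely by the next llama_decode
cur = ggml_cpy(ctx0, cur, ctx.output);
ggml_build_forward_expand(gf, cur);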
If we have a separate encoder context for the clip model, the decoder context could reference tensors from it directly. They would be interpreted as inputs for the decoder.
I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call llama_vision_encode and throw them away?
@ngxson I'll take a closer look at this today, and specifically at how this could work with a cross-attention model like Llama 3.2 Vision 👍 One thing that is related to this work is something we discussed about how these models should be provided. I initially thought that creating a single .gguf for Llama 3.2, containing both the vision encoder and the language model, would be the way to go, but as can be read in the linked discussion, having separate models is probably a better solution. It would be great to get some clarification regarding this and whether vision encoders should be separate .gguf models.
@slaren In my first proposal, I made …
Btw, I have repeatedly been asked about …
Adding some thoughts that I have so far.
Continuing along the idea of having separate models and contexts for the encoder and the decoder, I think with a proper llama_batch abstraction we can have the following API:
// vision
patches0 = llama_vision_tokenize(ctx_enc_v, img0);
patches1 = llama_vision_tokenize(ctx_enc_v, img1);
llama_batch_add_image(batch_enc_v, patches0);
llama_batch_add_image(batch_enc_v, patches1);
llama_encode(ctx_enc_v, batch_enc_v);
embd_enc_v = llama_get_embeddings(ctx_enc_v);
// audio
mel0 = llama_audio_tokenize(ctx_enc_a, audio0);
mel1 = llama_audio_tokenize(ctx_enc_a, audio1);
llama_batch_add_audio(batch_enc_a, mel0);
llama_batch_add_audio(batch_enc_a, mel1);
llama_encode(ctx_enc_a, batch_enc_a);
embd_enc_a = llama_get_embeddings(ctx_enc_a);
// text + vision + audio
tokens0 = llama_tokenize(ctx_dec, text0); // text0, text1: raw text parts of the prompt
tokens1 = llama_tokenize(ctx_dec, text1);
llama_batch_add_text (batch_dec, tokens0);
llama_batch_add_embd_image(batch_dec, embd_enc_v);
llama_batch_add_embd_audio(batch_dec, embd_enc_a);
llama_batch_add_text (batch_dec, tokens1);
llama_decode(ctx_dec, batch_dec);
For cross-attention models such as Llama 3.2 Vision and Whisper, the decoding context ctx_dec could be initialized with a reference to the encoder context:
llama_context_params cparams_dec;
cparams_dec.ctx_cross[0] = ctx_enc_v;
cparams_dec.ctx_cross[1] = ctx_enc_a;
Edit: extended the example with audio input as well.
src/llama-vision.cpp
static ggml_cgraph * clip_image_build_graph(clip_context & ctx, int batch_size, clip_image_size & image_size) {
    auto & model   = *ctx.model;
    auto & hparams = ctx.model->hparams;

    const int   hidden_size   = hparams.hidden_size;
    const int   n_head        = hparams.n_head;
    const int   d_head        = hidden_size / n_head;
    const int   patch_size    = hparams.patch_size;
    const float eps           = hparams.eps;
    const int   num_patches   = ((image_size.width / patch_size) * (image_size.height / patch_size));
    const int   num_positions = num_patches + (model.class_embedding ? 1 : 0);

    LLAMA_LOG_DEBUG("%s: num_patches = %d\n", __func__, num_patches);
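For example (hypothetical numbers): with a 336x336 input and patch_size = 14, num_patches = (336/14) * (336/14) = 576 and, if a class embedding is present, num_positions = 577.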
The clip graph should be constructed as any other graph in src/llama.cpp, in llm_build_context.
I'm not sure how to do this right now, as I can't see how I can re-use the existing build_* functions to make the cgraph of vision models "blend in" with the rest of llm_build_context.
What I did so far is to make an equivalent called llama_vision_graph_builder. This is meant to be a temporary solution, to simplify the migration in the near future.
Could you please have a look at my llama_vision_graph_builder to see how it can be merged into llm_build_context? Thanks!
src/llama-vision.cpp
    delete p;
}

int32_t llama_vision_encode(struct llama_context * ctx, llama_vision_patches * p) {
I don't think we need a separate function - we should be able to reuse llama_encode.
Hmm, I don't think we can do this right now, as it requires llama_batch to also accept image tokens. Do you think it's OK to keep llama_vision_encode(llama_img_tokens &) and refactor llama_batch later on?
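For reference, this is roughly how the interim API from this PR's diff fits together (a sketch; subject to change once llama_batch learns to carry image tokens):
// "tokenize" the image on the CPU into patches / image tokens
llama_vision_patches * p = llama_vision_patches_init(ctx, bmp);
// run the vision encoder on the patches
if (llama_vision_encode(ctx, p) != 0) {
    // handle the error
}
// the resulting embeddings are then fed to the decoder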
src/llama-vision.cpp
struct llama_vision_patches * llama_vision_patches_init(
        struct llama_context * ctx,
        llama_vision_bitmap * bmp) {
    clip_context & vctx = ctx->vctx;
    if (vctx.model->hparams.arch == VISION_ARCH_MINICPMV) {
        return new llama_vision_patches(clip_image_preprocess_minicpmv(vctx, *bmp));
    }
    return new llama_vision_patches(clip_image_preprocess(vctx, *bmp));
}
I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches, since this would have to be performed on the CPU.
I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call llama_vision_encode and throw them away? If not, then maybe that could be hidden entirely from the user and llama_vision_encode could take directly an image.
Even though the user cannot explicitly operate with the patches, it seems to make sense to expose this in order to be able to multi-thread the pre-processing step.
Note that we should also consider the case of Whisper in the context of this abstraction. The whisper model takes raw input audio in PCM format, which is first pre-processed into a mel spectrogram. This pre-processing step, similar to the image pre-processing for CLIP and the text tokenization in text models, is performed on the CPU and can be multi-threaded. Of course, any of the three types of pre-processings could be implemented on the GPU with enough effort, but the important aspect is that this pre-processing can be done in parallel for different inputs and once computed, can be reused with different contexts.
In all cases, the pre-processed input is passed to the transformer graph and the first step is always to convert this input into embeddings. For text, this conversion is trivial - ggml_get_rows(w, tokens). For Whisper, this process involves a couple of convolutions of the mel spectrogram:
For CLIP, this appears to be again a convolution operator applied to the pre-processed input (the image patches) in order to obtain the initial embeddings:
All these conversions of the pre-processed input (tokens, mel, patches) into the initial embeddings should be implemented in a single place: build_inp_embd().
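A hypothetical sketch of such a unified build_inp_embd() (the enum, model fields and signature are made up for illustration; this is not the actual llama.cpp code):
// kind of pre-processed input carried by the batch (hypothetical)
enum inp_type { INP_TEXT, INP_MEL, INP_PATCHES };

static struct ggml_tensor * build_inp_embd(
        struct ggml_context * ctx0,
        enum inp_type         type,
        struct ggml_tensor  * inp,      // token ids, mel spectrogram or image patches
        const my_model      & model) {  // hypothetical struct holding the weights used below
    switch (type) {
        case INP_TEXT:
            // text: trivial embedding lookup
            return ggml_get_rows(ctx0, model.tok_embd, inp);
        case INP_MEL:
            { // Whisper-style: two 1D convolutions with GELU over the mel spectrogram
                struct ggml_tensor * cur = ggml_gelu(ctx0, ggml_conv_1d_ph(ctx0, model.conv1_w, inp, 1, 1));
                return ggml_gelu(ctx0, ggml_conv_1d_ph(ctx0, model.conv2_w, cur, 2, 1));
            }
        case INP_PATCHES:
            // CLIP-style: 2D convolution over the image, kernel and stride equal to the patch size
            return ggml_conv_2d(ctx0, model.patch_embd_w, inp,
                                model.patch_size, model.patch_size, 0, 0, 1, 1);
    }
    return nullptr;
}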
I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches
Makes sense then. I realized that I always associated the notion of "token" with "text", but a quick Google search tells me that "In LLMs, a token is a basic unit of input or output [...]".
In that sense, I would propose calling it llama_vision_img_tokens (though it can be a bit confusing, because users may expect it to be a std::vector due to the plural "tokens").
// Structure represents the basic input unit of vision model
// This can be a processed image or slices of images under the hood
struct llama_vision_img_tokens;
// User must reserve N number of tokens in tokenized text prompt for each image
int32_t llama_vision_get_n_tokens(const llama_vision_img_tokens * img_tokens);
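A possible usage sketch on top of this proposal (function names follow the proposal above and the earlier API example; nothing here is merged API):
// pre-process the image on the CPU; can be done in parallel for different images
llama_vision_img_tokens * img_tokens = llama_vision_tokenize(vctx, bmp);

// the caller reserves this many placeholder positions in the tokenized text prompt,
// which are later filled with the embeddings produced by the vision encoder
int32_t n_img = llama_vision_get_n_tokens(img_tokens);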
@ngxson Sorry about the delay. I've been able to "force" support for mllama using the latest vision API, that is, get an example working. I'm now going to iterate on this and try to figure out how cross attention will work. Just wanted to let you know that some progress is being made. There is an issue I'm having with the vocab size which I'm not exactly sure how to handle. If anyone has some thoughts around this, please let me know.
@danbev No worries, I was busy with minicpm-v too. It's still not fully working (inference works, but the llava-uhd preprocessor is missing). Will have a look at your implementation of mllama very soon.
So, the minicpm-v template is more complicated because it contains both the image and all the slices. Here is what it looks like:
To get rid of this complication, my idea is to have the embeddings of these tokens (…) provided by the vision model. This will make this formatting transparent to the text tokenizer, but will require the embeddings of these tokens to be stored as one-hot vectors in the vision model (of course we can use …).
OK, so I managed to get minicpm-v to kinda work out of the box with the API (no changes to user-space code are required). Upon giving it the Windows XP wallpaper "Bliss", it says:
It currently operates on a resized version of the image (like llava), so the performance will be bad for bigger images (with more details). I'll get llava-uhd to work, which breaks the image into slices and thus allows the LLM to "see" the image at different zoom levels, preserving details.
I also got SmolVLM (tested with the 500M model) to work with this API without any change to user-space code. The image preprocessor may not be 100% correct, but I'll discuss with the SmolVLM team to learn more about it. For the bliss.jpg test:
So what remains to be done before this PR can be merged?
I'm going back to this PR; my goals for this week are:
@ngxson
@ngxson looks very promising!
llama.cpp main branch builds fine, following these build instructions:
yes
reconfigure your cmake …
llama_vision_context * ctx = new llama_vision_context;
ctx->model = &model->vit;

// TODO: this looks ugly, mostly copied from llama.cpp, refactor it in the future
@ggerganov So I finally got a first version of the vision API with a dedicated context:
common_init_result llama_init = common_init_from_params(params);
const llama_model * model = llama_init.model.get();
llama_vision_context_params vparams = llama_vision_context_default_params();
llama_vision_context * vctx = llama_vision_init_from_model(model, vparams);
// then, use vctx
I don't yet have time to look deeper into your refactoring #11213, but do you think the API that I introduced here can benefit from that?
For context, what I'm currently missing here is the code to set up the various backends. Ideally, that should be abstracted out into a "super" class, and llama_vision_init_from_model should only be responsible for setting up the output tensor, preprocessor, etc. So I just want to ask if you're having the same idea.
Few notes:
- llama_vision_context should not exist in the public API.
- llama_vision_init_from_model is the same as llama_init_from_model, but it will internally create a derived llama_vision_context : llama_context based on the model parameters. It can of course create different contexts based both on the model parameters and the context parameters (e.g. different KV cache types, decoder vs encoder context, etc.).
- Any vision-related context parameters should be appended to the existing struct llama_context_params. No need for a new struct llama_vision_context_params.
- The base llama_context would be responsible for setting up the backend logic and distributing the model tensors. This should be the same for all models.
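A rough illustration of that layering (members are placeholders, not the actual llama.cpp classes):
struct llama_context {
    virtual ~llama_context() = default;
    // backend/scheduler setup, model tensor distribution, KV cache, ... (common to all models)
};

struct llama_vision_context : llama_context {
    // vision-specific state only: image pre-processor, encoder output tensor, ...
};

// llama_init_from_model() decides internally which concrete context to create,
// based on the model parameters and on vision-related fields added to llama_context_params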
I am wondering if we should first reimplement llama_batch to become private, so we can make more non-breaking changes to it. Ideally, the prompt processing in the new vision example should be possible to do with a single llama_decode on a multi-modal batch that we create something like this:
llama_batch_add_text (batch_dec, tokens0);
llama_batch_add_embd_image(batch_dec, embd_enc_v);
llama_batch_add_text (batch_dec, tokens1);
llama_decode(ctx, batch_dec);
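Purely as an illustration of what "private" could look like, such a multi-modal batch might internally store a sequence of typed items (this is a sketch, not the actual struct):
enum llama_batch_item_type {
    LLAMA_BATCH_ITEM_TEXT,        // token ids
    LLAMA_BATCH_ITEM_EMBD_IMAGE,  // embeddings from a vision encoder context
    LLAMA_BATCH_ITEM_EMBD_AUDIO,  // embeddings from an audio encoder context
};

struct llama_batch_item {
    enum llama_batch_item_type type;
    const llama_token * tokens;   // used for TEXT items
    const float       * embd;     // used for EMBD_* items
    int32_t             n;        // number of tokens or embedding vectors
};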
Ok thanks for the explanation, that seems good to me, I'll update this PR accordingly.
I am wondering if we should first reimplement llama_batch to become private, so we can make more non-breaking changes to it.
Yes, it would be very nice to have this. Indeed, this will also resolve #10381, which is currently blocked. Not sure if you're going to take this task or if you'd prefer letting me make a draft PR?
Btw, some of the suggestions are quite dependent on the changes in #11213. I'm a bit stuck, unfortunately, and haven't made progress lately. Depending on how things go this week, I will either finish the refactor or just scrap the changes and let other people attempt to work on this. I think the ideas there are sound, but it's not very easy for me to implement it.
Yes, it would be very nice to have this. Indeed, this will also resolve #10381, which is currently blocked. Not sure if you're going to take this task or if you'd prefer letting me make a draft PR?
Before #11213 is resolved, I don't plan to make changes to this part. So feel free to give it a shot.
SmolVLM 500M can already be run via the current PR; you should base your work on this, not the other one.
First of all, great work!
Looking forward to seeing this get merged; a couple of other PRs seem to depend on it.
@agNihit928 I think something got buggy when I rebased onto the latest master; you can maybe go back to c3a654c to see if it works.
Sure @ngxson
Fix #8010
Supersede #9687
To test this, please refer to #9687 to convert the model to GGUF.
Then,
Goals of this PR:
- introduce the llama_vision API
- provide something equivalent to the Processor class in the HF library

Things that will be done in follow-up PRs: