support qwen2-vl #32318
Conversation
You are missing a few files for the auto-mapping to work! I would recommend running transformers-cli add-new-model-like
and overwriting the config, md, etc. with what you have here!
Then you should be able to ping @zucchini-nlp for a review on this new multimodal model!
Great addition! Yes, after adding auto maps and md files, feel free to tag for review. Let me know if you need any help with that
hi @zucchini-nlp, I've tidied up all the files and all test cases pass.
Thanks for working on this! Great to see more multimodal LLMs.
My main concern with the current implementation is the chat template format: I wouldn't recommend passing images/processing kwargs in the template. We would also need some changes to stay consistent with transformers models; I left more comments below.
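As a rough illustration of the separation being suggested here, the template only describes roles and placeholders, while images and any processing options go to the processor call (the checkpoint id and image path below are placeholders, not taken from this PR):

    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

    # the chat template only carries roles and placeholders; no resize/processing kwargs in it
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

    # images and processing options are passed to the processor call, not the template
    image = Image.open("example.jpg")
    inputs = processor(images=[image], text=[prompt], return_tensors="pt")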
Some last nits but should be good to go! 🔥
if images is not None:
    pixel_values, vision_grid_thws = [], []
    for image in images:
        patches, image_grid_thw = self._preprocess(
self._preprocess already loops on the provided images, why are we not simply using self._preprocess?
We batch the images as one flat sequence, since different images have different sequence lengths.
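To make the reasoning concrete, a toy sketch of the batching scheme (all shapes below are made up for illustration): patch sequences of different lengths are concatenated into one flat sequence, and the per-image (t, h, w) grids record how to split it back.

    import torch

    patch_dim = 1176                               # e.g. 3 * 2 * 14 * 14 flattened patch features
    patches_per_image = [64, 144, 100]             # varies with each image's resolution
    grids = [(1, 8, 8), (1, 12, 12), (1, 10, 10)]  # (t, h, w) per image, with t * h * w == num_patches

    # concatenate along the sequence axis instead of padding to a fixed batch shape
    pixel_values = torch.cat([torch.randn(n, patch_dim) for n in patches_per_image], dim=0)
    image_grid_thw = torch.tensor(grids)

    print(pixel_values.shape)    # torch.Size([308, 1176]) -> one flat sequence for all images
    print(image_grid_thw.shape)  # torch.Size([3, 3])      -> how to split the sequence back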
self.mlp = nn.Sequential(
    nn.Linear(self.hidden_size, self.hidden_size),
    nn.GELU(),
    nn.Linear(self.hidden_size, dim),
)
this could also just use the VisionMlp with gelu
We do not want to reuse VisionMlp here, since the two have different semantics.
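For context, the contrast being made is roughly a width-preserving block MLP versus a merger projection that changes dimensionality; a small sketch of the shape difference (the concrete widths are illustrative, not quoted from the PR):

    import torch.nn as nn

    embed_dim = 1280        # vision hidden size (illustrative)
    lm_hidden_size = 3584   # language model hidden size (illustrative)

    # a VisionMlp-style block: width-preserving, dim -> hidden_dim -> dim
    block_mlp = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(), nn.Linear(4 * embed_dim, embed_dim))

    # the merger MLP quoted above: projects merged patch features to the LM width
    merged_dim = embed_dim * 4  # e.g. after merging a 2x2 group of spatial patches
    merger_mlp = nn.Sequential(nn.Linear(merged_dim, merged_dim), nn.GELU(), nn.Linear(merged_dim, lm_hidden_size))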
for i in range(1, len(cu_seqlens)):
    attention_mask[..., cu_seqlens[i - 1] : cu_seqlens[i], cu_seqlens[i - 1] : cu_seqlens[i]] = True
this can probably be vectorized, but good enough for now!
It's not trivial to vectorize because of the dynamic sequence lengths.
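For what it's worth, one possible vectorized construction (not part of this PR), assuming cu_seqlens is a 1-D tensor of cumulative lengths that starts at 0:

    import torch

    cu_seqlens = torch.tensor([0, 64, 208, 308])  # example cumulative sequence lengths

    # assign each token the index of the sequence it belongs to, then compare pairwise:
    # two tokens may attend to each other exactly when they share a sequence index
    seq_ids = torch.repeat_interleave(
        torch.arange(cu_seqlens.numel() - 1),
        cu_seqlens[1:] - cu_seqlens[:-1],
    )
    attention_mask = seq_ids[:, None] == seq_ids[None, :]  # (308, 308) block-diagonal bool mask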
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += cache_position[0] + 1
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
why don't we use rotary_seq_len = cache_position[-1] here?
This segment of code is primarily copied from Qwen2. I've noticed some recent changes in the implementation. Would it be better to modify it like this to maintain consistency?
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
ah, actually no for this part, as get_usable_length is "old", sorry for that. I was mostly commenting on the fact that cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) does not use the rotary seq length argument, while the FlashAttention path does.
I think line 763 should be kv_seq_len = cache_position[0] + 1. Btw, this line seems to be useless in Qwen2VLAttention.
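As a toy illustration of the cache_position arithmetic in question (the numbers are made up): cache_position holds the absolute positions of the tokens being processed, so during single-token decoding the full key/value length is the current position plus one.

    import torch

    # prefill of 5 tokens, then one decode step
    prefill_cache_position = torch.arange(0, 5)   # positions 0..4, kv length afterwards is 5
    decode_cache_position = torch.tensor([5])     # one new token at position 5

    kv_seq_len = decode_cache_position[0] + 1     # 6: total keys/values after appending to the cache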
RoPE for Qwen has been modified after this PR, so we don't rely on kv-length anymore; so yes, the variable is useless now :)
Hi @ArthurZucker, I think we are all good for merging this PR?
Totally forgot about this, can we swap the order of input args for the processor so that it is 'images, text, ...'? We are doing processor standardization and it'll be easier to have the correct order from the beginning, instead of deprecating one more model. I'll take care of the whole standardization for Qwen2VLProcessor kwargs later.
What is the correct order? Alphabetical order?
No, it's just that the inputs should be in 'image, text, video' order, while now it is 'text, images, video, ...'. Then you can leave the order of the other kwargs as it is; we'll take care of the rest.
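A hypothetical signature sketch of the requested positional order (not copied from the PR; kwargs beyond the first three are omitted):

    class Qwen2VLProcessor:
        def __call__(self, images=None, text=None, videos=None, **kwargs):
            # images first, then text, then videos, per the processor standardization above
            ...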
done.
Yep, gimme a minute to check the new changes and merge accordingly!
okay one final nit and let's merge! 🔥
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += cache_position[0] + 1
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
Thanks a lot for bearing with me, we'll actually take care of changing that in another PR, let's merge 🤗
thanks a lot! I really appreciate you guys' effort!
* support-qwen2-vl * tidy * tidy * tidy * tidy * tidy * tidy * tidy * hyphen->underscore * make style * add-flash2-tipd * delete-tokenize=False * remove-image_processor-in-init-file * add-qwen2_vl-in-MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES * format-doct * support-Qwen2VLVisionConfig * remove-standardize_cache_format * fix-letter-varaibles * remove-torch-in-image-processor * remove-useless-docstring * fix-one-letter-varaible-name * change-block-name * default-quick-gelu-in-vision * remove-useless-doc * use-preimplemented-flash-forward * fix-doc * fix-image-processing-doc * fix-apply-rotary-embed * fix-flash-attn-sliding-window * refactor * remove-default_template * remove-reorder_cache * simple-get-rope_deltas * update-prepare_inputs_for_generation * update-attention-mask * update-rotary_seq_len * remove-state * kv_seq_length * remove-warning * _supports_static_cache * remove-legacy-cache * refactor * fix-replace * mrope-section-doc * code-quality * code-quality * polish-doc * fix-image-processing-test * update readme * Update qwen2_vl.md * fix-test * Update qwen2_vl.md * nit * processor-kwargs * hard-code-norm_layer * code-quality * discard-pixel-values-in-gen * fix-inconsistent-error-msg * unify-image-video * hidden_act * add-docstring * vision-encode-as-PreTrainedModel * pixel-to-target-dtype * update doc and low memoryvit * format * format * channel-foramt * fix vit_flashatt * format * inherit-Qwen2VLPreTrainedModel * simplify * format-test * remove-one-line-func-in-image-processing * avoid-one-line-reshape * simplify-rotary_seq_len * avoid-single-letter-variable * no-for-loop-sdpa * avoid-single-letter-variable * remove-one-line-reshape * remove-one-line-reshape * remove-no-rope-in-vit-logic * default-mrope * add-copied-from * more-docs-for-mrope * polish-doc * comment-and-link * polish-doc * single-letter-variables * simplify-image-processing * video->images * kv_seq_len-update * vision-rope-on-the-fly * vision-eager-attention * change-processor-order --------- Co-authored-by: baishuai <baishuai.bs@alibaba-inc.com> Co-authored-by: ShuaiBai623 <43326198+ShuaiBai623@users.noreply.github.com>
What does this PR do?
Fixes # (issue)
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.