support qwen2-vl #32318
Conversation
You are missing a few files for the auto-mapping to work! I would recommend running transformers-cli add-new-model-like
and overwriting the config, md, etc. with what you have here!
Then you should be able to ping @zucchini-nlp for a review on this new multimodal model!
Great addition! Yes, after adding auto maps and md files, feel free to tag for review. Let me know if you need any help with that
hi @zucchini-nlp, I've tidied up all the files and all test cases pass.
Thanks for working on this! Great to see more multimodal LLMs.
My main concern with the current implementation is the chat template format: I wouldn't recommend passing images/processing kwargs in the template. We would also need some changes to stay consistent with transformers models; I left more comments below.
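As a rough illustration of the separation being suggested here, the template only describes roles and placeholders, while images and any processing options go to the processor call (the checkpoint id and image path below are placeholders, not taken from this PR):

    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

    # the chat template only carries roles and placeholders; no resize/processing kwargs in it
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

    # images and processing options are passed to the processor call, not the template
    image = Image.open("example.jpg")
    inputs = processor(images=[image], text=[prompt], return_tensors="pt")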
Some last nits but should be good to go! 🔥
if images is not None:
    pixel_values, vision_grid_thws = [], []
    for image in images:
        patches, image_grid_thw = self._preprocess(
self._preprocess already loops on the provided images, why are we not simply using self._preprocess?
We batch the images as one flat sequence, since different images have different sequence lengths.
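To make the reasoning concrete, a toy sketch of the batching scheme (all shapes below are made up for illustration): patch sequences of different lengths are concatenated into one flat sequence, and the per-image (t, h, w) grids record how to split it back.

    import torch

    patch_dim = 1176                               # e.g. 3 * 2 * 14 * 14 flattened patch features
    patches_per_image = [64, 144, 100]             # varies with each image's resolution
    grids = [(1, 8, 8), (1, 12, 12), (1, 10, 10)]  # (t, h, w) per image, with t * h * w == num_patches

    # concatenate along the sequence axis instead of padding to a fixed batch shape
    pixel_values = torch.cat([torch.randn(n, patch_dim) for n in patches_per_image], dim=0)
    image_grid_thw = torch.tensor(grids)

    print(pixel_values.shape)    # torch.Size([308, 1176]) -> one flat sequence for all images
    print(image_grid_thw.shape)  # torch.Size([3, 3])      -> how to split the sequence back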
self.mlp = nn.Sequential(
    nn.Linear(self.hidden_size, self.hidden_size),
    nn.GELU(),
    nn.Linear(self.hidden_size, dim),
)
this could also just use the VisionMlp with gelu
We do not want to reuse VisionMlp here, since the two have different semantics.
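For context, the contrast being made is roughly a width-preserving block MLP versus a merger projection that changes dimensionality; a small sketch of the shape difference (the concrete widths are illustrative, not quoted from the PR):

    import torch.nn as nn

    embed_dim = 1280        # vision hidden size (illustrative)
    lm_hidden_size = 3584   # language model hidden size (illustrative)

    # a VisionMlp-style block: width-preserving, dim -> hidden_dim -> dim
    block_mlp = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(), nn.Linear(4 * embed_dim, embed_dim))

    # the merger MLP quoted above: projects merged patch features to the LM width
    merged_dim = embed_dim * 4  # e.g. after merging a 2x2 group of spatial patches
    merger_mlp = nn.Sequential(nn.Linear(merged_dim, merged_dim), nn.GELU(), nn.Linear(merged_dim, lm_hidden_size))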
for i in range(1, len(cu_seqlens)):
    attention_mask[..., cu_seqlens[i - 1] : cu_seqlens[i], cu_seqlens[i - 1] : cu_seqlens[i]] = True
this can probably be vectorized, but good enough for now!
It's not trivial to vectorize because of the dynamic sequence lengths.
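For what it's worth, one possible vectorized construction (not part of this PR), assuming cu_seqlens is a 1-D tensor of cumulative lengths that starts at 0:

    import torch

    cu_seqlens = torch.tensor([0, 64, 208, 308])  # example cumulative sequence lengths

    # assign each token the index of the sequence it belongs to, then compare pairwise:
    # two tokens may attend to each other exactly when they share a sequence index
    seq_ids = torch.repeat_interleave(
        torch.arange(cu_seqlens.numel() - 1),
        cu_seqlens[1:] - cu_seqlens[:-1],
    )
    attention_mask = seq_ids[:, None] == seq_ids[None, :]  # (308, 308) block-diagonal bool mask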
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += cache_position[0] + 1
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
why don't we use rotary_seq_len = cache_position[-1] here?
This segment of code is primarily copied from Qwen2. I've noticed some recent changes in the implementation. Would it be better to modify it like this to maintain consistency?
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
ah, actually no for this part, as get_usable_length is "old", sorry for that. I was mostly commenting on the fact that cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) does not use the rotary seq length argument, while the FlashAttention path does.
I think line 763 should be kv_seq_len = cache_position[0] + 1. Btw, this line seems to be useless in Qwen2VLAttention.
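As a toy illustration of the cache_position arithmetic in question (the numbers are made up): cache_position holds the absolute positions of the tokens being processed, so during single-token decoding the full key/value length is the current position plus one.

    import torch

    # prefill of 5 tokens, then one decode step
    prefill_cache_position = torch.arange(0, 5)   # positions 0..4, kv length afterwards is 5
    decode_cache_position = torch.tensor([5])     # one new token at position 5

    kv_seq_len = decode_cache_position[0] + 1     # 6: total keys/values after appending to the cache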
RoPE for Qwen has been modified after this PR, so we don't rely on kv-length anymore; so yes, the variable is useless now :)
Hi @ArthurZucker, I think we are all good for merging this PR?
Totally forgot about this, can we swap the order of input args for the processor so that it is 'images, text, ...'? We are doing processor standardization and it'll be easier to have the correct order from the beginning, instead of deprecating one more model. I'll take care of the whole standardization for Qwen2VLProcessor kwargs later.
What is the correct order? Alphabetical order?
No, it's just that the inputs should be in 'image, text, video' order, while now it is 'text, images, video, ...'. Then you can leave the order of the other kwargs as it is; we'll take care of the rest.
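A hypothetical signature sketch of the requested positional order (not copied from the PR; kwargs beyond the first three are omitted):

    class Qwen2VLProcessor:
        def __call__(self, images=None, text=None, videos=None, **kwargs):
            # images first, then text, then videos, per the processor standardization above
            ...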
done.
Yep, gimme a minute to check the new changes and merge accordingly!
okay one final nit and let's merge! 🔥
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
    kv_seq_len += cache_position[0] + 1
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
Thanks a lot for bearing with me, we'll actually take care of changing that in another PR, let's merge 🤗
thanks a lot! I really appreciate you guys' effort!
* support-qwen2-vl * tidy * tidy * tidy * tidy * tidy * tidy * tidy * hyphen->underscore * make style * add-flash2-tipd * delete-tokenize=False * remove-image_processor-in-init-file * add-qwen2_vl-in-MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES * format-doct * support-Qwen2VLVisionConfig * remove-standardize_cache_format * fix-letter-varaibles * remove-torch-in-image-processor * remove-useless-docstring * fix-one-letter-varaible-name * change-block-name * default-quick-gelu-in-vision * remove-useless-doc * use-preimplemented-flash-forward * fix-doc * fix-image-processing-doc * fix-apply-rotary-embed * fix-flash-attn-sliding-window * refactor * remove-default_template * remove-reorder_cache * simple-get-rope_deltas * update-prepare_inputs_for_generation * update-attention-mask * update-rotary_seq_len * remove-state * kv_seq_length * remove-warning * _supports_static_cache * remove-legacy-cache * refactor * fix-replace * mrope-section-doc * code-quality * code-quality * polish-doc * fix-image-processing-test * update readme * Update qwen2_vl.md * fix-test * Update qwen2_vl.md * nit * processor-kwargs * hard-code-norm_layer * code-quality * discard-pixel-values-in-gen * fix-inconsistent-error-msg * unify-image-video * hidden_act * add-docstring * vision-encode-as-PreTrainedModel * pixel-to-target-dtype * update doc and low memoryvit * format * format * channel-foramt * fix vit_flashatt * format * inherit-Qwen2VLPreTrainedModel * simplify * format-test * remove-one-line-func-in-image-processing * avoid-one-line-reshape * simplify-rotary_seq_len * avoid-single-letter-variable * no-for-loop-sdpa * avoid-single-letter-variable * remove-one-line-reshape * remove-one-line-reshape * remove-no-rope-in-vit-logic * default-mrope * add-copied-from * more-docs-for-mrope * polish-doc * comment-and-link * polish-doc * single-letter-variables * simplify-image-processing * video->images * kv_seq_len-update * vision-rope-on-the-fly * vision-eager-attention * change-processor-order --------- Co-authored-by: baishuai <baishuai.bs@alibaba-inc.com> Co-authored-by: ShuaiBai623 <43326198+ShuaiBai623@users.noreply.github.com>
What does this PR do?
Fixes # (issue)
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.