
add self.head_dim for VisionAttention in Qwen2-VL #33211

Merged: 14 commits merged into huggingface:main on Sep 6, 2024

Conversation

@GeLee-Q GeLee-Q (Contributor) commented Aug 30, 2024

Add self.head_dim for VisionAttention in Qwen2-VL

This PR adds the self.head_dim attribute to the VisionAttention class in the Qwen2-VL model. This addition is necessary for proper dimension calculations in the attention mechanism of the vision component.

Changes made

  • Added self.head_dim attribute to the VisionAttention class
  • Initialized self.head_dim with the appropriate value

Motivation

The head_dim attribute is crucial for calculating attention scores and outputs correctly. Its addition ensures that the vision attention mechanism in Qwen2-VL operates as intended, maintaining consistency with the model's architecture.
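
For context, here is a minimal sketch of how the attribute is used. It follows the standard reshape-and-scale attention pattern and omits the rotary embeddings and attention masking of the real Qwen2-VL implementation, so treat it as an illustration rather than the actual model code:

    import math
    import torch
    import torch.nn as nn

    class VisionAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 16) -> None:
            super().__init__()
            self.num_heads = num_heads
            # The attribute added by this PR: per-head width of the hidden dimension.
            self.head_dim = dim // num_heads
            self.qkv = nn.Linear(dim, dim * 3, bias=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            seq_len, _ = hidden_states.shape
            # Split the joint QKV projection into per-head queries, keys, and values.
            q, k, v = (
                self.qkv(hidden_states)
                .reshape(seq_len, 3, self.num_heads, self.head_dim)
                .permute(1, 2, 0, 3)
                .unbind(0)
            )
            # Without self.head_dim, this scaling line raises AttributeError.
            attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(0, 1).reshape(seq_len, -1)
            return self.proj(out)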

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [✅] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [✅] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@simonJJJ @zucchini-nlp @ArthurZucker

@@ -275,6 +275,7 @@ class VisionAttention(nn.Module):
     def __init__(self, dim: int, num_heads: int = 16) -> None:
         super().__init__()
         self.num_heads = num_heads
+        self.head_dim = dim // num_heads
Collaborator

good, I am a bit baffled as to how this was not caught, the math.sqrt could not have run 😅
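
For readers wondering how this slipped through: any forward pass that scales by math.sqrt(self.head_dim), as in the sketch above, fails immediately when the attribute is missing. A minimal reproduction (illustrative, not the real class):

    import math
    import torch.nn as nn

    class Broken(nn.Module):
        def __init__(self, dim: int, num_heads: int = 16) -> None:
            super().__init__()
            self.num_heads = num_heads  # head_dim is never set

    # Accessing Broken(1280).head_dim raises:
    # AttributeError: 'Broken' object has no attribute 'head_dim'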

Collaborator

We have a few failing tests: https://github.com/huggingface/transformers/actions/runs/10656977518/job/29536379001#step:13:694 but this was not caught.

     @require_bitsandbytes
    def test_small_model_integration_test_batch_different_resolutions(self):
        model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", load_in_4bit=True)
>       text, vision_infos = self.processor.apply_chat_template(
            self.messages, tokenize=False, add_generation_prompt=True
        )
E       ValueError: too many values to unpack (expected 2)

this one needs to be updated
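
Presumably the fix is to stop unpacking two values, since apply_chat_template returns a single formatted string here:

    # apply_chat_template returns one string, not a (text, vision_infos) pair.
    text = self.processor.apply_chat_template(
        self.messages, tokenize=False, add_generation_prompt=True
    )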

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 2, 2024

@ArthurZucker Hello, I found the code related to vision_infos in the file vision_process.py in the QwenLM repository. However, the Qwen2-VL processor in transformers does not have an interface to process vision_info, so I added a function to handle it.
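
For reference, a sketch of what such a helper could look like, loosely following extract_vision_info from QwenLM's vision_process.py; the exact function added in this PR may differ (note the list[dict] annotations, which become relevant further down):

    def extract_vision_info(
        self, conversations: list[dict] | list[list[dict]]
    ) -> list[dict]:
        """Collect image/video entries from chat-format messages."""
        vision_infos = []
        # Treat a single conversation as a batch of one.
        if conversations and isinstance(conversations[0], dict):
            conversations = [conversations]
        for conversation in conversations:
            for message in conversation:
                if isinstance(message.get("content"), list):
                    for item in message["content"]:
                        if item.get("type") in ("image", "image_url", "video"):
                            vision_infos.append(item)
        return vision_infos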

@ArthurZucker ArthurZucker (Collaborator) left a comment

Thanks, were you able to run all the tests for this? 🤗

Comment on lines 434 to 436
vision_infos = self.extract_vision_info(messages2)
image_url = vision_infos[0]["image"]
image_input2 = Image.open(requests.get(image_url, stream=True).raw)
Collaborator

I think the processor is supposed to be able to handle URLs or images and open them properly; if that's not currently the case, it would make sense to add it for easier usage, no? 🤗

Member

Yes, the url/path can be handled, and it is currently handled in idefics-1. But imo the idefics-1 design is a bit ugly and is an issue for pipelines; we'd need a better way to handle those.

The original PR for adding QwenVL had a pretty nice chat template, yet I didn't want to add extract_vision_info yet, at least not before making sure it's something we can maintain easily for most VLMs.
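
As a self-contained illustration of the kind of input normalization being discussed (transformers has its own image-loading utilities; this is just the idea, not an existing API), a processor could accept URLs, local paths, or already-open images with something like:

    import os
    import requests
    from PIL import Image

    def open_image(image) -> Image.Image:
        # Accept an already-open PIL image, a local file path, or a URL.
        if isinstance(image, Image.Image):
            return image
        if isinstance(image, str) and image.startswith(("http://", "https://")):
            return Image.open(requests.get(image, stream=True).raw)
        if isinstance(image, str) and os.path.isfile(image):
            return Image.open(image)
        raise ValueError(f"Unsupported image input: {image!r}")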

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 2, 2024

Thanks, were you able to run all the tests for this? 🤗

By commenting out @slow and @require_bitsandbytes, I ran test_small_model_integration_test_batch_wo_image locally. However, since I used the qwen2-vl-2b model, it caused an OOM error on a single-card A800 machine. I verified the code runs correctly up to output = model.generate(**inputs, max_new_tokens=30). Tomorrow I will test again using Qwen2-VL-2B-Instruct-GPTQ-Int4.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 3, 2024

Thanks, were you able to run all the tests for this? 🤗

@ArthurZucker Hi! I ran these tests and encountered the following issues. Some test results show minor precision deviations; the unit test code is fine, but the model may need further precision alignment. For the batch inference tests I hit OOM issues, which required manually resizing the images; I resized them to [256, 256]. If there are no further issues, could you please merge my code?

Qwen2VLModelTest.test_batching_equivalence

transformers/tests/test_modeling_common.py:735: in recursive_check
    self.assertTrue(
E   AssertionError: tensor(False, device='cuda:0') is not true : Batched and Single row outputs are not equal in Qwen2VLForConditionalGeneration 
for key=logits. Difference=0.0031325221061706543.

Qwen2VLIntegrationTest.test_small_model_integration_test

# for the pixel values
assert torch.allclose(expected_pixel_slice, inputs.pixel_values[:6, :3], atol=1e-3)
E       assert False
E        +  where False = <built-in method allclose of type object at 0x7f766a7242e0>(tensor([[0.8501, 0.8647, 0.8647],\n        [1.0106, 1.0106, 1.0252],\n        [0.9960, 1.0106, 1.0252],\n        [1.0982, 1.1128, 1.1274],\n        [1.0836, 1.0982, 1.0982],\n        [1.1858, 1.1858, 1.1858]]), tensor([[0.8501, 0.8501, 0.8647],\n        [0.9376, 0.9376, 0.9376],\n        [0.9084, 0.9376, 0.9376],\n        [1.0252, 1.0252, 1.0544],\n        [1.0252, 1.0252, 1.0252],\n        [1.0836, 1.0836, 1.0836]]), atol=0.001)
E        +    where <built-in method allclose of type object at 0x7f766a7242e0> = torch.allclose

# for the LLM output
E       AssertionError: 'syst[60 chars]this?\nassistant\nThe dog in the picture appea[117 chars]ices' != 'syst[60 chars]this?assistant\nThe dog in the picture appears[106 chars]ure,'
E       Diff is 686 characters long. Set self.maxDiff to None to see it.

Qwen2VLIntegrationTest.test_small_model_integration_test_batch_wo_image

EXPECTED_DECODED_TEXT = [
"system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?assistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and outgoing personalities, as well as their",
"system\nYou are a helpful assistant.user\nWho are you?assistant\nI am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with various tasks and answer a wide range of questions to",
]
output_text :['system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.']

E       AssertionError: Lists differ: ['sys[61 chars]this?\nassistant\nThe dog in the picture appea[262 chars]en.'] != ['sys[61 chars]this?assistant\nThe dog in the picture appears[320 chars] to']
E       
E       First differing element 0:
E       'syst[60 chars]this?\nassistant\nThe dog in the picture appea[117 chars]ices'
E       'syst[60 chars]this?assistant\nThe dog in the picture appears[108 chars]heir'
E       
E       Diff is 1060 characters long. Set self.maxDiff to None to see it.

Qwen2VLIntegrationTest.test_small_model_integration_test_batch

By cropping the images, the OOM (Out of Memory) issue has been resolved.

E       AssertionError: Lists differ: ['sys[61 chars]this?\nassistant\nThis is a golden retriever.'[225 chars]ith'] != ['sys[61 chars]this?assistant\nThe dog in the picture appears[326 chars]re,']
E       
E       First differing element 0:
E       'syst[60 chars]this?\nassistant\nThis is a golden retriever.'
E       'syst[60 chars]this?assistant\nThe dog in the picture appears[106 chars]ure,'
E       
E       Diff is 879 characters long. Set self.maxDiff to None to see it.

Qwen2VLIntegrationTest.test_small_model_integration_test_batch_different_resolutions

By cropping the images, the OOM (Out of Memory) issue has been resolved.

E       AssertionError: Lists differ: ['sys[61 chars]this?\nassistant\nThe dog in the picture appea[149 chars]dor'] != ['sys[61 chars]this?assistant\nThe dog in the picture appears[249 chars]en.']
E       
E       First differing element 0:
E       'syst[60 chars]this?\nassistant\nThe dog in the picture appea[15 chars]ador'
E       'syst[60 chars]this?assistant\nThe dog in the picture appears[106 chars]ure,'
E       
E       Diff is 806 characters long. Set self.maxDiff to None to see it.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

@ArthurZucker @zucchini-nlp Hello, I was wondering if this PR could be merged. Based on this code, I discovered a precision issue with fp16 inference. I've temporarily resolved this issue by modifying the source code, and I will create a separate PR to address this problem.

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

Hi!

Could you push an empty commit with message

[run-slow] qwen2_vl

Thanks!
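
For anyone following along, such an empty commit can be created with plain git:

    git commit --allow-empty -m "[run-slow] qwen2_vl"
    git push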

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

Hi!

Could you push an empty commit with message

[run-slow] qwen2_vl

Thanks!

Okay, I've submitted it. Is there anything else that needs to be done?

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

Okay, I've submitted it. Is there anything else that needs to be done?

Not at this moment :-) Have to wait for CI's results.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

Okay, I've submitted it. Is there anything else that needs to be done?

Not at this moment :-) Have to wait for CI's results.

@ydshieh Hello, during CI I encountered an error related to Python versioning: the list[dict] syntax is only supported on Python 3.9+. This code originates from the Qwen team's source code; do you think it needs to be modified?

==================================== ERRORS ====================================
_______ ERROR collecting tests/models/qwen2_vl/test_modeling_qwen2_vl.py _______
tests/models/qwen2_vl/test_modeling_qwen2_vl.py:304: in <module>
    class Qwen2VLIntegrationTest(unittest.TestCase):
tests/models/qwen2_vl/test_modeling_qwen2_vl.py:459: in Qwen2VLIntegrationTest
    def extract_vision_info(self, conversations: list[dict] | list[list[dict]]) -> list[dict]:
E   TypeError: 'type' object is not subscriptable
=========================== short test summary info ============================
ERROR tests/models/qwen2_vl/test_modeling_qwen2_vl.py - TypeError: 'type' object is not subscriptable
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 2.28s ===============================

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

Great that we caught it here!

Yes, we have to modify it, as we are still running with Python 3.8 (this will change in about 2 months).

Could you try something like

List[Dict]

List[List[Dict]]

where they are imported like

from typing import Dict, List

(you can search the codebase to see some such usages)

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

You probably also need Union.
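
Putting the two suggestions together, a Python 3.8-compatible version of the failing annotation would look roughly like this (sketched against the signature from the error above):

    from typing import Dict, List, Union

    def extract_vision_info(
        self, conversations: Union[List[Dict], List[List[Dict]]]
    ) -> List[Dict]:
        ...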

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

You probably also need Union.

@ydshieh Thank you, I have adapted the code to be compatible with Python 3.8.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

You probably also need Union.

@ydshieh Hi, could you please start the Action when it's convenient for you?

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

Okay, I've submitted it. Is there anything else that needs to be done?

Not at this moment :-) Have to wait for CI's results.

@ydshieh Hi, the CI results are out. What else needs to be done to merge the code? Do I need to submit another empty commit with the message "[run-slow] qwen2_vl"?

@ydshieh ydshieh (Collaborator) commented Sep 5, 2024

Oh, yes. You need a commit with that message.

As you can see, so far it is:

PR slow CI / Run all tests for the model (pull_request) Skipped

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

Oh, yes. You need a commit with that message.

As you can see, so far it is:

PR slow CI / Run all tests for the model (pull_request) Skipped

Okay, thanks

@zucchini-nlp zucchini-nlp (Member) commented

We'll merge #33161 soon, which should add more tests and fix slow CI. You can then rebase main to get a green CI :)

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

We'll merge #33161 soon, which should add more tests and fix slow CI. You can then rebase main to get a green CI :)

Okay, thank you. I've seen their modifications and commits. The test code has resolved the precision and OOM (Out of Memory) issues. My modifications to the test code should be unnecessary now.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

We'll merge #33161 soon, which should add more tests and fix slow CI. You can then rebase main to get a green CI :)

@zucchini-nlp Hello, I've modified and resolved the conflicting code. Could you please trigger the action and merge it when convenient? Thank you!

@zucchini-nlp zucchini-nlp (Member) left a comment

Thanks for fixing! Will merge shortly; can you rebase main, please?

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

@zucchini-nlp

Don't forget to request a commit to trigger the slow CI (and approve the run) before merging 🙏

@GeLee-Q GeLee-Q force-pushed the fix-a-bug-for-qwen2vl branch from 1bfdbb7 to 48fc376 on September 6, 2024 at 08:37
@zucchini-nlp zucchini-nlp (Member) commented

Yep, sure. @GeLee-Q, can you tag me when you're done rebasing/testing and add a final commit with the [run-slow] qwen2_vl message? I'll approve the slow CI run.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 6, 2024

@zucchini-nlp Hi! I've completed the rebase and pushed the [run-slow] commit for qwen2_vl. The branch is ready for your review and approval of the slow CI run. Let me know if you need anything else.

@zucchini-nlp zucchini-nlp (Member) commented

Tests for sdpa are failing in the multi-GPU setting, but from the logs the diff seems to be around 1e-03. The error doesn't seem to be caused by this PR, and the tests pass on a single GPU. I think we can merge; what do you say, @ydshieh?

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

I am OK with it, but let me check this test against the main branch first.

I'll come back here with an update later.

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

I am running on 1759bb9126e59405f58693a17ef9f58040c2008b (main), which is the base of this PR.

The test passes there:

https://github.com/huggingface/transformers/actions/runs/10737126379/job/29777960925

Not sure if it's flaky. We can re-trigger CI here.

@zucchini-nlp zucchini-nlp (Member) commented

@ydshieh Yes, apparently it is passing, and the CI is green now. Do you want me to re-trigger the CI?

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

No, in this case, merge is fine :-)

@zucchini-nlp zucchini-nlp merged commit 2b18354 into huggingface:main Sep 6, 2024
17 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* add self.head_dim for VisionAttention in Qwen2-VL

* add self.head_dim for VisionAttention in Qwen2-VL

* fix ci

* black the test_modeling_qwen2_vl.py

* use ruff to format test_modeling_qwen2_vl.py

* [run-slow] qwen2_vl

* use tying for python3.8

* fix the import format

* use ruff to fix the ci error I001

* [run-slow] qwen2_vl

* remove unused import

* commit for rebase

* use ruff fix ci

* [run-slow] qwen2_vl

---------

Co-authored-by: root <liji>