
add self.head_dim for VisionAttention in Qwen2-VL #33211

Merged: 14 commits merged into huggingface:main on Sep 6, 2024

Conversation

@GeLee-Q GeLee-Q (Contributor) commented Aug 30, 2024

Add self.head_dim for VisionAttention in Qwen2-VL

This PR adds the self.head_dim attribute to the VisionAttention class in the Qwen2-VL model. This addition is necessary for proper dimension calculations in the attention mechanism of the vision component.

Changes made

  • Added self.head_dim attribute to the VisionAttention class
  • Initialized self.head_dim with the appropriate value

Motivation

The head_dim attribute is crucial for calculating attention scores and outputs correctly. Its addition ensures that the vision attention mechanism in Qwen2-VL operates as intended, maintaining consistency with the model's architecture.
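
For context, here is a minimal sketch of how the attribute is used. It follows the standard reshape-and-scale attention pattern and omits the rotary embeddings and attention masking of the real Qwen2-VL implementation, so treat it as an illustration rather than the actual model code:

    import math
    import torch
    import torch.nn as nn

    class VisionAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 16) -> None:
            super().__init__()
            self.num_heads = num_heads
            # The attribute added by this PR: per-head width of the hidden dimension.
            self.head_dim = dim // num_heads
            self.qkv = nn.Linear(dim, dim * 3, bias=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            seq_len, _ = hidden_states.shape
            # Split the joint QKV projection into per-head queries, keys, and values.
            q, k, v = (
                self.qkv(hidden_states)
                .reshape(seq_len, 3, self.num_heads, self.head_dim)
                .permute(1, 2, 0, 3)
                .unbind(0)
            )
            # Without self.head_dim, this scaling line raises AttributeError.
            attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
            attn = attn.softmax(dim=-1)
            out = (attn @ v).transpose(0, 1).reshape(seq_len, -1)
            return self.proj(out)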

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [✅] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [✅] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@simonJJJ @zucchini-nlp @ArthurZucker

@@ -275,6 +275,7 @@ class VisionAttention(nn.Module):
     def __init__(self, dim: int, num_heads: int = 16) -> None:
         super().__init__()
         self.num_heads = num_heads
+        self.head_dim = dim // num_heads
Collaborator

good, I am a bit baffled as to how this was not caught, the math.sqrt could not have run 😅
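
For readers wondering how this slipped through: any forward pass that scales by math.sqrt(self.head_dim), as in the sketch above, fails immediately when the attribute is missing. A minimal reproduction (illustrative, not the real class):

    import math
    import torch.nn as nn

    class Broken(nn.Module):
        def __init__(self, dim: int, num_heads: int = 16) -> None:
            super().__init__()
            self.num_heads = num_heads  # head_dim is never set

    # Accessing Broken(1280).head_dim raises:
    # AttributeError: 'Broken' object has no attribute 'head_dim'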

Collaborator

We have a few failing tests: https://github.com/huggingface/transformers/actions/runs/10656977518/job/29536379001#step:13:694 but this was not caught.

     @require_bitsandbytes
    def test_small_model_integration_test_batch_different_resolutions(self):
        model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", load_in_4bit=True)
>       text, vision_infos = self.processor.apply_chat_template(
            self.messages, tokenize=False, add_generation_prompt=True
        )
E       ValueError: too many values to unpack (expected 2)

this one needs to be updated
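
Presumably the fix is to stop unpacking two values, since apply_chat_template returns a single formatted string here:

    # apply_chat_template returns one string, not a (text, vision_infos) pair.
    text = self.processor.apply_chat_template(
        self.messages, tokenize=False, add_generation_prompt=True
    )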

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 2, 2024

@ArthurZucker Hello, I found the code related to vision_infos in the file vision_process.py in the QwenLM repository. However, the Qwen2-VL processor in transformers does not have an interface to process vision_info, so I added a function to handle it.
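
For reference, a sketch of what such a helper could look like, loosely following extract_vision_info from QwenLM's vision_process.py; the exact function added in this PR may differ (note the list[dict] annotations, which become relevant further down):

    def extract_vision_info(
        self, conversations: list[dict] | list[list[dict]]
    ) -> list[dict]:
        """Collect image/video entries from chat-format messages."""
        vision_infos = []
        # Treat a single conversation as a batch of one.
        if conversations and isinstance(conversations[0], dict):
            conversations = [conversations]
        for conversation in conversations:
            for message in conversation:
                if isinstance(message.get("content"), list):
                    for item in message["content"]:
                        if item.get("type") in ("image", "image_url", "video"):
                            vision_infos.append(item)
        return vision_infos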

@ArthurZucker ArthurZucker (Collaborator) left a comment

Thanks, were you able to run all the tests for this? 🤗

Comment on lines 434 to 436
vision_infos = self.extract_vision_info(messages2)
image_url = vision_infos[0]["image"]
image_input2 = Image.open(requests.get(image_url, stream=True).raw)
Collaborator

I think the processor is supposed to be able to handle URLs or images and open them properly; if that's not currently the case, it would make sense to add it for easier usage, no? 🤗

Member

Yes, the url/path can be handled, and it is currently handled in idefics-1. But imo the idefics-1 design is a bit ugly and is an issue for pipelines; we'd need a better way to handle those.

The original PR for adding QwenVL had a pretty nice chat template, yet I didn't want to add extract_vision_info yet, at least not before making sure it's something we can maintain easily for most VLMs.
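
As a self-contained illustration of the kind of input normalization being discussed (transformers has its own image-loading utilities; this is just the idea, not an existing API), a processor could accept URLs, local paths, or already-open images with something like:

    import os
    import requests
    from PIL import Image

    def open_image(image) -> Image.Image:
        # Accept an already-open PIL image, a local file path, or a URL.
        if isinstance(image, Image.Image):
            return image
        if isinstance(image, str) and image.startswith(("http://", "https://")):
            return Image.open(requests.get(image, stream=True).raw)
        if isinstance(image, str) and os.path.isfile(image):
            return Image.open(image)
        raise ValueError(f"Unsupported image input: {image!r}")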

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 2, 2024

Thanks, were you able to run all the tests for this? 🤗

By commenting out @slow and @require_bitsandbytes, I ran test_small_model_integration_test_batch_wo_image locally. However, since I used the qwen2-vl-2b model, it caused an OOM error on a single-card A800 machine. I verified the code runs correctly up to output = model.generate(**inputs, max_new_tokens=30). Tomorrow I will test again using Qwen2-VL-2B-Instruct-GPTQ-Int4.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 3, 2024

Thanks, were you able to run all the tests for this? 🤗

@ArthurZucker Hi! I ran these tests and encountered the following issues. Some test results show minor precision deviations; the unit test code is fine, but the model may need further precision alignment. For the batch inference tests I hit OOM issues, which required manually resizing the images; I resized them to [256, 256]. If there are no further issues, could you please merge my code?

Qwen2VLModelTest.test_batching_equivalence

transformers/tests/test_modeling_common.py:735: in recursive_check
    self.assertTrue(
E   AssertionError: tensor(False, device='cuda:0') is not true : Batched and Single row outputs are not equal in Qwen2VLForConditionalGeneration 
for key=logits. Difference=0.0031325221061706543.

Qwen2VLIntegrationTest.test_small_model_integration_test

# for the pixel values
assert torch.allclose(expected_pixel_slice, inputs.pixel_values[:6, :3], atol=1e-3)
E       assert False
E        +  where False = <built-in method allclose of type object at 0x7f766a7242e0>(tensor([[0.8501, 0.8647, 0.8647],\n        [1.0106, 1.0106, 1.0252],\n        [0.9960, 1.0106, 1.0252],\n        [1.0982, 1.1128, 1.1274],\n        [1.0836, 1.0982, 1.0982],\n        [1.1858, 1.1858, 1.1858]]), tensor([[0.8501, 0.8501, 0.8647],\n        [0.9376, 0.9376, 0.9376],\n        [0.9084, 0.9376, 0.9376],\n        [1.0252, 1.0252, 1.0544],\n        [1.0252, 1.0252, 1.0252],\n        [1.0836, 1.0836, 1.0836]]), atol=0.001)
E        +    where <built-in method allclose of type object at 0x7f766a7242e0> = torch.allclose

# for the LLM output
E       AssertionError: 'syst[60 chars]this?\nassistant\nThe dog in the picture appea[117 chars]ices' != 'syst[60 chars]this?assistant\nThe dog in the picture appears[106 chars]ure,'
E       Diff is 686 characters long. Set self.maxDiff to None to see it.

Qwen2VLIntegrationTest.test_small_model_integration_test_batch_wo_image

EXPECTED_DECODED_TEXT = [
"system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?assistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and outgoing personalities, as well as their",
"system\nYou are a helpful assistant.user\nWho are you?assistant\nI am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with various tasks and answer a wide range of questions to",
]
output_text :['system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.']

E       AssertionError: Lists differ: ['sys[61 chars]this?\nassistant\nThe dog in the picture appea[262 chars]en.'] != ['sys[61 chars]this?assistant\nThe dog in the picture appears[320 chars] to']
E       
E       First differing element 0:
E       'syst[60 chars]this?\nassistant\nThe dog in the picture appea[117 chars]ices'
E       'syst[60 chars]this?assistant\nThe dog in the picture appears[108 chars]heir'
E       
E       Diff is 1060 characters long. Set self.maxDiff to None to see it.

Qwen2VLIntegrationTest.test_small_model_integration_test_batch

By cropping the images, the OOM (Out of Memory) issue has been resolved.

E       AssertionError: Lists differ: ['sys[61 chars]this?\nassistant\nThis is a golden retriever.'[225 chars]ith'] != ['sys[61 chars]this?assistant\nThe dog in the picture appears[326 chars]re,']
E       
E       First differing element 0:
E       'syst[60 chars]this?\nassistant\nThis is a golden retriever.'
E       'syst[60 chars]this?assistant\nThe dog in the picture appears[106 chars]ure,'
E       
E       Diff is 879 characters long. Set self.maxDiff to None to see it.

Qwen2VLIntegrationTest.test_small_model_integration_test_batch_different_resolutions

By cropping the images, the OOM (Out of Memory) issue has been resolved.

E       AssertionError: Lists differ: ['sys[61 chars]this?\nassistant\nThe dog in the picture appea[149 chars]dor'] != ['sys[61 chars]this?assistant\nThe dog in the picture appears[249 chars]en.']
E       
E       First differing element 0:
E       'syst[60 chars]this?\nassistant\nThe dog in the picture appea[15 chars]ador'
E       'syst[60 chars]this?assistant\nThe dog in the picture appears[106 chars]ure,'
E       
E       Diff is 806 characters long. Set self.maxDiff to None to see it.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

@ArthurZucker @zucchini-nlp Hello, I was wondering if this PR could be merged. Based on this code, I discovered a precision issue with fp16 inference. I've temporarily resolved this issue by modifying the source code, and I will create a separate PR to address this problem.

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

Hi!

Could you push an empty commit with message

[run-slow] qwen2_vl

Thanks!
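
For anyone following along, such an empty commit can be created with plain git:

    git commit --allow-empty -m "[run-slow] qwen2_vl"
    git push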

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

Hi!

Could you push an empty commit with message

[run-slow] qwen2_vl

Thanks!

Okay, I've submitted it. Is there anything else that needs to be done?

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

Okay, I've submitted it. Is there anything else that needs to be done?

Not at this moment :-) Have to wait for CI's results.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

Okay, I've submitted it. Is there anything else that needs to be done?

Not at this moment :-) Have to wait for CI's results.

@ydshieh Hello, during CI I encountered an error related to Python versioning: the list[dict] syntax is only supported on Python 3.9+. This code originates from the Qwen team's source code; do you think it needs to be modified?

==================================== ERRORS ====================================
_______ ERROR collecting tests/models/qwen2_vl/test_modeling_qwen2_vl.py _______
tests/models/qwen2_vl/test_modeling_qwen2_vl.py:304: in <module>
    class Qwen2VLIntegrationTest(unittest.TestCase):
tests/models/qwen2_vl/test_modeling_qwen2_vl.py:459: in Qwen2VLIntegrationTest
    def extract_vision_info(self, conversations: list[dict] | list[list[dict]]) -> list[dict]:
E   TypeError: 'type' object is not subscriptable
=========================== short test summary info ============================
ERROR tests/models/qwen2_vl/test_modeling_qwen2_vl.py - TypeError: 'type' object is not subscriptable
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 2.28s ===============================

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

Great that we caught it here!

Yes, we have to modify it, as we are still running with Python 3.8 (this will change in about 2 months).

Could you try something like

List[Dict]

List[List[Dict]]

where they are imported like

from typing import Dict, List

(you can search the codebase to see some such usages)

@ydshieh ydshieh (Collaborator) commented Sep 4, 2024

You probably also need Union.
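
Putting the two suggestions together, a Python 3.8-compatible version of the failing annotation would look roughly like this (sketched against the signature from the error above):

    from typing import Dict, List, Union

    def extract_vision_info(
        self, conversations: Union[List[Dict], List[List[Dict]]]
    ) -> List[Dict]:
        ...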

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 4, 2024

You probably also need Union.

@ydshieh Thank you, I have adapted the code to be compatible with Python 3.8.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

You probably also need Union.

@ydshieh Hi, could you please start the Action when it's convenient for you?

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

Okay, I've submitted it. Is there anything else that needs to be done?

Not at this moment :-) Have to wait for CI's results.

@ydshieh Hi, the CI results are out. What else needs to be done to merge the code? Do I need to submit another empty commit with the message "[run-slow] qwen2_vl"?

@ydshieh ydshieh (Collaborator) commented Sep 5, 2024

Oh, yes. You need a commit with that message.

As you can see, so far it is:

PR slow CI / Run all tests for the model (pull_request) Skipped

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

Oh, yes. You need a commit with that message.

As you can see, so far it is:

PR slow CI / Run all tests for the model (pull_request) Skipped

Okay, thanks

@zucchini-nlp zucchini-nlp (Member) commented

We'll merge #33161 soon, which should add more tests and fix slow CI. You can then rebase main to get a green CI :)

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

We'll merge #33161 soon, which should add more tests and fix slow CI. You can then rebase main to get a green CI :)

Okay, thank you. I've seen their modifications and commits. The test code has resolved the precision and OOM (Out of Memory) issues. My modifications to the test code should be unnecessary now.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 5, 2024

We'll merge #33161 soon, which should add more tests and fix slow CI. You can then rebase main to get a green CI :)

@zucchini-nlp Hello, I've modified and resolved the conflicting code. Could you please trigger the action and merge it when convenient? Thank you!

@zucchini-nlp zucchini-nlp (Member) left a comment

Thanks for fixing! Will merge shortly; can you rebase main, please?

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

@zucchini-nlp

Don't forget to request a commit to trigger the slow CI (and approve the run) before merging 🙏

@GeLee-Q GeLee-Q force-pushed the fix-a-bug-for-qwen2vl branch from 1bfdbb7 to 48fc376 on September 6, 2024 at 08:37
@zucchini-nlp zucchini-nlp (Member) commented

Yep, sure. @GeLee-Q, can you tag me when you're done rebasing/testing and add a final commit with the [run-slow] qwen2_vl message? I'll approve the slow CI run.

@GeLee-Q GeLee-Q (Contributor Author) commented Sep 6, 2024

@zucchini-nlp Hi! I've completed the rebase and pushed the [run-slow] commit for qwen2_vl. The branch is ready for your review and approval of the slow CI run. Let me know if you need anything else.

@zucchini-nlp zucchini-nlp (Member) commented

Tests for sdpa are failing in the multi-GPU setting, but from the logs the diff seems to be around 1e-03. The error doesn't seem to be caused by this PR, and the tests pass on a single GPU. I think we can merge; what do you say, @ydshieh?

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

I am OK with it, but let me check this test against the main branch first.

I'll come back here with an update later.

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

I am running on 1759bb9126e59405f58693a17ef9f58040c2008b (main), which is the base of this PR.

The test passes there:

https://github.com/huggingface/transformers/actions/runs/10737126379/job/29777960925

Not sure if it's flaky. We can re-trigger CI here.

@zucchini-nlp zucchini-nlp (Member) commented

@ydshieh Yes, apparently it is passing, and the CI is green now. Do you want me to re-trigger the CI?

@ydshieh ydshieh (Collaborator) commented Sep 6, 2024

No, in this case, merge is fine :-)

@zucchini-nlp zucchini-nlp merged commit 2b18354 into huggingface:main Sep 6, 2024
17 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* add self.head_dim for VisionAttention in Qwen2-VL

* add self.head_dim for VisionAttention in Qwen2-VL

* fix ci

* black the test_modeling_qwen2_vl.py

* use ruff to format test_modeling_qwen2_vl.py

* [run-slow] qwen2_vl

* use tying for python3.8

* fix the import format

* use ruff to fix the ci error I001

* [run-slow] qwen2_vl

* remove unused import

* commit for rebase

* use ruff fix ci

* [run-slow] qwen2_vl

---------

Co-authored-by: root <liji>