add self.head_dim for VisionAttention in Qwen2-VL #33211
Conversation
@@ -275,6 +275,7 @@ class VisionAttention(nn.Module):
     def __init__(self, dim: int, num_heads: int = 16) -> None:
         super().__init__()
         self.num_heads = num_heads
+        self.head_dim = dim // num_heads
Good. I am a bit baffled as to how this was not caught; the math.sqrt could not have run 😅
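For context, a minimal sketch of why the attribute matters; this is an illustration with assumed layer names (qkv, proj), not the exact Qwen2-VL implementation. The forward pass divides attention scores by math.sqrt(self.head_dim), so without the attribute the very first forward call would raise an AttributeError:

import math
import torch
import torch.nn as nn

class VisionAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 16) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads  # the line this PR adds

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim); the vision tower works on flattened patch sequences
        seq_len, _ = x.shape
        q, k, v = (
            self.qkv(x)
            .reshape(seq_len, 3, self.num_heads, self.head_dim)
            .permute(1, 2, 0, 3)
            .unbind(0)
        )
        # without self.head_dim, this scaling is where the forward pass fails
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(seq_len, -1)
        return self.proj(out)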
We have a few failing tests (https://github.com/huggingface/transformers/actions/runs/10656977518/job/29536379001#step:13:694), but this was not caught there either.
@require_bitsandbytes
def test_small_model_integration_test_batch_different_resolutions(self):
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", load_in_4bit=True)
> text, vision_infos = self.processor.apply_chat_template(
self.messages, tokenize=False, add_generation_prompt=True
)
E ValueError: too many values to unpack (expected 2)
This one needs to be updated.
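A sketch of what the updated call might look like, assuming apply_chat_template now returns only the prompt string (hypothetical; the actual fix may differ):

# apply_chat_template returns a single string here, so drop the tuple unpacking
text = self.processor.apply_chat_template(
    self.messages, tokenize=False, add_generation_prompt=True
)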
@ArthurZucker Hello, I found the code related to vision_infos in the file vision_process.py in QwenLM. However, the Qwen2-VL processor in transformers does not have an interface to process vision_info, so I added a function to handle this information.
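For illustration, such a helper might look roughly like the following. This is a sketch inferred from the usage shown later in the thread (vision_infos[0]["image"]), not the exact function added in this PR:

from typing import Dict, List

def extract_vision_info(messages: List[Dict]) -> List[Dict]:
    """Collect the image/video entries from a chat-style message list."""
    vision_infos = []
    for message in messages:
        content = message.get("content", [])
        if not isinstance(content, list):
            continue
        for item in content:
            # entries are assumed to look like {"type": "image", "image": "<url or path>"}
            if isinstance(item, dict) and ("image" in item or "video" in item):
                vision_infos.append(item)
    return vision_infos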
Thanks, were you able to run all the tests for this? 🤗
vision_infos = self.extract_vision_info(messages2)
image_url = vision_infos[0]["image"]
image_input2 = Image.open(requests.get(image_url, stream=True).raw)
I think the processor is supposed to be able to handle URLs or images and properly open them; it would make sense to add this if it's not currently the case, for easier usage, no? 🤗
Yes, the url/path can be handled, and it is currently handled in idefics-1. But IMO the idefics-1 design is a bit ugly and is an issue for pipelines; we'd need a better way to handle those.
The original PR for adding QwenVL had a pretty nice chat template, yet I didn't want to add the extract_vision_info logic yet, at least not before making sure it's something we can maintain easily for most VLMs.
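For reference, the kind of url/path/PIL dispatch being discussed could be sketched like this (a hypothetical helper, not an existing transformers API):

from io import BytesIO

import requests
from PIL import Image

def load_image(image) -> Image.Image:
    """Accept a PIL image, an http(s) URL, or a local file path."""
    if isinstance(image, Image.Image):
        return image
    if isinstance(image, str) and image.startswith(("http://", "https://")):
        response = requests.get(image, timeout=30)
        response.raise_for_status()
        return Image.open(BytesIO(response.content))
    return Image.open(image)  # assumed to be a local path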
By commenting out @slow and @require_bitsandbytes, I ran test_small_model_integration_test_batch_wo_image locally. However, since I used the qwen2-vl-2b model, it caused an OOM error on a single-card A800 machine. I can ensure the code's correctness up to the point of the OOM.
@ArthurZucker Hi! I ran these tests and encountered the following issues. Some test results show minor precision deviations; the unit test code is fine, but the model may need further precision alignment. For the batch inference tests, I encountered OOM issues, which required manually resizing the images; I resized them to [256, 256]. If there are no further issues, could you please merge my code?

- Qwen2VLModelTest.test_batching_equivalence
- Qwen2VLIntegrationTest.test_small_model_integration_test
- Qwen2VLIntegrationTest.test_small_model_integration_test_batch_wo_image
- Qwen2VLIntegrationTest.test_small_model_integration_test_batch: cropping the images resolved the OOM issue.
- Qwen2VLIntegrationTest.test_small_model_integration_test_batch_different_resolutions: cropping the images resolved the OOM issue.
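For reference, a resize of that sort is a one-liner on a PIL image (assuming the tests hold PIL images, as the earlier snippet suggests):

# shrink the test image to cap the number of vision tokens and peak memory
image_input = image_input.resize((256, 256))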
@ArthurZucker @zucchini-nlp Hello, I was wondering if this PR could be merged. Based on this code, I discovered a precision issue with fp16 inference. I've temporarily resolved it by modifying the source code, and I will create a separate PR to address it.
Hi! Could you push an empty commit with the message [run-slow] qwen2_vl? Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Okay, I've submitted it. Is there anything else that needs to be done?
Not at this moment :-) We have to wait for the CI results.
@ydshieh Hello, during CI I encountered an error related to Python versioning: the list[dict] syntax is only supported on Python 3.9+. This code originates from the Qwen team's source code; do you think it needs to be modified?
Great that we caught it here! Yes, we have to modify it, as we are still running with Python 3.8 (this will change in 2 months). Could you try something like List[Dict], where they are imported like from typing import Dict, List (you can search the codebase to see some such usages)?
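A sketch of the Python 3.8-compatible form (the signature is illustrative, not the exact one in this PR):

# Built-in generics like list[dict] are only valid as annotations on Python 3.9+;
# on 3.8 the equivalent types come from the typing module.
from typing import Dict, List

def extract_vision_info(messages: List[Dict]) -> List[Dict]:
    ...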
You probably also need …
@ydshieh Thank you, I have adapted the code to be compatible with Python 3.8.
@ydshieh Hi, could you please start the Action when it's convenient for you?
@ydshieh Hi, the CI results are out. What else needs to be done to merge the code? Do I need to submit an empty commit with the message "[run-slow] qwen2_vl" again?
Oh, yes. You need a commit with that message; as you can see, so far it hasn't been triggered.
Okay, thanks.
We'll merge #33161 soon, which should add more tests and fix the slow CI. You can then rebase.
Okay, thank you. I've seen their modifications and commits. The test code has resolved the precision and OOM (Out of Memory) issues. My modifications to the test code should be unnecessary now.
@zucchini-nlp Hello, I've resolved the conflicting code. Could you please trigger the action and merge it when convenient? Thank you!
Thanks for fixing! Will merge shortly; can you rebase on main please?
Don't forget to request a commit to trigger (and approve the run of) the slow CI before merging 🙏
Yep, sure. @GeLee-Q, can you tag me when you're done rebasing/testing and add the last commit with the [run-slow] qwen2_vl message?
@zucchini-nlp Hi! I've completed the rebase and pushed the [run-slow] commit for qwen2_vl. The branch is ready for your review and approval for the slow CI run. Let me know if you need anything else.
Tests for sdpa are failing in the multi-GPU setting, but from the logs it seems the diff is only marginal.
I am OK with it. But let me check this test against the main branch first; I'll come back here with an update later.
I am running it on main, and the test is passing there: https://github.com/huggingface/transformers/actions/runs/10737126379/job/29777960925. Not sure if it's flaky. We can re-trigger CI here.
@ydshieh Yes, apparently it is passing and the CI is green now. Do you want me to re-trigger CI again?
No, in this case, merging is fine :-)
* add self.head_dim for VisionAttention in Qwen2-VL
* add self.head_dim for VisionAttention in Qwen2-VL
* fix ci
* black the test_modeling_qwen2_vl.py
* use ruff to format test_modeling_qwen2_vl.py
* [run-slow] qwen2_vl
* use typing for python3.8
* fix the import format
* use ruff to fix the ci error I001
* [run-slow] qwen2_vl
* remove unused import
* commit for rebase
* use ruff fix ci
* [run-slow] qwen2_vl

Co-authored-by: root <liji>
Add self.head_dim for VisionAttention in Qwen2-VL
This PR adds the self.head_dim attribute to the VisionAttention class in the Qwen2-VL model. This addition is necessary for proper dimension calculations in the attention mechanism of the vision component.

Changes made

- Added the self.head_dim attribute to the VisionAttention class
- Set self.head_dim to the appropriate value (dim // num_heads)

Motivation

The head_dim attribute is crucial for calculating attention scores and outputs correctly. Its addition ensures that the vision attention mechanism in Qwen2-VL operates as intended, maintaining consistency with the model's architecture.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@simonJJJ @zucchini-nlp @ArthurZucker