Fix qwen2vl float16 inference bug #33312
Conversation
Hey! This indeed looks super useful. My 2 main concerns:
- speed: at inference in fp16 this will be slower

Otherwise this sounds very much reasonable. I think llama sometimes suffers from this as well.
Are you suggesting that adding this operation will cause the inference speed to slow down in FP16? But if the precision is off, doesn't speed seem less important? I will test the full-process inference speed tomorrow to compare. Additionally, regarding llama, I will test it using float16. This is a very interesting phenomenon.
Approving, but let's make sure you actually run the slow tests! Broad support is probably more important than speed here!
Alright, I'll provide some data tomorrow.
@ArthurZucker Hello, my slow test has been completed. This operation incurs some overhead, but it's not significant. It accounts for 3.51% of the total CPU time and 0.6% of the total CUDA time across all operators. Trading this small overhead for improved accuracy is, in my opinion, acceptable overall.

Comparison results
This modification added three operations: aten::where, aten::isinf, and aten::zeros_like.
These three operations collectively consume approximately 3.51% of the total CPU time. From the nsys results, we can see that aten::where accounts for 0.6% of the total overhead among all operators.

torch profile

nsys profile
test code
change dtype
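As a rough illustration of the kind of test described above, here is a minimal sketch of a float16 generation pass wrapped in torch.profiler. The checkpoint name, prompt, and dummy image are assumptions, not the original test code:

```python
# Sketch only: profile a float16 generation pass and look for the ops added by this PR.
# Checkpoint, prompt, and dummy image are assumptions; the original test code differs.
import numpy as np
import torch
from PIL import Image
from torch.profiler import ProfilerActivity, profile
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16  # the dtype under test; swap for float32/bfloat16 to compare
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Dummy image and prompt, just to exercise the vision + text path.
image = Image.fromarray(np.zeros((448, 448, 3), dtype=np.uint8))
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=64)

# aten::isinf / aten::where / aten::zeros_like are the operators introduced by this change.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
```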
Can you provide an example of an issue with LLaMA? I've tested fp16 on my end and didn't encounter any particularly strange situations.
Branch updated from 91c8d3d to 07cde33.
@ArthurZucker @zucchini-nlp @ydshieh
Sorry for being late to re-review!
Really nice numbers, thanks for deep diving. It is completely acceptable IMO! Even more so because it's on the Qwen2VLAttention (so not Qwen2VLSdpaAttention or Flash, which would probably not suffer from this!)
Let's merge this!
@ArthurZucker Thank you for the review. The CI has finished running. Could you please merge it?
Yeah sorry for the wait, wanted to compare a bit with #33317 which is quite similar.
* fix qwen2vl float16 inference bug
* [run-slow] qwen2_vl
What does this PR do?
Fixes #33294
Inference results
Comparison of text generation across precision modes:
- fp32: memory usage 36629 MiB; generated a description of a beach scene.
- bf16: memory usage 44105 MiB; generated a similar description.
- fp16: an error occurred and only exclamation marks were generated.
- fp16 (improved version): memory usage 31609 MiB; generated a more detailed description of the beach scene.
Debug Process
Using the hook mechanism, I exported the input and output of each layer. When both the mean and sum are NaN, I consider it to be abnormal.
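A minimal sketch of that hook setup (function names and the message format are illustrative, not the original debugging script):

```python
# Sketch of the NaN-hunting hooks described above: flag any module whose output
# has a NaN mean/sum during a forward pass.
import torch

def attach_nan_hooks(model: torch.nn.Module):
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and t.is_floating_point():
                    mean, total = t.float().mean(), t.float().sum()
                    if torch.isnan(mean) or torch.isnan(total):
                        print(f"abnormal output in {name}: mean={mean}, sum={total}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle once debugging is done
```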
Thoughts on bug:
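Based on the three operators this change introduces (aten::isinf, aten::where, aten::zeros_like) and the reviewer's note that it lives in the eager Qwen2VLAttention path, the fix appears to follow the pattern sketched below; the exact placement and variable names are assumptions, see the diff for the real change:

```python
# Sketch of the pattern implied by the added aten::isinf / aten::where / aten::zeros_like ops:
# in float16, attention weights can overflow to +/-inf, which then turns into NaN downstream;
# replacing infinite entries with zero keeps the computation finite.
import torch

def sanitize_attn_weights(attn_weights: torch.Tensor) -> torch.Tensor:
    return torch.where(
        torch.isinf(attn_weights), torch.zeros_like(attn_weights), attn_weights
    )
```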
Training Comparison
This modification is also effective for training.
llama-factory
https://github.com/hiyouga/LLaMA-Factory

fp32
fp16
fp16 (this PR)
Personal Opinion on This Modification:
@ArthurZucker @zucchini-nlp @hiyouga @simonJJJ