Fix qwen2vl float16 inference bug #33312
Conversation
Hey! This indeed looks super useful. My 2 main concerns:
- speed: at inference in fp16 this will be slower

Otherwise this sounds very much reasonable. I think llama sometimes suffers from this as well.
Are you suggesting that adding this operation will cause the inference speed to slow down in FP16? But if the precision is off, doesn't speed seem less important? I will test the full-process inference speed tomorrow to compare. Additionally, regarding llama, I will test it using float16. This is a very interesting phenomenon.
Approving, but let's make sure you actually run the slow tests! Broad support is probably more important than speed here!
Alright, I'll provide some data tomorrow.
@ArthurZucker Hello, my slow test has been completed. This operation incurs some overhead, but it's not significant. It accounts for 3.51% of the total CPU time and 0.6% of the total CUDA time across all operators. Trading this small overhead for improved accuracy is, in my opinion, acceptable overall.

Comparison results
This modification added three operations: aten::where, aten::isinf, and aten::zeros_like.
These three operations collectively consume approximately 3.51% of the total CPU time. From the nsys results, we can see that aten::where accounts for 0.6% of the total overhead among all operators.

torch profile

nsys profile
test code
change dtype
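As a rough illustration of the kind of test described above, here is a minimal sketch of a float16 generation pass wrapped in torch.profiler. The checkpoint name, prompt, and dummy image are assumptions, not the original test code:

```python
# Sketch only: profile a float16 generation pass and look for the ops added by this PR.
# Checkpoint, prompt, and dummy image are assumptions; the original test code differs.
import numpy as np
import torch
from PIL import Image
from torch.profiler import ProfilerActivity, profile
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16  # the dtype under test; swap for float32/bfloat16 to compare
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Dummy image and prompt, just to exercise the vision + text path.
image = Image.fromarray(np.zeros((448, 448, 3), dtype=np.uint8))
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=64)

# aten::isinf / aten::where / aten::zeros_like are the operators introduced by this change.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
```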
Can you provide an example of an issue with LLaMA? I've tested fp16 on my end and didn't encounter any particularly strange situations.
Branch updated from 91c8d3d to 07cde33.
@ArthurZucker @zucchini-nlp @ydshieh
Sorry for being late to re-review!
Really nice numbers, thanks for deep diving. It is completely acceptable IMO! Even more so because it's on the Qwen2VLAttention (so not Qwen2VLSdpaAttention or Flash, which would probably not suffer from this!)
Let's merge this!
@ArthurZucker Thank you for the review. The CI has finished running. Could you please merge it?
Yeah sorry for the wait, wanted to compare a bit with #33317 which is quite similar.
* fix qwen2vl float16 inference bug
* [run-slow] qwen2_vl
What does this PR do?
Fixes #33294
Inference results
Comparison of text generation across precision modes:
- fp32: memory usage 36629 MiB; generated a description of a beach scene.
- bf16: memory usage 44105 MiB; generated a similar description.
- fp16: an error occurred and only exclamation marks were generated.
- fp16 (improved version): memory usage 31609 MiB; generated a more detailed description of the beach scene.
Debug Process
Using the hook mechanism, I exported the input and output of each layer. When both the mean and sum are NaN, I consider it to be abnormal.
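A minimal sketch of that hook setup (function names and the message format are illustrative, not the original debugging script):

```python
# Sketch of the NaN-hunting hooks described above: flag any module whose output
# has a NaN mean/sum during a forward pass.
import torch

def attach_nan_hooks(model: torch.nn.Module):
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and t.is_floating_point():
                    mean, total = t.float().mean(), t.float().sum()
                    if torch.isnan(mean) or torch.isnan(total):
                        print(f"abnormal output in {name}: mean={mean}, sum={total}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle once debugging is done
```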
Thoughts on bug:
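Based on the three operators this change introduces (aten::isinf, aten::where, aten::zeros_like) and the reviewer's note that it lives in the eager Qwen2VLAttention path, the fix appears to follow the pattern sketched below; the exact placement and variable names are assumptions, see the diff for the real change:

```python
# Sketch of the pattern implied by the added aten::isinf / aten::where / aten::zeros_like ops:
# in float16, attention weights can overflow to +/-inf, which then turns into NaN downstream;
# replacing infinite entries with zero keeps the computation finite.
import torch

def sanitize_attn_weights(attn_weights: torch.Tensor) -> torch.Tensor:
    return torch.where(
        torch.isinf(attn_weights), torch.zeros_like(attn_weights), attn_weights
    )
```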
Training Comparison
This modification is also effective for training.
llama-factory
https://github.com/hiyouga/LLaMA-Factory

fp32
fp16
fp16 (this PR)
Personal Opinion on This Modification:
@ArthurZucker @zucchini-nlp @hiyouga @simonJJJ