[Bugfix] Qwen2.5_VL fix from Qwen Team #2
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
The fix is much appreciated! Going to take a look now. FYI @yixqiao
Overall LGTM! I left a few questions, so please take a look!
```diff
-second_per_grid_ts: Optional[List[float]] = None,
+video_second_per_grid_ts: Optional[List[float]] = None,
```
Is it okay if we keep it as `second_per_grid_ts`? Generally speaking, I think it's better if we use the same names as those in the output of the `Processor` class unless there's a strong reason to use a different one. What do you think?
`second_per_grid_ts` is a video-related parameter, so using `video_second_per_grid_ts` would be more appropriate from that perspective. However, if you need to maintain consistency with `transformers`, that is also acceptable.
Sounds good! I'll merge this PR and rename it afterwards!
```diff
-tokens_per_second = getattr(hf_config.vision_config,
-                            "tokens_per_second", None)
+video_tokens_per_second = getattr(hf_config.vision_config,
+                                  "tokens_per_second", 1.0)
```
Ditto
```python
# Torch-native RMSNorm, identical to Hugging Face transformers'
# Qwen2RMSNorm: the variance is computed in float32 for numerical
# stability, then the result is cast back to the input dtype.
class Qwen2RMSNorm(nn.Module):

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance +
                                                    self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

    def extra_repr(self):
        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
```
Just so I understand: are we using the torch-native RMSNorm because you observed a precision issue with our `RMSNorm` kernel?
I noticed that the original code uses vLLM's `RMSNorm`, but our native implementation, `Qwen2RMSNorm`, has already been tested on multiple datasets, and the metrics in our report were also produced with `Qwen2RMSNorm`. To avoid precision issues, I have kept the previous implementation.
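For context, a minimal, self-contained sketch (pure PyTorch, not vLLM's kernel; the sizes, dtypes, and inputs are made up) of why computing the statistics in float32, as `Qwen2RMSNorm` does, matters at bfloat16:

```python
import torch

torch.manual_seed(0)
hidden_size = 1280
eps = 1e-6
x = torch.randn(4, hidden_size, dtype=torch.bfloat16)
weight = torch.ones(hidden_size, dtype=torch.bfloat16)

# Naive path: variance and rsqrt computed entirely in bfloat16.
var_bf16 = x.pow(2).mean(-1, keepdim=True)
y_bf16 = weight * (x * torch.rsqrt(var_bf16 + eps))

# float32 path (mirrors Qwen2RMSNorm): upcast, normalize, downcast.
x32 = x.to(torch.float32)
var_f32 = x32.pow(2).mean(-1, keepdim=True)
y_f32 = (weight.to(torch.float32) *
         (x32 * torch.rsqrt(var_f32 + eps))).to(torch.bfloat16)

# float64 reference to measure the error of each path.
x64 = x.to(torch.float64)
ref = (weight.to(torch.float64) *
       (x64 * torch.rsqrt(x64.pow(2).mean(-1, keepdim=True) + eps)))

print("bf16-internal max error:",
      (y_bf16.to(torch.float64) - ref).abs().max().item())
print("fp32-internal max error:",
      (y_f32.to(torch.float64) - ref).abs().max().item())
```

On typical inputs the fp32-internal path stays much closer to the float64 reference, which is consistent with keeping `Qwen2RMSNorm` when precision is the concern.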
```diff
-second_per_grid_ts=MultiModalFieldConfig.flat(
-    "video", video_slices),
+second_per_grid_ts=MultiModalFieldConfig.batched("video"),
```
Good catch!
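For readers following along: `MultiModalFieldConfig.batched` and `MultiModalFieldConfig.flat` are real vLLM field configs, but the shapes below are invented and this reading of the change is my own, not the PR author's. A rough pure-PyTorch illustration of why a per-video scalar wants the batched layout:

```python
import torch

# Made-up request with two videos (shapes are illustrative only).
# second_per_grid_ts carries one float per video, so indexing the first
# dimension already yields one entry per video: the "batched" layout.
second_per_grid_ts = torch.tensor([0.5, 1.0])    # shape: (num_videos,)
per_video_ts = [second_per_grid_ts[i] for i in range(2)]

# pixel_values_videos, by contrast, concatenates every patch of every
# video along dim 0, so per-video items must be recovered with explicit
# slices: the "flat" layout that video_slices describes.
pixel_values_videos = torch.randn(300, 1176)     # 120 + 180 patches
video_slices = [slice(0, 120), slice(120, 300)]
per_video_patches = [pixel_values_videos[s] for s in video_slices]
print([t.shape for t in per_video_patches])      # (120, 1176), (180, 1176)

# Applying flat(..., video_slices) to the length-2 second_per_grid_ts
# tensor would slice it with patch-count ranges and misalign the
# per-video values, which is what batched("video") avoids here.
```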
This update comes from the [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) team, and the update details are as follows:

Update Contents:

1. Fix `Qwen2_5_VLPatchMerger` from `nn.SiLU()` to `nn.GELU()`: it should correctly be `nn.GELU()`. For details, see: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py#L146
2. Set the max position of `MRotaryEmbedding` to be 4 times `max_position_embeddings`, adapting it for long-duration video requests; the scale 4 is an empirical value.
3. Support the `fps` parameter, which can be passed during requests via `mm_processor_kwargs` (a sketch is shown under "Demo code" below).
4. Update the `offline_inference` code for `qwen2_5_vl`.
5. Update the `packed_modules_mapping` configuration.

Demo code:
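The demo itself did not survive into this page, so what follows is a minimal sketch of offline inference passing `fps` through `mm_processor_kwargs`; the checkpoint name, prompt format, dummy video frames, and `fps` value are assumptions rather than the PR's verbatim example:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Assumed checkpoint name; substitute your local Qwen2.5-VL path.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    max_model_len=32768,
    limit_mm_per_prompt={"video": 1},
)

# Qwen2.5-VL chat-style prompt with a video placeholder.
prompt = (
    "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
    "Describe this video.<|im_end|>\n<|im_start|>assistant\n"
)

# Dummy frames stand in for a real decoded video (frames, H, W, C).
video = np.random.randint(0, 255, (16, 224, 224, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"video": video},
        # The new per-request knob from this PR; 2.0 is an assumed value.
        "mm_processor_kwargs": {"fps": 2.0},
    },
    sampling_params=SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```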
Thanks
We thank the vLLM team for the outstanding work. If you have any questions, please feel free to contact us.