How should vision-rope and m-rope be understood? #621

Lier007 · 2024-12-26T08:52:50Z

After reading the code, this is my understanding of vision-rope and m-rope:

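1. vision-rope inside an image: head_dim=80, freq = (0, 40, 2)  # last_dim=20, cat(hw)=40. Here the h and w frequencies are independent, and each spans the full 0-40 range.
2. Multimodal m-rope: head_dim=128, inv_freq = (0, 128, 2)  # last_dim=64, mrope-section=[16,24,24]. So the h frequencies run from 16 * 2 to 40 * 2, while the w frequencies run from 41 * 2 to 64 * 2.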
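
To make the two points above easier to check, here is a minimal sketch in plain Python. It is not taken from the model code. It assumes the standard RoPE formula `theta ** -(i / dim)` with `theta = 10000`, a vision rotary dim of `head_dim // 2 = 40`, and that `mrope_section` assigns consecutive chunks of the 64 frequencies in the order temporal, height, width. The names `inv_freq` and `section_ranges` are hypothetical helpers for illustration only.

```python
def inv_freq(dim, theta=10000.0):
    """Standard RoPE inverse frequencies: theta ** -(i / dim) for i = 0, 2, ..., dim - 2."""
    return [theta ** (-(i / dim)) for i in range(0, dim, 2)]


# Vision RoPE (assumed: head_dim = 80, rotary dim = head_dim // 2 = 40).
vision_freqs = inv_freq(40)   # 20 frequencies
h_vision = vision_freqs       # paired with the row (h) position
w_vision = vision_freqs       # same table, paired with the column (w) position
vision_width = len(h_vision) + len(w_vision)  # width after concatenating h and w

# M-RoPE (assumed: head_dim = 128, mrope_section = [16, 24, 24]).
mrope_freqs = inv_freq(128)   # 64 frequencies
mrope_section = [16, 24, 24]


def section_ranges(sections, names=("t", "h", "w")):
    """Map each axis to a half-open range [start, stop) of frequency indices."""
    ranges, start = {}, 0
    for name, size in zip(names, sections):
        ranges[name] = (start, start + size)
        start += size
    return ranges


ranges = section_ranges(mrope_section)
for name, (start, stop) in ranges.items():
    # The exponent numerator used in inv_freq is 2 * index.
    print(name, "indices", (start, stop), "exponent numerators", (2 * start, 2 * (stop - 1)))
```

Under these assumptions, h and w share one frequency table in the vision case, while in the m-rope case each axis gets its own slice, so the printed ranges can be compared directly with the boundaries written in point 2.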

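1. Is there anything wrong with the understanding above?
2. If it is roughly right: it makes sense that h and w get independently encoded frequencies inside the vision encoder, but what is the reasoning behind giving the h and w dimensions different frequencies in the multimodal case?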
