Trainable parameters in each stage #15

jujeongho0 · 2025-02-03T09:06:24Z

Thanks for the great work!

I have a question about trainable parameters in each stage.

Here is my understanding. Is it correct?

Stage 1: Vision Encoder + Projector
Stage 2: Vision Encoder + Projector + LLM
Stage 3: Vision Encoder + Video Compressor + Projector + LLM (I'm curious about this part.)
Stage 4: Vision Encoder + Video Compressor + Projector + LLM

Thank you.

lixin4ever · 2025-02-03T11:44:32Z

Roughly correct except for the video compressor.

The video compressor is not a trainable component; it is a deterministic, rule-based function applied to video input only. It computes the difference between consecutive patches along the temporal dimension and discards patches that are very similar to the previous ones (i.e., those whose patch difference is very small).
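
To make the rule above concrete, here is a minimal NumPy sketch of difference-based patch pruning. It is an illustration only, not the repository's implementation: the function name `prune_similar_patches`, the mean-absolute-difference metric, and the `threshold` value are all assumptions made for this example.

```python
import numpy as np


def prune_similar_patches(patches: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Return a boolean keep-mask of shape (T, N).

    patches: array of shape (T, N, D) holding T frames, N patches per frame,
    and D values per patch. The metric and threshold are assumptions.
    """
    # Every patch of the first frame is kept: it has no previous frame.
    keep = np.ones(patches.shape[:2], dtype=bool)
    # Mean absolute difference between each patch and the patch at the same
    # spatial position in the previous frame, shape (T - 1, N).
    diff = np.abs(patches[1:] - patches[:-1]).mean(axis=-1)
    # Keep a patch only if it changed by more than the threshold.
    keep[1:] = diff > threshold
    return keep


# Usage: 3 frames, 2 patches per frame, 4 values per patch.
video = np.zeros((3, 2, 4))
video[1:, 0, :] = 1.0  # patch 0 changes between frame 0 and frame 1, then stays
mask = prune_similar_patches(video)
kept_patches = video[mask]  # flattened (num_kept, D) array of surviving patches
```

Because the function contains no learnable weights, it adds nothing to the set of trainable parameters in any stage.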

dominickrei · 2025-02-09T17:52:52Z

A related question: how are these training stages integrated into the training scripts?

From what I understand, the training arguments for module-specific learning rates control which modules are trainable (train.py L608-622), but all scripts in scripts/train/ pass all three learning rates. I would appreciate any clarification if I am misunderstanding the code!
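
For readers unfamiliar with the pattern being asked about, the sketch below shows one common way per-module learning rates can decide which modules are trainable. It is hypothetical and does not describe what train.py actually does: the function `build_param_groups`, the module prefixes, and the convention that a missing or non-positive learning rate means "frozen" are all assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def build_param_groups(
    param_names: List[str],
    module_lrs: Dict[str, Optional[float]],
) -> Tuple[List[dict], List[str]]:
    """Split parameters into optimizer groups and a frozen list.

    Assumed convention: a module whose learning rate is None, missing,
    or non-positive is treated as frozen.
    """
    groups: Dict[str, dict] = {}
    frozen: List[str] = []
    for name in param_names:
        prefix = name.split(".", 1)[0]  # e.g. "llm" from "llm.layer0.weight"
        lr = module_lrs.get(prefix)
        if lr is None or lr <= 0:
            frozen.append(name)
        else:
            groups.setdefault(prefix, {"lr": lr, "params": []})["params"].append(name)
    return list(groups.values()), frozen


# Usage: a Stage-1-like setup as described in the question
# (vision encoder and projector trainable, LLM frozen).
names = ["vision_encoder.weight", "projector.weight", "llm.weight"]
lrs = {"vision_encoder": 1e-5, "projector": 1e-3, "llm": None}
param_groups, frozen_names = build_param_groups(names, lrs)
```

Under this convention, a script that passes a positive learning rate for every module would leave all of them trainable.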

lixin4ever added the "good first issue" label on Feb 4, 2025
