How to reproduce LISA-Llama2-13B? #82

Closed

yxchng opened this issue Nov 13, 2023 · 13 comments
@yxchng commented Nov 13, 2023

I tried training LISA-Llama2-13B on 4x 80GB A100 GPUs with the following command:

deepspeed --master_port=24999 train_ds.py \
  --version="liuhaotian/llava-llama-2-13b-chat-lightning-preview" \
  --dataset_dir='./datasets' \
  --vision_pretrained="sam_vit_h_4b8939.pth" \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="lisa_llama2_13b_e20" \
  --epochs='20' \
  --batch_size='4'

and got the following results:

| epoch | giou | ciou |
| --- | --- | --- |
| 0 | 0.4143 | 0.5198 |
| 1 | 0.4845 | 0.5075 |
| 2 | 0.5178 | 0.5491 |
| 3 | 0.5231 | 0.5707 |
| 4 | 0.5264 | 0.5847 |
| 5 | 0.5333 | 0.5899 |
| 6 | 0.5564 | 0.5851 |
| 7 | 0.5474 | 0.5773 |
| 8 | 0.5690 | 0.6171 |
| 9 | 0.5305 | 0.5607 |
| 10 | 0.5543 | 0.5952 |
| 11 | 0.5596 | 0.6115 |
| 12 | 0.5331 | 0.5851 |
| 13 | 0.5427 | 0.5793 |
| 14 | 0.5477 | 0.5991 |
| 15 | 0.5485 | 0.5719 |
| 16 | 0.5531 | 0.6029 |
| 17 | 0.5552 | 0.5952 |
| 18 | 0.5589 | 0.5998 |
| 19 | 0.5588 | 0.5958 |

which is a far cry from the results reported in the paper:

| model | giou | ciou |
| --- | --- | --- |
| LISA-Llama2-13B | 0.600 | 0.678 |

Does LISA-Llama2-13B use different hyper-parameters? What am I doing wrong? How can I reproduce the LISA-Llama2-13B results?

@X-Lai (Contributor) commented Nov 14, 2023

Hi, I think your setting is the same as mine. The gap may be due to high variance on the validation set. We will soon release the test set (which is larger than the val set), and then you can evaluate on it.

@baoxiaoyi commented:

> Hi, I think your setting is the same as mine. The gap may be due to high variance on the validation set. We will soon release the test set (which is larger than the val set), and then you can evaluate on it.

I noticed that your code evaluates on all 200 images in the validation set, so I don't understand the "high variance" you mention here. How would it explain the failure to reproduce the results? (I trained the 7B-v0-ft model and got giou 0.408 / ciou 0.435, which is lower than the giou 0.529 / ciou 0.54 reported in your paper.)

@X-Lai (Contributor) commented Dec 6, 2023

@baoxiaoyi Hi, other issues (#41) have reported that the 7B-v0-ft results can be reproduced successfully. Have you strictly followed the instructions in the README?

X-Lai closed this as completed Dec 6, 2023
X-Lai reopened this Dec 6, 2023
@baoxiaoyi commented:

> @baoxiaoyi Hi, other issues (#41) have reported that the 7B-v0-ft results can be reproduced successfully. Have you strictly followed the instructions in the README?

The only differences are:
1. I directly used the newest LLaVA-1.5, which is based on Vicuna v1.5.
2. I didn't use flash-attn due to my current CUDA version.

I thought the first point might lead to a better result than LLaVA-v1-lightning. Have you tried that? I will also downgrade to the older LLaVA and try again.

@X-Lai (Contributor) commented Dec 6, 2023

LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.

The use of flash-attn should not affect the final performance.
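For readers following this adaptation: the 256 and 576 figures come straight from the vision encoder's patch grid, since the image token length is the number of ViT patches. A minimal sketch (illustrative only, not code from the LISA repo; it assumes the standard patch size of 14 used by both clip-vit-large-patch14 variants):

```python
# Illustrative: image token length = number of ViT patches in the input image.
def image_token_length(image_resolution: int, patch_size: int = 14) -> int:
    patches_per_side = image_resolution // patch_size
    return patches_per_side ** 2

print(image_token_length(224))  # 256 -> default (LLaVA-v1-lightning, clip-vit-large-patch14)
print(image_token_length(336))  # 576 -> LLaVA-1.5 (clip-vit-large-patch14-336)
```

One practical consequence: the image prefix grows by 576 - 256 = 320 positions, so whatever max context length was sufficient at 224px needs roughly that much extra headroom at 336px.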

X-Lai closed this as completed Dec 6, 2023
@baoxiaoyi commented:

> LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
>
> The use of flash-attn should not affect the final performance.

Do I understand your reply correctly?
1. "Max context length" refers to args.model_max_length, which should be increased so that the whole sequence (including the image tokens) fits.
2. The CLIP model should be changed to "clip-vit-large-patch14-336".
3. "Image token length" refers to args.out_dim.
4. The above details are everything that needs to change for LLaVA-1.5.

@BinZhu-ece commented:

> Hi, I think your setting is the same as mine. The gap may be due to high variance on the validation set. We will soon release the test set (which is larger than the val set), and then you can evaluate on it.

Excellent work! When will you release the test set?

@Amark-cheey commented:

> > @baoxiaoyi Hi, other issues (#41) have reported that the 7B-v0-ft results can be reproduced successfully. Have you strictly followed the instructions in the README?
>
> The only differences are: 1. I directly used the newest LLaVA-1.5, which is based on Vicuna v1.5. 2. I didn't use flash-attn due to my current CUDA version. I thought the first point might lead to a better result than LLaVA-v1-lightning. Have you tried that? I will also downgrade to the older LLaVA and try again.

Have you successfully run the code for LLaVA-1.5?

@Amark-cheey commented:

> LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
>
> The use of flash-attn should not affect the final performance.

I used these settings with LLaVA-1.5, but I still get an error in part of the configuration. May I ask for some guidance?

pred_embeddings = last_hidden_state[seg_token_mask]
[rank0]: IndexError: The shape of the mask [8, 348] at index 1 does not match the shape of the indexed tensor [8, 668, 336] at index 1

@bxhsort commented Nov 21, 2024

> > LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
> >
> > The use of flash-attn should not affect the final performance.
>
> I used these settings with LLaVA-1.5, but I still get an error in part of the configuration. May I ask for some guidance? pred_embeddings = last_hidden_state[seg_token_mask] [rank0]: IndexError: The shape of the mask [8, 348] at index 1 does not match the shape of the indexed tensor [8, 668, 336] at index 1

I tried changing 255 to 575, and it runs successfully.
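To make the 255 to 575 change concrete: the shapes in the reported IndexError are consistent with the [SEG] mask still being padded for the old 256-token image prefix, since the gap between 348 and 668 is exactly 576 - 256 = 320. Below is a hypothetical reconstruction in PyTorch; the variable names and padding scheme are assumptions for illustration, not the exact LISA code, and 255/575 presumably appear in the code instead of 256/576 because the token ids are shifted by one position:

```python
import torch

batch, hidden = 8, 336        # hidden size copied from the reported error; its value is irrelevant here
text_len = 92                 # text-token positions in the mask (348 - 256 in the reported error)
old_image_tokens = 256        # 224px CLIP (LLaVA-v1-lightning)
new_image_tokens = 576        # 336px CLIP (LLaVA-1.5)

# Mask over the text tokens, marking where the [SEG] token sits (illustrative):
seg_token_mask = torch.zeros(batch, text_len, dtype=torch.bool)
seg_token_mask[:, -1] = True

# The hidden states already contain the 576 expanded image tokens: 92 + 576 = 668.
last_hidden_state = torch.randn(batch, text_len + new_image_tokens, hidden)

# Padding the mask with the old count reproduces the error (92 + 256 = 348 vs 668):
# bad_mask = torch.cat([torch.zeros(batch, old_image_tokens, dtype=torch.bool), seg_token_mask], dim=1)
# last_hidden_state[bad_mask]  # IndexError: mask length 348 does not match tensor length 668

# Padding with the new count makes the lengths agree:
good_mask = torch.cat([torch.zeros(batch, new_image_tokens, dtype=torch.bool), seg_token_mask], dim=1)
pred_embeddings = last_hidden_state[good_mask]
print(pred_embeddings.shape)  # torch.Size([8, 336]) -> one embedding per [SEG] token
```

If this reading is right, the constant to change is the image-token padding applied to seg_token_mask in the model code (255 to 575), not the model_max_length argument.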

@Amark-cheey commented:

> > > LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
> > >
> > > The use of flash-attn should not affect the final performance.
> >
> > I used these settings with LLaVA-1.5, but I still get an error in part of the configuration. May I ask for some guidance? pred_embeddings = last_hidden_state[seg_token_mask] [rank0]: IndexError: The shape of the mask [8, 348] at index 1 does not match the shape of the indexed tensor [8, 668, 336] at index 1
>
> I tried changing 255 to 575, and it runs successfully.

Do you mean this one: parser.add_argument("--model_max_length", default=575, type=int)?

@sjauhri commented Jan 21, 2025

Has anyone managed to solve this issue? @X-Lai @yxchng
Are there configuration changes needed to reproduce LISA-Llama2-13B?
