How to reproduce LISA-Llama2-13B? #82

Closed

yxchng opened this issue Nov 13, 2023 · 13 comments
@yxchng commented Nov 13, 2023

I tried training LISA-Llama2-13B on 4x 80GB A100 GPUs with the following command:

deepspeed --master_port=24999 train_ds.py \
  --version="liuhaotian/llava-llama-2-13b-chat-lightning-preview" \
  --dataset_dir='./datasets' \
  --vision_pretrained="sam_vit_h_4b8939.pth" \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="lisa_llama2_13b_e20" \
  --epochs='20' \
  --batch_size='4'

and got the following results:

| epoch | giou | ciou |
| --- | --- | --- |
| 0 | 0.4143 | 0.5198 |
| 1 | 0.4845 | 0.5075 |
| 2 | 0.5178 | 0.5491 |
| 3 | 0.5231 | 0.5707 |
| 4 | 0.5264 | 0.5847 |
| 5 | 0.5333 | 0.5899 |
| 6 | 0.5564 | 0.5851 |
| 7 | 0.5474 | 0.5773 |
| 8 | 0.5690 | 0.6171 |
| 9 | 0.5305 | 0.5607 |
| 10 | 0.5543 | 0.5952 |
| 11 | 0.5596 | 0.6115 |
| 12 | 0.5331 | 0.5851 |
| 13 | 0.5427 | 0.5793 |
| 14 | 0.5477 | 0.5991 |
| 15 | 0.5485 | 0.5719 |
| 16 | 0.5531 | 0.6029 |
| 17 | 0.5552 | 0.5952 |
| 18 | 0.5589 | 0.5998 |
| 19 | 0.5588 | 0.5958 |

which is a far cry from the results reported in the paper:

| model | giou | ciou |
| --- | --- | --- |
| LISA-Llama2-13B | 0.600 | 0.678 |

Does LISA-Llama2-13B use different hyper-parameters? What am I doing wrong? How can I reproduce the LISA-Llama2-13B results?

@X-Lai (Contributor) commented Nov 14, 2023

Hi, I think your setting is the same as mine. The gap may be due to high variance on the validation set. We will soon release the test set (which is larger than the val set), and then you can evaluate on it.

@baoxiaoyi commented:

> Hi, I think your setting is the same as mine. The gap may be due to high variance on the validation set. We will soon release the test set (which is larger than the val set), and then you can evaluate on it.

I noticed that your code evaluates on all 200 images in the validation set, so I don't understand the "high variance" you mention here. How would it explain the failure to reproduce the results? (I trained the 7B-v0-ft model and got giou 0.408 / ciou 0.435, which is lower than the giou 0.529 / ciou 0.54 reported in your paper.)

@X-Lai (Contributor) commented Dec 6, 2023

@baoxiaoyi Hi, other issues (#41) have reported that the 7B-v0-ft results can be reproduced successfully. Have you strictly followed the instructions in the README?

X-Lai closed this as completed Dec 6, 2023
X-Lai reopened this Dec 6, 2023
@baoxiaoyi commented:

> @baoxiaoyi Hi, other issues (#41) have reported that the 7B-v0-ft results can be reproduced successfully. Have you strictly followed the instructions in the README?

The only differences are:
1. I directly used the newest LLaVA-1.5, which is based on Vicuna v1.5.
2. I didn't use flash-attn due to my current CUDA version.

I thought the first point might lead to a better result than LLaVA-v1-lightning. Have you tried that? I will also downgrade to the older LLaVA and try again.

@X-Lai (Contributor) commented Dec 6, 2023

LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.

The use of flash-attn should not affect the final performance.
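For readers following this adaptation: the 256 and 576 figures come straight from the vision encoder's patch grid, since the image token length is the number of ViT patches. A minimal sketch (illustrative only, not code from the LISA repo; it assumes the standard patch size of 14 used by both clip-vit-large-patch14 variants):

```python
# Illustrative: image token length = number of ViT patches in the input image.
def image_token_length(image_resolution: int, patch_size: int = 14) -> int:
    patches_per_side = image_resolution // patch_size
    return patches_per_side ** 2

print(image_token_length(224))  # 256 -> default (LLaVA-v1-lightning, clip-vit-large-patch14)
print(image_token_length(336))  # 576 -> LLaVA-1.5 (clip-vit-large-patch14-336)
```

One practical consequence: the image prefix grows by 576 - 256 = 320 positions, so whatever max context length was sufficient at 224px needs roughly that much extra headroom at 336px.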

X-Lai closed this as completed Dec 6, 2023
@baoxiaoyi commented:

> LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
>
> The use of flash-attn should not affect the final performance.

Do I understand your reply correctly?
1. "Max context length" refers to args.model_max_length, which should be increased so that the whole sequence (including the image tokens) fits.
2. The CLIP model should be changed to "clip-vit-large-patch14-336".
3. "Image token length" refers to args.out_dim.
4. The above details are everything that needs to change for LLaVA-1.5.

@BinZhu-ece commented:

> Hi, I think your setting is the same as mine. The gap may be due to high variance on the validation set. We will soon release the test set (which is larger than the val set), and then you can evaluate on it.

Excellent work! When will you release the test set?

@Amark-cheey commented:

> > @baoxiaoyi Hi, other issues (#41) have reported that the 7B-v0-ft results can be reproduced successfully. Have you strictly followed the instructions in the README?
>
> The only differences are: 1. I directly used the newest LLaVA-1.5, which is based on Vicuna v1.5. 2. I didn't use flash-attn due to my current CUDA version. I thought the first point might lead to a better result than LLaVA-v1-lightning. Have you tried that? I will also downgrade to the older LLaVA and try again.

Have you successfully run the code for LLaVA-1.5?

@Amark-cheey commented:

> LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
>
> The use of flash-attn should not affect the final performance.

I used these settings with LLaVA-1.5, but I still get an error in part of the configuration. May I ask for some guidance?

pred_embeddings = last_hidden_state[seg_token_mask]
[rank0]: IndexError: The shape of the mask [8, 348] at index 1 does not match the shape of the indexed tensor [8, 668, 336] at index 1

@bxhsort commented Nov 21, 2024

> > LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
> >
> > The use of flash-attn should not affect the final performance.
>
> I used these settings with LLaVA-1.5, but I still get an error in part of the configuration. May I ask for some guidance? pred_embeddings = last_hidden_state[seg_token_mask] [rank0]: IndexError: The shape of the mask [8, 348] at index 1 does not match the shape of the indexed tensor [8, 668, 336] at index 1

I tried changing 255 to 575, and it runs successfully.
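To make the 255 to 575 change concrete: the shapes in the reported IndexError are consistent with the [SEG] mask still being padded for the old 256-token image prefix, since the gap between 348 and 668 is exactly 576 - 256 = 320. Below is a hypothetical reconstruction in PyTorch; the variable names and padding scheme are assumptions for illustration, not the exact LISA code, and 255/575 presumably appear in the code instead of 256/576 because the token ids are shifted by one position:

```python
import torch

batch, hidden = 8, 336        # hidden size copied from the reported error; its value is irrelevant here
text_len = 92                 # text-token positions in the mask (348 - 256 in the reported error)
old_image_tokens = 256        # 224px CLIP (LLaVA-v1-lightning)
new_image_tokens = 576        # 336px CLIP (LLaVA-1.5)

# Mask over the text tokens, marking where the [SEG] token sits (illustrative):
seg_token_mask = torch.zeros(batch, text_len, dtype=torch.bool)
seg_token_mask[:, -1] = True

# The hidden states already contain the 576 expanded image tokens: 92 + 576 = 668.
last_hidden_state = torch.randn(batch, text_len + new_image_tokens, hidden)

# Padding the mask with the old count reproduces the error (92 + 256 = 348 vs 668):
# bad_mask = torch.cat([torch.zeros(batch, old_image_tokens, dtype=torch.bool), seg_token_mask], dim=1)
# last_hidden_state[bad_mask]  # IndexError: mask length 348 does not match tensor length 668

# Padding with the new count makes the lengths agree:
good_mask = torch.cat([torch.zeros(batch, new_image_tokens, dtype=torch.bool), seg_token_mask], dim=1)
pred_embeddings = last_hidden_state[good_mask]
print(pred_embeddings.shape)  # torch.Size([8, 336]) -> one embedding per [SEG] token
```

If this reading is right, the constant to change is the image-token padding applied to seg_token_mask in the model code (255 to 575), not the model_max_length argument.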

@Amark-cheey commented:

> > > LLaVA-1.5 uses a 336px image resolution, so you need to change the CLIP model and adjust the max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336 the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to LLaVA-1.5; you should check them carefully.
> > >
> > > The use of flash-attn should not affect the final performance.
> >
> > I used these settings with LLaVA-1.5, but I still get an error in part of the configuration. May I ask for some guidance? pred_embeddings = last_hidden_state[seg_token_mask] [rank0]: IndexError: The shape of the mask [8, 348] at index 1 does not match the shape of the indexed tensor [8, 668, 336] at index 1
>
> I tried changing 255 to 575, and it runs successfully.

Do you mean this one: parser.add_argument("--model_max_length", default=575, type=int)?

@sjauhri commented Jan 21, 2025

Has anyone managed to solve this issue? @X-Lai @yxchng
Are there configuration changes needed to reproduce LISA-Llama2-13B?
