LLaVa Left Padding Got Weird Results #28184

Closed · Fixed by #29389
SeungyounShin opened this issue Dec 21, 2023 · 12 comments
Comments

@SeungyounShin

System Info

Reproduce:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf").to(
    "cuda"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
for key in inputs:
    inputs[key] = inputs[key].to("cuda")
    print(key, inputs[key].shape)

# Generate
generate_ids = model.generate(**inputs, max_length=512)
outputs = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(outputs)

This will output:

["\n \nUSER: What's the the difference of two images?\nASSISTANT: In the two images, the primary difference is the presence of a flower in the dog's mouth. In the first image, the dog is holding a flower in its mouth, while in the second image, the dog is not holding a flower. This subtle change in the scene highlights the dog's interaction with the flower, and it may evoke different emotions or interpretations depending on the viewer's perspective.", '\nUSER: Describe the image.\nASSISTANT: The dog is a \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\nUSER: Describe the image.\nASSISTANT: The \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nЪ schließ']

I checked that the images are placed correctly, but for batch items 2 and 3 the input consists of a lot of left padding (False x 583):

[False x 583, False, True x 576, False, False, False, False, False, False, False, False, False, False, False, False, False, False]

I guess llava never saw this kind of prefix during the training phase, which would explain the weird behavior.
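
For reference, a minimal sketch to inspect how much left padding each row receives at the processor level, reusing `inputs` from the snippet above; note the expanded mask covering the 576 image-patch positions is only built later inside the model, so this only shows the text-level padding.

# Minimal sketch (assumes `inputs` from the reproduction above): show how much
# left padding the tokenizer added to each row before the image features are
# merged inside the model.
for i, mask in enumerate(inputs["attention_mask"]):
    n_pad = int((mask == 0).sum())
    print(f"row {i}: total length {mask.numel()}, left-padding tokens {n_pad}")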

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Stated above.

Expected behavior

skip

@amyeroberts
Collaborator

cc @younesbelkada @ArthurZucker

@younesbelkada
Contributor

hi @SeungyounShin
What transformers version are you using?
In the first input, prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:", you passed two images; note that multi-image queries are not well supported for Llava-like models, as they were not explicitly trained for that according to the authors.

@younesbelkada
Contributor

btw you can also do inputs = inputs.to("cuda")
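
For reference, a minimal sketch of that shortcut; the processor returns a BatchFeature, which implements .to(), so the per-key loop in the original snippet is not needed:

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
).to("cuda")  # moves every tensor in the BatchFeature to the GPU in one call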

@SeungyounShin
Author

SeungyounShin commented Dec 21, 2023

I am currently using 4.37.0.dev0

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\n<image>\nUSER: Describe the two images.\nASSISTANT:"
# prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

This will output:

 [1]
USER: What's the the difference of two images?
ASSISTANT: In the two images, the primary difference is the presence of a flower in the dog's mouth. In the first image, the dog is holding a flower in its mouth, while in the second image, the dog is not holding a flower. This subtle change in the scene highlights the dog's interaction with the flower, and it may evoke different emotions or interpretations depending on the viewer's perspective.

 [2]
USER: Describe the two images.
ASSISTANT: The two images show a cute brown and white dog standing on a grassy hill. In one image, the dog is holding a green leaf in its mouth, while in the other, it is holding a yellow flower. Both images capture the dog's playful and curious nature as it interacts with its surroundings.

The implementation appears to be functioning correctly. Upon reviewing, I noticed that the final embedding effectively supports multiple images.

@SeungyounShin
Author

SeungyounShin commented Dec 21, 2023

Regarding modeling_llava.py#L304: is this expected behavior?

Considering the relationship between image patches: specifically, if image patch 100 attends to image patch 84, it appears there shouldn't be any issue. I haven't come across any mention of masking related to image patches in the LLaVa paper. Is this approach used in the official implementation of LLaVa?

It would be beneficial to have an example of fine-tuning with multiple images. Would you be open to accepting a Pull Request (PR) that includes an example of fine-tuning on multiple images?

@younesbelkada
Contributor

Hi @SeungyounShin
Indeed, it seems you are correct: despite the model not being explicitly trained for this, it performs well on some examples as you shared, which is very nice! cc @haotian-liu for visibility!
I suspect something is off with SDPA (torch.nn.functional.scaled_dot_product_attention not being able to deal with arbitrary attention masks). I need some time to properly investigate how to fix this. Meanwhile you can do two things:
1- Use the eager attention implementation:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

- model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf").to(
+ model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", attn_implementation="eager").to(
    "cuda"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
for key in inputs:
    inputs[key] = inputs[key].to("cuda")
    print(key, inputs[key].shape)

# Generate
generate_ids = model.generate(**inputs, max_length=512)
outputs = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(outputs)

2- Process the prompts one-by-one instead of performing batched generation
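
A minimal sketch of workaround 2, reusing the model, processor, prompts, and images from the snippet above; the prompt/image pairing here is an assumption based on the order used in the batched call:

# Run each prompt as a batch of size 1 so no padding is introduced.
prompt_image_pairs = [
    (prompt1, [image1, image2]),  # multi-image prompt
    (prompt2, [image1]),
    (prompt3, [image2]),
]
outputs = []
for prompt, images in prompt_image_pairs:
    single = processor(text=prompt, images=images, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**single, max_length=512)
    outputs.append(
        processor.batch_decode(
            generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
    )
print(outputs)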

cc @fxmarty as well as this is about SDPA

@fxmarty
Contributor

fxmarty commented Jan 11, 2024

@younesbelkada Is this in the end not related to SDPA?

@younesbelkada
Contributor

@fxmarty I think it is related to SDPA, as the Llava model creates a non-standard attention mask and the script fails with SDPA.

@haotian-liu
Contributor

@younesbelkada I also ran into a similar issue when I tried to implement batch inference. Do you know why it creates a non-standard attention mask? Shouldn't it theoretically use the standard autoregressive mask?

@younesbelkada
Contributor

younesbelkada commented Feb 5, 2024

@haotian-liu I think this happens when you have different numbers of images per prompt + multi-turn chat. Say you have 2 images in the first prompt and one image in the second prompt; your attention mask will look like:

[image 1] [prompt 1] [image 2] [prompt 2]
0 0 0.. 0  1 1 1 1 1 .. 1 0 0 0 ... 0 1 1 1 1 1 ... 1
[image 3] [prompt 3]
0 0 0.. 0  1 1 1 1 1 .. 1

I think the reason we are getting a non-standard attention mask for the prompt

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

is the presence of \n between the two <image> tokens in prompt1. Can you try out the following:

- prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
+ prompt1 = "<image><image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

That way the attention mask will become standard, I believe. cc @haotian-liu, what do you think?
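
A minimal sketch for checking that, after this change, the two <image> placeholders in prompt1 are adjacent in the tokenized input (so the expanded image features form one contiguous block); looking the placeholder id up via the tokenizer is an assumption about how the token is registered:

# Assumes `inputs` built with the modified prompt1 above.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
positions = (inputs["input_ids"][0] == image_token_id).nonzero().flatten().tolist()
print("image placeholder positions in prompt1:", positions)  # expect two adjacent indices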

@haotian-liu
Contributor

@younesbelkada Thank you! I thought it might be due to a different reason, as the strange behavior occurred when I previously tried to do batch inference with one image for each sample. I'll try to find another example later to see if it still exists.

@fxmarty
Contributor

fxmarty commented Mar 22, 2024

Hi, this should be fixed by #29389. Could you give a second try? Thank you for the report!
