LLaVa Left Padding Got Weird Results #28184

Closed · Fixed by #29389
SeungyounShin opened this issue Dec 21, 2023 · 12 comments
Comments

@SeungyounShin

System Info

Reproduce:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf").to(
    "cuda"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
for key in inputs:
    inputs[key] = inputs[key].to("cuda")
    print(key, inputs[key].shape)

# Generate
generate_ids = model.generate(**inputs, max_length=512)
outputs = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(outputs)

This will output:

["\n \nUSER: What's the the difference of two images?\nASSISTANT: In the two images, the primary difference is the presence of a flower in the dog's mouth. In the first image, the dog is holding a flower in its mouth, while in the second image, the dog is not holding a flower. This subtle change in the scene highlights the dog's interaction with the flower, and it may evoke different emotions or interpretations depending on the viewer's perspective.", '\nUSER: Describe the image.\nASSISTANT: The dog is a \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\nUSER: Describe the image.\nASSISTANT: The \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nЪ schließ']

I checked that the images are placed correctly, but for batch items 2 and 3 the input consists of a lot of left padding (False x 583):

[False x 583, False, True x 576, False, False, False, False, False, False, False, False, False, False, False, False, False, False]

I guess llava never saw this kind of prefix during the training phase, which would explain the weird behavior.
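
For reference, a minimal sketch to inspect how much left padding each row receives at the processor level, reusing `inputs` from the snippet above; note the expanded mask covering the 576 image-patch positions is only built later inside the model, so this only shows the text-level padding.

# Minimal sketch (assumes `inputs` from the reproduction above): show how much
# left padding the tokenizer added to each row before the image features are
# merged inside the model.
for i, mask in enumerate(inputs["attention_mask"]):
    n_pad = int((mask == 0).sum())
    print(f"row {i}: total length {mask.numel()}, left-padding tokens {n_pad}")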

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Stated above.

Expected behavior

skip

@amyeroberts
Collaborator

cc @younesbelkada @ArthurZucker

@younesbelkada
Contributor

hi @SeungyounShin
What transformers version are you using?
In the first input, prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:", you passed two images; note that multi-image queries are not well supported for Llava-like models, as they were not explicitly trained for that according to the authors.

@younesbelkada
Contributor

btw you can also do inputs = inputs.to("cuda")
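
For reference, a minimal sketch of that shortcut; the processor returns a BatchFeature, which implements .to(), so the per-key loop in the original snippet is not needed:

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
).to("cuda")  # moves every tensor in the BatchFeature to the GPU in one call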

@SeungyounShin
Author

SeungyounShin commented Dec 21, 2023

I am currently using 4.37.0.dev0

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\n<image>\nUSER: Describe the two images.\nASSISTANT:"
# prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

This will output:

 [1]
USER: What's the the difference of two images?
ASSISTANT: In the two images, the primary difference is the presence of a flower in the dog's mouth. In the first image, the dog is holding a flower in its mouth, while in the second image, the dog is not holding a flower. This subtle change in the scene highlights the dog's interaction with the flower, and it may evoke different emotions or interpretations depending on the viewer's perspective.

 [2]
USER: Describe the two images.
ASSISTANT: The two images show a cute brown and white dog standing on a grassy hill. In one image, the dog is holding a green leaf in its mouth, while in the other, it is holding a yellow flower. Both images capture the dog's playful and curious nature as it interacts with its surroundings.

The implementation appears to be functioning correctly. Upon reviewing, I noticed that the final embedding effectively supports multiple images.

@SeungyounShin
Author

SeungyounShin commented Dec 21, 2023

Regarding modeling_llava.py#L304: is this expected behavior?

Considering the relationship between image patches: specifically, if image patch 100 attends to image patch 84, it appears there shouldn't be any issue. I haven't come across any mention of masking related to image patches in the LLaVa paper. Is this approach used in the official implementation of LLaVa?

It would be beneficial to have an example of fine-tuning with multiple images. Would you be open to accepting a Pull Request (PR) that includes an example of fine-tuning on multiple images?

@younesbelkada
Contributor

Hi @SeungyounShin
Indeed, it seems you are correct: despite the model not being explicitly trained for this, it performs well on some examples as you shared, which is very nice! cc @haotian-liu for visibility!
I suspect something is off with SDPA (torch.nn.functional.scaled_dot_product_attention not being able to deal with arbitrary attention masks). I need some time to properly investigate how to fix this. Meanwhile you can do two things:
1- Use the eager attention implementation:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

- model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf").to(
+ model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", attn_implementation="eager").to(
    "cuda"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
for key in inputs:
    inputs[key] = inputs[key].to("cuda")
    print(key, inputs[key].shape)

# Generate
generate_ids = model.generate(**inputs, max_length=512)
outputs = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(outputs)

2- Process the prompts one-by-one instead of performing batched generation
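
A minimal sketch of workaround 2, reusing the model, processor, prompts, and images from the snippet above; the prompt/image pairing here is an assumption based on the order used in the batched call:

# Run each prompt as a batch of size 1 so no padding is introduced.
prompt_image_pairs = [
    (prompt1, [image1, image2]),  # multi-image prompt
    (prompt2, [image1]),
    (prompt3, [image2]),
]
outputs = []
for prompt, images in prompt_image_pairs:
    single = processor(text=prompt, images=images, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**single, max_length=512)
    outputs.append(
        processor.batch_decode(
            generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
    )
print(outputs)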

cc @fxmarty as well as this is about SDPA

@fxmarty
Contributor

fxmarty commented Jan 11, 2024

@younesbelkada Is this in the end not related to SDPA?

@younesbelkada
Contributor

@fxmarty I think it is related to SDPA, as the Llava model creates a non-standard attention mask and the script fails with SDPA.

@haotian-liu
Contributor

@younesbelkada I also ran into a similar issue when I tried to implement batch inference. Do you know why it creates a non-standard attention mask? Shouldn't it theoretically use the standard autoregressive mask?

@younesbelkada
Contributor

younesbelkada commented Feb 5, 2024

@haotian-liu I think this happens when you have different numbers of images per prompt + multi-turn chat. Say you have 2 images in the first prompt and one image in the second prompt; your attention mask will look like:

[image 1] [prompt 1] [image 2] [prompt 2]
0 0 0.. 0  1 1 1 1 1 .. 1 0 0 0 ... 0 1 1 1 1 1 ... 1
[image 3] [prompt 3]
0 0 0.. 0  1 1 1 1 1 .. 1

I think the reason we are getting a non-standard attention mask for the prompt

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

is the presence of \n between the two <image> tokens in prompt1. Can you try out the following:

- prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
+ prompt1 = "<image><image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

That way the attention mask will become standard, I believe. cc @haotian-liu, what do you think?
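
A minimal sketch for checking that, after this change, the two <image> placeholders in prompt1 are adjacent in the tokenized input (so the expanded image features form one contiguous block); looking the placeholder id up via the tokenizer is an assumption about how the token is registered:

# Assumes `inputs` built with the modified prompt1 above.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
positions = (inputs["input_ids"][0] == image_token_id).nonzero().flatten().tolist()
print("image placeholder positions in prompt1:", positions)  # expect two adjacent indices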

@haotian-liu
Contributor

@younesbelkada Thank you! I thought it might be due to a different reason, as the strange behavior occurred when I previously tried to do batch inference with one image for each sample. I'll try to find another example later to see if it still exists.

@fxmarty
Contributor

fxmarty commented Mar 22, 2024

Hi, this should be fixed by #29389. Could you give a second try? Thank you for the report!
