LLaVa Left Padding Got Weird Results #28184
hi @SeungyounShin
btw you can also to …
I am currently using:

```python
from PIL import Image
import requests
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the difference of two images?\nASSISTANT:"
prompt2 = "<image>\n<image>\nUSER: Describe the two images.\nASSISTANT:"
# prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"

url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
```

This will output:
The implementation appears to be functioning correctly. Upon reviewing, I noticed that the final embedding effectively supports multiple images.
modeling_llava.py#L304: is this expected behavior? Considering the relationship between image patches, if image patch 100 references image patch 84, it appears there shouldn't be any issue. I haven't come across any mention of masking related to image patches in the LLaVA paper. Is this approach used in the official implementation? **It would be beneficial to have an example of fine-tuning on multiple images. Would you be open to accepting a Pull Request (PR) that includes such an example?**
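For reference, here is a minimal sketch of what such a multi-image fine-tuning step could look like. It is purely illustrative (reusing the `model`, `processor`, `prompt1`, `image1`, and `image2` objects from the scripts in this thread), not an official recipe:

```python
# Hypothetical single training step on a two-image prompt (not an official example).
model.train()
batch = processor(text=prompt1, images=[image1, image2], return_tensors="pt").to("cuda")

# Naive causal-LM objective: the model predicts every token of the prompt itself.
# A real recipe would set the prompt positions in `labels` to -100 so the loss
# is only computed on the assistant's answer tokens.
labels = batch["input_ids"].clone()

outputs = model(**batch, labels=labels)
outputs.loss.backward()
```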
Hi @SeungyounShin, two possible workarounds:

1- Load the model with `attn_implementation="eager"`:

```diff
- model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf").to(
+ model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", attn_implementation="eager").to(
      "cuda"
  )
```

Full script:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", attn_implementation="eager"
).to("cuda")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"

url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
for key in inputs:
    inputs[key] = inputs[key].to("cuda")
    print(key, inputs[key].shape)

# Generate
generate_ids = model.generate(**inputs, max_length=512)
outputs = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(outputs)
```

2- Process the prompts one-by-one instead of performing batched generation (see the sketch below).

cc @fxmarty as well, as this is about SDPA
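A minimal sketch of workaround 2- (my own illustration, reusing the objects from the script above; the prompt-to-image pairing follows the order in which the `<image>` tokens consume the images in the batched call):

```python
# Process each prompt separately so no padding is introduced at all.
pairs = [
    (prompt1, [image1, image2]),  # two-image prompt
    (prompt2, [image1]),
    (prompt3, [image2]),
]
for prompt, images in pairs:
    single = processor(text=prompt, images=images, return_tensors="pt").to("cuda")
    out_ids = model.generate(**single, max_length=512)
    print(processor.batch_decode(out_ids, skip_special_tokens=True)[0])
```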
@younesbelkada is this, in the end, not related to SDPA?
@fxmarty I think it is related to SDPA, as the Llava model creates a non-standard attention mask and the script fails with SDPA.
@younesbelkada I also found a similar issue when I tried to implement batch inference. Do you know why it creates a non-standard attention mask? Shouldn't it theoretically use the standard autoregressive mask?
@haotian-liu I think this happens when you have different numbers of images per prompt + multi-turn chat. If, let's say, you have two images in the first prompt and one image in the second prompt, your attention mask will look like:

```
[image 1]    [prompt 1]         [image 2]     [prompt 2]
0 0 0 .. 0   1 1 1 1 1 .. 1     0 0 0 ... 0   1 1 1 1 1 ... 1

[image 3]    [prompt 3]
0 0 0 .. 0   1 1 1 1 1 .. 1
```

I think the reason we are getting a non-standard attention mask for the prompts above is the presence of the `\n` between the two `<image>` tokens in `prompt1`; you could remove it:

```diff
- prompt1 = "<image>\n<image>\nUSER: What's the difference of two images?\nASSISTANT:"
+ prompt1 = "<image><image>\nUSER: What's the difference of two images?\nASSISTANT:"
  prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
  prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
```

That way the attention mask will become standard, I believe. cc @haotian-liu what do you think?
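To see why the `\n` matters, here is a small inspection sketch of my own (assuming the `processor` loaded above; the `<image>` token id is looked up from the tokenizer rather than hard-coded):

```python
# Locate the <image> placeholder tokens in both prompt variants.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")

for p in ["<image>\n<image>\nUSER: hi", "<image><image>\nUSER: hi"]:
    ids = processor.tokenizer(p).input_ids
    positions = [i for i, tok in enumerate(ids) if tok == image_token_id]
    # With "\n" in between, the two <image> tokens are separated by text tokens,
    # so the expanded image embeddings end up in two non-contiguous blocks.
    print(repr(p), "->", positions)
```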
@younesbelkada Thank you! I thought it might be due to a different reason, as the strange behavior occurred when I previously tried to do batch inference with one image per sample. I'll try to find another example later to see if it still exists.
Hi, this should be fixed by #29389. Could you give it a second try? Thank you for the report!
System Info

Reproduction

Stated above. I checked that the images are correctly placed, but for batches 2 and 3 the sequence consists of lots of padding (False × 583):

```
[False x 583, False, True x 576, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
```

I guess LLaVA never sees this kind of prefix during the training phase, which would result in this weird behavior.
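A small sketch of my own (reusing the batched `inputs` from the reproduction script) to count the padding per batch row. Note this is the text-level mask, before the model expands each `<image>` token to 576 patch positions, so the counts are smaller than in the merged sequence above:

```python
# attention_mask is 1 for real tokens and 0 for the padding added by the processor.
attn = inputs["attention_mask"]
for i, row in enumerate(attn):
    n_pad = int((row == 0).sum())
    print(f"batch {i}: {n_pad} padding positions out of {row.numel()}")
```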
Expected behavior
skip