
LlavaForConditionalGeneration logit values are batch size dependent. #29327

Closed
ShahRutav opened this issue Feb 27, 2024 · 8 comments

Comments

@ShahRutav

Hi @ArthurZucker,
Thanks for the response. From the comment, the possible reasons in llama models are:

  1. Left-side padding: there is no padding in my script (a quick check is sketched after the lists below).
  2. KV caching with lower-precision operations.

I tested with float32, and my observations are as follows:

  1. No issue (mismatch < 1e-5) if I don't use images in the input and use float32 precision. This is equivalent to using only the llama model.
  2. The mismatch is very large, ~11, if I include images in the input (and it is random across batch sizes), even in float32.
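
A quick way to confirm point 1 above (an illustrative check, not part of the original script; it reuses the model id, prompt, and image URL from the script below) is to inspect the attention mask returned by the processor; with a single, un-padded prompt it should be all ones:

import requests
import torch
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", padding_side="left")
prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=[prompt], images=[image], return_tensors="pt")
# With a single prompt and no padding, every position is a real token.
print(torch.all(inputs["attention_mask"] == 1))  # tensor(True) => no padding in the batch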

I am attaching the updated script to reproduce the mismatch:

import torch
import requests
from PIL import Image
import peft
from transformers import BitsAndBytesConfig
from transformers import LlavaForConditionalGeneration, AutoProcessor

test_image = True
dtype = torch.float32
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(
    model_id,
    padding_side="left",
    dtype=dtype,
)
if test_image:
    prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"
    url = "https://www.ilankelman.org/stopsigns/australia.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
else:
    prompt = "USER: What's the content of the image?\nASSISTANT:"
    inputs = processor.tokenizer(prompt, return_tensors="pt")
# build a batch of 2 by duplicating every input tensor along the batch dimension
inputs_2 = {k: torch.cat([v, v], dim=0).clone() for k, v in inputs.items()}

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
).eval()
if dtype != torch.float32: # no need to go through this if dtype is already float32
    peft.prepare_model_for_kbit_training(model)

with torch.no_grad():
    print(inputs['input_ids'])
    model_output = model(**inputs).logits
    model_output_2 = model(**inputs_2).logits

# compare only the first element of the batch-2 output against the batch-1 output
model_output_2 = model_output_2[:model_output.shape[0]]
assert model_output.shape == model_output_2.shape, \
    f"Shapes must be same for comparing. {model_output.shape} vs {model_output_2.shape}"
assert torch.allclose(model_output, model_output_2, atol=1e-5), \
    f"Logits are not the same with maximum difference {torch.max(torch.abs(model_output - model_output_2))}.\n \
    Values:\n{model_output}\nvs\n{model_output_2}"

Originally posted by @ShahRutav in #29282 (comment)
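
For reference, a slightly more informative way to do the final comparison (a sketch, not part of the original post) is torch.testing.assert_close, which reports both absolute and relative mismatches and whose float32 defaults (rtol=1.3e-6, atol=1e-5) are stricter than the manual allclose above; compare_logits is a hypothetical helper name:

import torch

def compare_logits(a: torch.Tensor, b: torch.Tensor, rtol: float = 1e-4, atol: float = 1e-4) -> None:
    # Raises an AssertionError with a detailed mismatch report if the tensors differ too much.
    torch.testing.assert_close(a, b, rtol=rtol, atol=atol)

# Usage with the tensors from the script above:
# compare_logits(model_output, model_output_2[:model_output.shape[0]])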

@ShahRutav changed the title from "Hi @ArthurZucker," to "LlavaForConditionalGeneration logit values are batch size dependent." on Feb 27, 2024
@ArthurZucker (Collaborator)

Sorry but I cannot reproduce your error:
[screenshot of the logits comparison output]

The logits are close enough.
I used this:

import torch
import requests
from PIL import Image
import peft
from transformers import BitsAndBytesConfig
from transformers import LlavaForConditionalGeneration, AutoProcessor

test_image = True
dtype = torch.float32
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(
    model_id,
    padding_side="left",
    dtype=dtype,
)
if test_image:
    prompt = "USER: <image>\nWhat's the content of the image?\nASSISTANT:"
    url = "https://www.ilankelman.org/stopsigns/australia.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")

inputs_2 = processor(text=[prompt, prompt], images=[image, image], return_tensors="pt")

model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype,).eval()

with torch.no_grad():
    print(inputs['input_ids'])
    model_output = model(**inputs).logits
    model_output_2 = model(**inputs_2).logits

model_output_2 = model_output_2[:model_output.shape[0]]
assert model_output.shape == model_output_2.shape, "Shapes must be same for comparing. {} vs {}".format(model_output.shape, model_output_2.shape)
assert torch.allclose(model_output, model_output_2, atol=1e-5), "Logits are not the same with maximum difference {}. Values:\n{} vs\n{}".format(torch.max(torch.abs(model_output - model_output_2)), model_output, model_output_2)

The issue might be how you create your inputs.
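
For concreteness, the two scripts in this thread build the batch-of-two inputs differently; a minimal, self-contained sketch of both approaches (reusing the prompt and image from the scripts above):

import torch
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", padding_side="left")
prompt = "USER: <image>\nWhat's the content of the image?\nASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Approach A (first script): process a single example, then duplicate every tensor.
single = processor(text=[prompt], images=[image], return_tensors="pt")
duplicated = {k: torch.cat([v, v], dim=0) for k, v in single.items()}

# Approach B (second script): let the processor build the batch itself.
batched = processor(text=[prompt, prompt], images=[image, image], return_tensors="pt")

# With identical prompts there is no padding, so both approaches should produce the
# same tensors; with prompts of different lengths, approach A would skip padding entirely.
for k in duplicated:
    print(k, torch.equal(duplicated[k], batched[k]))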

@ArthurZucker (Collaborator)

Also, the format you are using is wrong.
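
This presumably refers to the prompt template: the first script places the <image> token before the USER: turn, while the second script uses the llava-1.5 template:

# Prompt from the first script (the <image> token precedes the USER: turn):
prompt_original = "<image>\nUSER: What's the content of the image?\nASSISTANT:"

# Prompt from the second script, matching the llava-1.5 template:
prompt_expected = "USER: <image>\nWhat's the content of the image?\nASSISTANT:"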

@ShahRutav (Author)

ShahRutav commented Feb 28, 2024

#29327 (comment)
Thanks for experimenting with it.

  • I copied exactly the test script you attached, and I got a maximum difference in logits of ~0.00145 (which is considerably larger than yours).
  • This difference does not change when I create the inputs differently.

Unless you are using a different transformers version, could this be due to different system specifications?
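
One way to rule out version and hardware differences (an illustrative sketch, not part of the original comment) is to print the same environment report on both machines:

import platform
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("device:", "cuda" if torch.cuda.is_available() else "cpu")
print("cpu:", platform.processor())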

@ArthurZucker (Collaborator)

I am using the latest release of transformers.

@ArthurZucker (Collaborator)

If there was a bug, it was fixed 😉

@ShahRutav (Author)


I am also using the latest transformers from the main branch. I cannot explain the difference between your execution (~1e-5) and mine (~1e-3). I'm unsure if this is relevant, but I run on the CPU since the model won't fit on my GPU without quantization, and I am avoiding quantizing the base model to isolate the different sources of error.
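
For reference, both scripts import BitsAndBytesConfig but never use it; a quantized load that fits the 7B model on a smaller GPU would look roughly like the sketch below (it requires the bitsandbytes and accelerate packages, and quantization changes the logits, so it is not suitable for this exact comparison):

import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=quant_config,
    device_map="auto",
).eval()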

@ArthurZucker (Collaborator)

1e-3 is already close enough. I don't know which CPU you are using, but I think we can agree that the outputs are not batch-dependent.
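
A minimal illustration (not from the thread, and not the Llava model itself) of why identical float32 inputs can give slightly different results at different batch sizes: batched matmuls may dispatch to different kernels or accumulation orders, and floating-point addition is not associative, so small differences can accumulate across the layers of a 7B model:

import torch

torch.manual_seed(0)
x = torch.randn(1, 4096, dtype=torch.float32)
w = torch.randn(4096, 4096, dtype=torch.float32)

single = x @ w                          # "batch size 1"
batched = torch.cat([x, x], dim=0) @ w  # "batch size 2", identical rows

# Often zero or tiny for a single matmul, but it depends on the BLAS backend,
# threading, and hardware; over many layers such differences can grow.
print(torch.max(torch.abs(single - batched[:1])))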

@ShahRutav (Author)

Sounds good. I am closing the issue. Thanks for the help!
