
Increasing batch size results in a non-linear increase in computation time #616

Open
Wonder1905 opened this issue Dec 22, 2024 · 0 comments

Hi, I noticed that increasing the batch size (for training or inference) results in a non-linear increase in computation time (we would expect a linear increase to be an upper bound, in some sense).
I saw it in my own environment, built another one, and in the end also tried it in Colab; here is the Colab code:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import numpy as np
import time
import torch
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.float16, device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "resized_height": 256, "resized_width": 256, "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "image", "resized_height": 256, "resized_width": 256, "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]


#messages = [messages1,messages1,messages1,messages1]
#messages = [messages1,messages1,messages1]
#messages = [messages1,messages1]
messages = [messages1]
# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

avg_list = []
for attempt in range(100):
    with torch.no_grad():
        torch.cuda.synchronize()
        start = time.perf_counter()

        outputs = model(**inputs)  # single forward pass (no generation)
        torch.cuda.synchronize()
        end = time.perf_counter()
        avg_list.append(end - start)
        print(f"forward time: {end - start}")
print("Avg of 100:", np.mean(avg_list))

Since Colab can be tricky in allocating its resources, I did 10 runs of 100 iterations each and removed outliers; the results were:
Batch=4: 1.4s
Batch=3: 0.88s
Batch=2: 0.51s
Batch=1: 0.19s
We can see the non-linear increase, and when the image size increases the effect is much sharper.
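
To make the non-linearity concrete, here is the same data compared against an ideal linear scaling from the batch=1 measurement (numbers copied from the averages above):

# Observed average forward times (seconds) from the runs above.
observed = {1: 0.19, 2: 0.51, 3: 0.88, 4: 1.4}
for batch, t in observed.items():
    linear = batch * observed[1]           # what linear scaling from batch=1 would give
    print(f"batch={batch}: observed {t:.2f}s, linear would be {linear:.2f}s, "
          f"ratio vs batch=1: {t / observed[1]:.1f}x")
# batch=4 comes out at ~7.4x the batch=1 time, well above the 4x of linear scaling.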
After debugging, I found it happens in the vision tower:
self.visual = Qwen2VisionTransformerPretrainedModel._from_config(config.vision_config)

in the scaled_dot_product_attention call:
attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)

This is probably because you are treating the batch of images as one long sequence and handling it with an attention mask, but sequence length is the biggest pain point in transformers, so why is this the implementation?
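
If that is indeed what is going on, the cost difference is easy to see with plain scaled_dot_product_attention. The following is only a sketch of the suspected effect with made-up shapes, not the actual Qwen2-VL code:

# Illustrative sketch: two ways of running attention over B images of N patches each.
# Shapes are made up for the example, not taken from the model.
import torch
import torch.nn.functional as F

B, N, H, D = 4, 256, 16, 80               # batch, patches per image, heads, head dim
q = k = v = torch.randn(B, H, N, D)

# (a) keep a real batch dimension: score matrices are N x N per image,
#     so the attention work grows like B * N^2 (linear in batch size).
out_batched = F.scaled_dot_product_attention(q, k, v)

# (b) flatten all images into one sequence and mask cross-image attention:
#     the block-diagonal mask keeps images independent, but the score matrix
#     is (B*N) x (B*N), so the work grows like (B*N)^2 (quadratic in batch size).
q_flat = q.transpose(0, 1).reshape(1, H, B * N, D)
mask = torch.zeros(B * N, B * N, dtype=torch.bool)
for i in range(B):
    mask[i * N:(i + 1) * N, i * N:(i + 1) * N] = True
out_flat = F.scaled_dot_product_attention(q_flat, q_flat, q_flat, attn_mask=mask)

# Same result up to a reshape, but (b) does roughly B times more attention work.
print(torch.allclose(out_batched.transpose(0, 1).reshape(1, H, B * N, D),
                     out_flat, atol=1e-4))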

Am I missing something?
