
Increasing batch size results in a non-linear increase in computation time #616

Open
Wonder1905 opened this issue Dec 22, 2024 · 0 comments

Hi, I noticed that increasing the batch size (for training or inference) results in a non-linear increase in computation time (we would expect a linear increase to be an upper bound, in some sense).
I saw it in my own environment, built another one, and in the end also tried it in Colab; here is the Colab code:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import numpy as np
import time
import torch
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.float16, device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "resized_height": 256, "resized_width": 256, "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "image", "resized_height": 256, "resized_width": 256, "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]


#messages = [messages1,messages1,messages1,messages1]
#messages = [messages1,messages1,messages1]
#messages = [messages1,messages1]
messages = [messages1]
# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

avg_list = []
for attempt in range(100):
    with torch.no_grad():
        torch.cuda.synchronize()
        start = time.perf_counter()

        outputs = model(**inputs)  # single forward pass (no generation)
        torch.cuda.synchronize()
        end = time.perf_counter()
        avg_list.append(end - start)
        print(f"forward time: {end - start}")
print("Avg of 100:", np.mean(avg_list))

Since Colab can be tricky in allocating its resources, I did 10 runs of 100 iterations each and removed outliers; the results were:
Batch=4: 1.4s
Batch=3: 0.88s
Batch=2: 0.51s
Batch=1: 0.19s
We can see the non-linear increase, and when the image size increases the effect is much sharper.
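
To make the non-linearity concrete, here is the same data compared against an ideal linear scaling from the batch=1 measurement (numbers copied from the averages above):

# Observed average forward times (seconds) from the runs above.
observed = {1: 0.19, 2: 0.51, 3: 0.88, 4: 1.4}
for batch, t in observed.items():
    linear = batch * observed[1]           # what linear scaling from batch=1 would give
    print(f"batch={batch}: observed {t:.2f}s, linear would be {linear:.2f}s, "
          f"ratio vs batch=1: {t / observed[1]:.1f}x")
# batch=4 comes out at ~7.4x the batch=1 time, well above the 4x of linear scaling.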
After debugging, I found it happens in the vision tower:
self.visual = Qwen2VisionTransformerPretrainedModel._from_config(config.vision_config)

in the scaled_dot_product_attention call:
attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)

This is probably because you are treating the batch of images as one long sequence and handling it with an attention mask, but sequence length is the biggest pain point in transformers, so why is this the implementation?
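
If that is indeed what is going on, the cost difference is easy to see with plain scaled_dot_product_attention. The following is only a sketch of the suspected effect with made-up shapes, not the actual Qwen2-VL code:

# Illustrative sketch: two ways of running attention over B images of N patches each.
# Shapes are made up for the example, not taken from the model.
import torch
import torch.nn.functional as F

B, N, H, D = 4, 256, 16, 80               # batch, patches per image, heads, head dim
q = k = v = torch.randn(B, H, N, D)

# (a) keep a real batch dimension: score matrices are N x N per image,
#     so the attention work grows like B * N^2 (linear in batch size).
out_batched = F.scaled_dot_product_attention(q, k, v)

# (b) flatten all images into one sequence and mask cross-image attention:
#     the block-diagonal mask keeps images independent, but the score matrix
#     is (B*N) x (B*N), so the work grows like (B*N)^2 (quadratic in batch size).
q_flat = q.transpose(0, 1).reshape(1, H, B * N, D)
mask = torch.zeros(B * N, B * N, dtype=torch.bool)
for i in range(B):
    mask[i * N:(i + 1) * N, i * N:(i + 1) * N] = True
out_flat = F.scaled_dot_product_attention(q_flat, q_flat, q_flat, attn_mask=mask)

# Same result up to a reshape, but (b) does roughly B times more attention work.
print(torch.allclose(out_batched.transpose(0, 1).reshape(1, H, B * N, D),
                     out_flat, atol=1e-4))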

Am I missing something?
