
[VLM] Qwen2.5-VL #12604

Merged
merged 54 commits into from
Feb 5, 2025

Conversation

ywang96
Member

@ywang96 ywang96 commented Jan 31, 2025

FIXES: #12486, #12532

TODO:

To run this model before the transformers 4.49 release, install transformers from source:
pip install git+https://github.com/huggingface/transformers

Co-authored-by: @yixqiao (UC Berkeley), @wulipc (Qwen Team)

Signed-off-by: Roger Wang <ywang@roblox.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ywang96 ywang96 mentioned this pull request Jan 31, 2025
@DarkLight1337 DarkLight1337 self-assigned this Jan 31, 2025
@mergify mergify bot added the frontend label Jan 31, 2025
@ywang96 ywang96 mentioned this pull request Jan 31, 2025
@mergify mergify bot added the v1 label Feb 1, 2025
yixqiao and others added 17 commits February 1, 2025 02:27
@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 2, 2025
@kevin-ssy

Can you show your code?

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info


class QwenVL_VLLM:
    def __init__(self, llm_name='ckpts/Qwen2.5-VL-72B-Instruct', **llm_args):
        self.llm = LLM(
            model=llm_name,
            limit_mm_per_prompt={"image": 10, "video": 10},
            tensor_parallel_size=8,
            dtype='bfloat16',
            max_num_seqs=5,
            mm_processor_kwargs={
                "min_pixels": 28 * 28,
                "max_pixels": 1280 * 28 * 28,
                "fps": 1,
            },
            # disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
            **llm_args
        )
        self.sample_params = SamplingParams(temperature=0.2, max_tokens=512)
        # default processor
        self.processor = AutoProcessor.from_pretrained(llm_name, max_pixels=854 * 480)
        self.processor.tokenizer.padding_side = "left"

    def get_batch_messages(self, video_paths, queries, duration=1.0):
        # Build one chat message per (video, query) pair.
        messages = [
            [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "video",
                            "video": video_path,
                            "max_pixels": 360 * 420,
                            "fps": 1.0,
                        },
                        {"type": "text", "text": query},
                    ],
                }
            ] for video_path, query in zip(video_paths, queries)
        ]
        texts = [self.processor.apply_chat_template(
            msg, tokenize=False, add_generation_prompt=True) for msg in messages]
        image_inputs, video_inputs, video_kwargs = process_vision_info(
            messages, return_video_kwargs=True)
        return [{
            "prompt": query,
            "multi_modal_data": {
                "video": {
                    "data": v_input.numpy(),
                    "question": query,
                }
            },
        } for v_input, query in zip(video_inputs, texts)]

    def __call__(self, video_path, query, **kwargs):
        if isinstance(video_path, list) and isinstance(query, list):
            inputs = self.get_batch_messages(video_path, query)
        else:
            raise ValueError("video_path and query must both be lists")
        outputs = self.llm.generate(inputs, sampling_params=self.sample_params)
        return outputs

Sure. There you go!

@DarkLight1337
Member

You should pass a numpy array directly to multi_modal_data.video instead of a nested dictionary. The query is already provided in prompt.
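Something like this (a minimal sketch based on the snippet above; texts, video_inputs and the llm/sample_params objects are the names from that code, minus the self. prefix):

inputs = [{
    "prompt": prompt,
    # Pass the decoded frames (a numpy array) directly as the video value;
    # the question text is already carried by "prompt", so no nested dict.
    "multi_modal_data": {"video": v_input.numpy()},
} for v_input, prompt in zip(video_inputs, texts)]
outputs = llm.generate(inputs, sampling_params=sample_params)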

@kevin-ssy

You should pass a numpy array directly to multi_modal_data.video instead of a nested dictionary. The query is already provided in prompt.

Just fixed it. Brilliant, thanks for your prompt reply!

@yfllllll

yfllllll commented Feb 6, 2025

@rstone3017, have you solved it? I also met this problem.

@xiayq1

xiayq1 commented Feb 7, 2025

Can Qwen2.5-VL-7B run on a V100?

  1. GPU: Tesla V100-SXM2-32GB
  2. Versions: transformers 4.49.0.dev0, vllm 0.7.3.dev3+gc786e75.cu124, flash_attn 2.1.0
  3. Command: vllm serve /Qwen2.5-VL-7B-Instruct --port 8000 --host 0.0.0.0 --dtype float16 --max-model-len 256

I get:
....
e_size":256}, use_cached_outputs=True,
ERROR 02-07 15:16:36 utils.py:608] Cannot use FA version 2 is not supported due to FA3 is only supported on devices with compute capability >= 8 excluding 8.6 and 8.9
ERROR 02-07 15:16:36 engine.py:389]
Traceback (most recent call last):
...
/miniconda3/envs/qwen25vl/lib/python3.10/site-packages/vllm/attention/backends/utils.py", line 611, in flash_attn_version
assert is_fa_version_supported(fa_version)
AssertionError
....
/site-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

@linchen111

How can I POST with a local image or local video?

@DarkLight1337
Member

DarkLight1337 commented Feb 7, 2025

How can I POST with a local image or local video?

You can set --allowed-local-media-path in vllm serve and pass a file URL starting with file:// in the request.

@DarkLight1337
Member

DarkLight1337 commented Feb 7, 2025

Can Qwen2.5-VL-7B run on a V100? [setup and error log quoted from the comment above]

This should be fixed by #12828, can you try using the latest code?
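For example (a rough sketch of one way to pick up the latest code by building from source; adjust to your own environment):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # build vLLM from the current main branch, which should include the fix from #12828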

@MotorBottle

I have built vLLM and this branch from source, and I get the following error:

TypeError: Unknown image model type: qwen2_5_vl

I am serving the model as follows:

vllm serve Qwen/Qwen2.5-VL-72B-Instruct --quantization bitsandbytes --load-format bitsandbytes --pipeline_parallel_size 2 --max_model_len 10000

Were you able to run this model with bnb quantization? I tried but failed (#12900). Could you provide any idea or instructions on how to fix this? Appreciated!

@ywang96
Member Author

ywang96 commented Feb 7, 2025

I have built vLLM and this branch from source, and I get the following error:
TypeError: Unknown image model type: qwen2_5_vl
I am serving the model as follows:
vllm serve Qwen/Qwen2.5-VL-72B-Instruct --quantization bitsandbytes --load-format bitsandbytes --pipeline_parallel_size 2 --max_model_len 10000

Were you able to run this model with bnb quantization? I tried but failed (#12900). Could you provide any idea or instructions on how to fix this? Appreciated!

@MotorBottle I don't think this model is supported with bnb yet. See #12604 (comment)

@ransheng11

@yfllllll, have you solved it? I also met this problem.

@Isotr0py
Collaborator

Isotr0py commented Feb 8, 2025

@MotorBottle Can you try #12944? The BNB support for qwen2.5-vl should be added in that PR.
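If you want to try it before it is merged, one possible way (a sketch, assuming you already have a source checkout of vllm-project/vllm; the local branch name is arbitrary):

git fetch origin pull/12944/head:qwen25-vl-bnb
git checkout qwen25-vl-bnb
pip install -e .  # rebuild vLLM from the PR branch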

@hxujal

hxujal commented Feb 8, 2025

How can I POST with a local image or local video?

You can set --allowed-local-media-path in vllm serve and pass a file URL starting with file:// in the request.

Can you give a demo of passing in a local image?

@MotorBottle

@MotorBottle Can you try #12944? The BNB support for qwen2.5-vl should be added in that PR.

Confirmed working with #12944. Qwen2.5-VL-7B-Instruct tested.

@DarkLight1337
Member

DarkLight1337 commented Feb 8, 2025

How can I POST with a local image or local video?

You can set --allowed-local-media-path in vllm serve and pass a file URL starting with file:// in the request.

Can you give a demo of passing in a local image?

vllm serve <model> --allowed-local-media-path /path/to/data
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = client.models.list().data[0].id

chat_response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "file://path/to/data/path/to/image.jpg",
                    },
                },
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
)

@thiner

thiner commented Feb 8, 2025

I am using the latest vllm v0.7.2 docker image, but it fails to serve the qwen2.5-vl-7b model. The error message:

ValueError: The checkpoint you are trying to load has model type `qwen2_5_vl` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

It seems the docker image was not built with the required transformers version.

@DarkLight1337
Member

Yes, you need to manually install transformers from source as they haven't released this model yet.

@xiayq1

xiayq1 commented Feb 8, 2025

I have built vLLM and this branch from source, and I get the following error:
TypeError: Unknown image model type: qwen2_5_vl
I am serving the model as follows:
vllm serve Qwen/Qwen2.5-VL-72B-Instruct --quantization bitsandbytes --load-format bitsandbytes --pipeline_parallel_size 2 --max_model_len 10000

Were you able to run this model with bnb quantization? I tried but failed (#12900). Could you provide any idea or instructions on how to fix this? Appreciated!

Fixed, thanks a lot!

@fearnworks

fearnworks commented Feb 9, 2025

I am hitting this issue when trying to run:

ERROR 02-09 13:48:41 core.py:210]     bin_counts.scatter_add_(1, tokens, torch.ones_like(tokens))
ERROR 02-09 13:48:41 core.py:210] RuntimeError: Expected index [5, 1943] to be smaller than self [4, 152065] apart from dimension 1 and to be smaller size than src [5, 1943]
ERROR 02-09 13:48:41 core.py:210] 
CRITICAL 02-09 13:48:41 core_client.py:158] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

with this script:

uv venv --python 3.12.8
source .venv/bin/activate
uv pip install vllm  # ---> no need to install vLLM from source anymore because of the latest release
uv pip install flash-attn --no-build-isolation  # ---> otherwise it will use xformers; alternatively, use flashinfer via uv pip install flashinfer-python
uv pip install "git+https://github.com/huggingface/transformers"  # ---> this needs to be the last step, at least for now; once transformers releases a new version, a plain uv pip install transformers will do
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
    --port 12434 \
    --host 0.0.0.0 \
    --max-model-len 16434 \
    --dtype bfloat16 \
    --served-model-name vision-worker \
    --limit-mm-per-prompt image=1,video=0 

@ZhonghaoLu

Can Qwen2.5-VL-7B run on a V100? [setup and error log quoted from the comment above]

Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 911, in
uvloop.run(run_server(args))
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Could you please look into it? I seem to have encountered a similar error; it is triggered whenever tp > 1.

@ywang96 How can I solve it?

@DarkLight1337
Member

I think this should be fixed by #12828 already, can you pull the latest code and try again?

@ZhonghaoLu

I think this should be fixed by #12828 already, can you pull the latest code and try again?

Yes, I've pulled the latest code and tried it. I don't know why the bug is consistently triggered when tp > 1, but there is no problem deploying the 7B model on a single card.

@DarkLight1337
Member

Can you open a new issue and show your output of collect_env.py?

@jmtatsch

I tried to run inference on Qwen2.5-VL via vLLM 0.7.2 and the current dev transformers, but I get this import error:

ImportError: cannot import name 'Qwen2_5_VLImageProcessor' from 'transformers.models.qwen2_5_vl' (/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/__init__.py). Did you mean: 'Qwen2_5_VLProcessor'?

Am I doing something wrong, or has transformers dev changed again?

@DarkLight1337
Member

Transformers dev has changed. Please update vLLM and also your local version of the HF Hub repo.
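For example (a sketch, assuming the checkpoint was downloaded through the HF Hub cache; the repo id is illustrative):

from huggingface_hub import snapshot_download

# Force a re-download so the cached processor/config files are in sync
# with the renamed classes in current transformers dev.
snapshot_download("Qwen/Qwen2.5-VL-7B-Instruct", force_download=True)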

Successfully merging this pull request may close these issues.

[New Model]: Qwen2.5-VL