When tp>1 vllm not work (Qwen2.5-VL-72B) #13124

Open
ZhonghaoLu opened this issue Feb 12, 2025 · 6 comments

@ZhonghaoLu

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Could you please help with this? I seem to have encountered a similar error; it is raised whenever tp>1.

@ywang96 How can I solve it?

Originally posted by @ZhonghaoLu in #12604 (comment)

@ZhonghaoLu
Author

(qwen25vl) lzh@instance-aw1rhmsz-5:~/Code/qwen2_5vl/vllm$ python collect_env.py
/pfs/mt-hiEd6E/home/lzh/Code/qwen2_5vl/vllm/vllm/__init__.py:5: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from .version import __version__, __version_tuple__  # isort:skip

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:31:09) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
Stepping: 6
BogoMIPS: 5187.76
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.6 MiB (56 instances)
L1i cache: 1.8 MiB (56 instances)
L2 cache: 70 MiB (56 instances)
L3 cache: 96 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-55
NUMA node1 CPU(s): 56-111
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0.dev0
[pip3] triton==3.1.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-ml-py 12.570.86 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] pyzmq 26.2.1 pypi_0 pypi
[conda] torch 2.5.1 pypi_0 pypi
[conda] torchaudio 2.5.1 pypi_0 pypi
[conda] torchvision 0.20.1 pypi_0 pypi
[conda] transformers 4.49.0.dev0 pypi_0 pypi
[conda] triton 3.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS 56-111 1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS 56-111 1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS 56-111 1 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS 56-111 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0

LD_LIBRARY_PATH=/pfs/mt-hiEd6E/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/cv2/../../lib64:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

@ywang96 The above is the output of collect_env.py.
Thanks!

@ZhonghaoLu ZhonghaoLu changed the title When tp>1 vllm not work When tp>1 vllm not work (Qwen2.5-VL-72B) Feb 12, 2025
@ywang96
Member

ywang96 commented Feb 12, 2025

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.
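
For reference, a minimal sketch of how to surface the engine's own error rather than the generic frontend failure (assuming vLLM 0.7.2 as in this thread; both knobs below are existing vLLM options, but the exact command is illustrative):

```bash
# Raise log verbosity and keep the engine in the API-server process so its
# traceback is printed directly instead of "Engine process failed to start".
VLLM_LOGGING_LEVEL=DEBUG vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct \
    --tensor-parallel-size 4 \
    --disable-frontend-multiprocessing
```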

@LaoWangGB

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

When I use Qwen2.5-VL-72B-Instruct, there is a significant discrepancy between the inference results obtained with PyTorch and those obtained with vLLM, even with the same generation args.
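
One way to make such a comparison meaningful (a sketch, assuming a server is already up on 127.0.0.1:6006 as in the launch command later in this thread): pin the sampling parameters explicitly and compare against a deterministic PyTorch run, since any nonzero temperature makes run-to-run differences expected.

```bash
# Hypothetical request; temperature=0 approximates greedy decoding so the
# output can be compared against a deterministic Hugging Face generate() run.
curl http://127.0.0.1:6006/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct",
        "messages": [{"role": "user", "content": "Describe the image."}],
        "temperature": 0,
        "max_tokens": 256
      }'
```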

@ZhonghaoLu
Author

ZhonghaoLu commented Feb 12, 2025

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

(qwen25vl) lzh@instance-aw1rhmsz-5:~/Code/qwen2_5vl$ vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct --dtype auto --host 127.0.0.1 --port 6006 --tensor-parallel-size 4
INFO 02-12 13:31:03 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:04 api_server.py:840] vLLM API server version 0.7.2
INFO 02-12 13:31:04 api_server.py:841] args: Namespace(subparser='serve', model_tag='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', config='', host='127.0.0.1', port=6006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
dispatch_function=<function serve at 0x7f4c79b4d800>)
INFO 02-12 13:31:04 api_server.py:206] Started engine process with PID 3458510
INFO 02-12 13:31:08 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:10 config.py:542] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 02-12 13:31:10 config.py:1401] Defaulting to use mp for distributed inference
WARNING 02-12 13:31:10 arg_utils.py:1145] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 02-12 13:31:15 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 02-12 13:31:15 config.py:1401] Defaulting to use mp for distributed inference
WARNING 02-12 13:31:15 arg_utils.py:1145] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 02-12 13:31:15 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', speculative_config=None, tokenizer='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 02-12 13:31:15 multiproc_worker_utils.py:300] Reducing Torch parallelism from 56 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-12 13:31:15 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 02-12 13:31:16 cuda.py:230] Using Flash Attention backend.
INFO 02-12 13:31:19 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:19 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:19 __init__.py:190] Automatically detected platform cuda.
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:20 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:20 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:20 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:20 cuda.py:230] Using Flash Attention backend.
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:20 cuda.py:230] Using Flash Attention backend.
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:21 cuda.py:230] Using Flash Attention backend.
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
Traceback (most recent call last):
  File "/home/lzh/anaconda3/envs/qwen25vl/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/scripts.py", line 204, in main
    args.dispatch_function(args)
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/scripts.py", line 44, in serve
    uvloop.run(run_server(args))
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 12 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
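
One thing worth ruling out before digging deeper (a sketch, not a confirmed fix): the warning above notes that the default 128000-token context may cause OOM during the initial memory-profiling phase, and with tp=4 a 72B model leaves limited headroom per A100-80GB. Capping the context length and skipping CUDA graph capture keeps the startup footprint smaller; all flags below already appear in the args dump earlier in this log.

```bash
# Conservative launch to rule out startup OOM; adjust --max-model-len to the
# longest context actually needed.
vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct \
    --host 127.0.0.1 --port 6006 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager
```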

@00058421

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

When I use Qwen2.5-VL-72B-Instruct, there is a significant discrepancy between the inference results obtained with PyTorch and those obtained with vLLM, even with the same generation args.

@LaoWangGB I am facing the same issue here. Were you able to solve it?

@LaoWangGB

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

When I use Qwen2.5-VL-72B-Instruct, there is a significant discrepancy between the inference results obtained with PyTorch and those obtained with vLLM, even with the same generation args.

@LaoWangGB I am facing the same issue here. Were you able to solve it?

Set distributed_executor_backend="ray" when initializing the LLM. Also check the sampling args in vLLM, such as temperature, frequency_penalty, repetition_penalty, and so on. The diffs become smaller but still exist.
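
For completeness, the serve-time equivalent of the suggestion above (a sketch; --distributed-executor-backend is the CLI counterpart of the distributed_executor_backend argument used when constructing LLM(...), and Ray must be installed separately):

```bash
# Switch the tp>1 executor from multiprocessing to Ray, as suggested above.
vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct \
    --tensor-parallel-size 4 \
    --distributed-executor-backend ray
```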
