When tp>1 vllm not work (Qwen2.5-VL-72B) #13124

Open
ZhonghaoLu opened this issue Feb 12, 2025 · 6 comments

@ZhonghaoLu

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Could you please help with this? I seem to have encountered a similar error; it is raised whenever tp>1.

@ywang96 How can I solve it?

Originally posted by @ZhonghaoLu in #12604 (comment)

@ZhonghaoLu
Author

(qwen25vl) lzh@instance-aw1rhmsz-5:~/Code/qwen2_5vl/vllm$ python collect_env.py
/pfs/mt-hiEd6E/home/lzh/Code/qwen2_5vl/vllm/vllm/__init__.py:5: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from .version import __version__, __version_tuple__  # isort:skip

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:31:09) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.7
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
Stepping: 6
BogoMIPS: 5187.76
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.6 MiB (56 instances)
L1i cache: 1.8 MiB (56 instances)
L2 cache: 70 MiB (56 instances)
L3 cache: 96 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-55
NUMA node1 CPU(s): 56-111
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0.dev0
[pip3] triton==3.1.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-ml-py 12.570.86 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] pyzmq 26.2.1 pypi_0 pypi
[conda] torch 2.5.1 pypi_0 pypi
[conda] torchaudio 2.5.1 pypi_0 pypi
[conda] torchvision 0.20.1 pypi_0 pypi
[conda] transformers 4.49.0.dev0 pypi_0 pypi
[conda] triton 3.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS 0-55 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS 56-111 1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS 56-111 1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS 56-111 1 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS 56-111 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0

LD_LIBRARY_PATH=/pfs/mt-hiEd6E/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/cv2/../../lib64:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

@ywang96 The above is the output of collect_env.py.
Thanks!

@ZhonghaoLu ZhonghaoLu changed the title When tp>1 vllm not work When tp>1 vllm not work (Qwen2.5-VL-72B) Feb 12, 2025
@ywang96
Member

ywang96 commented Feb 12, 2025

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.
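
For reference, a minimal sketch of how to surface the engine's own error rather than the generic frontend failure (assuming vLLM 0.7.2 as in this thread; both knobs below are existing vLLM options, but the exact command is illustrative):

```bash
# Raise log verbosity and keep the engine in the API-server process so its
# traceback is printed directly instead of "Engine process failed to start".
VLLM_LOGGING_LEVEL=DEBUG vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct \
    --tensor-parallel-size 4 \
    --disable-frontend-multiprocessing
```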

@LaoWangGB

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

When I use Qwen2.5-VL-72B-Instruct, there is a significant discrepancy between the inference results obtained with PyTorch and those obtained with vLLM, even with the same generation args.
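
One way to make such a comparison meaningful (a sketch, assuming a server is already up on 127.0.0.1:6006 as in the launch command later in this thread): pin the sampling parameters explicitly and compare against a deterministic PyTorch run, since any nonzero temperature makes run-to-run differences expected.

```bash
# Hypothetical request; temperature=0 approximates greedy decoding so the
# output can be compared against a deterministic Hugging Face generate() run.
curl http://127.0.0.1:6006/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct",
        "messages": [{"role": "user", "content": "Describe the image."}],
        "temperature": 0,
        "max_tokens": 256
      }'
```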

@ZhonghaoLu
Author

ZhonghaoLu commented Feb 12, 2025

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

(qwen25vl) lzh@instance-aw1rhmsz-5:~/Code/qwen2_5vl$ vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct --dtype auto --host 127.0.0.1 --port 6006 --tensor-parallel-size 4
INFO 02-12 13:31:03 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:04 api_server.py:840] vLLM API server version 0.7.2
INFO 02-12 13:31:04 api_server.py:841] args: Namespace(subparser='serve', model_tag='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', config='', host='127.0.0.1', port=6006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
dispatch_function=<function serve at 0x7f4c79b4d800>)
INFO 02-12 13:31:04 api_server.py:206] Started engine process with PID 3458510
INFO 02-12 13:31:08 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:10 config.py:542] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 02-12 13:31:10 config.py:1401] Defaulting to use mp for distributed inference
WARNING 02-12 13:31:10 arg_utils.py:1145] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 02-12 13:31:15 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
INFO 02-12 13:31:15 config.py:1401] Defaulting to use mp for distributed inference
WARNING 02-12 13:31:15 arg_utils.py:1145] The model has a long context length (128000). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 02-12 13:31:15 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', speculative_config=None, tokenizer='/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 02-12 13:31:15 multiproc_worker_utils.py:300] Reducing Torch parallelism from 56 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-12 13:31:15 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 02-12 13:31:16 cuda.py:230] Using Flash Attention backend.
INFO 02-12 13:31:19 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:19 __init__.py:190] Automatically detected platform cuda.
INFO 02-12 13:31:19 __init__.py:190] Automatically detected platform cuda.
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:20 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:20 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:20 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:20 cuda.py:230] Using Flash Attention backend.
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:20 cuda.py:230] Using Flash Attention backend.
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:21 cuda.py:230] Using Flash Attention backend.
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:22 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3459394) INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=3459393) INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=3459392) INFO 02-12 13:31:22 pynccl.py:69] vLLM is using nccl==2.21.5
Traceback (most recent call last):
  File "/home/lzh/anaconda3/envs/qwen25vl/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/scripts.py", line 204, in main
    args.dispatch_function(args)
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/scripts.py", line 44, in serve
    uvloop.run(run_server(args))
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lzh/anaconda3/envs/qwen25vl/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 12 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
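
One thing worth ruling out before digging deeper (a sketch, not a confirmed fix): the warning above notes that the default 128000-token context may cause OOM during the initial memory-profiling phase, and with tp=4 a 72B model leaves limited headroom per A100-80GB. Capping the context length and skipping CUDA graph capture keeps the startup footprint smaller; all flags below already appear in the args dump earlier in this log.

```bash
# Conservative launch to rule out startup OOM; adjust --max-model-len to the
# longest context actually needed.
vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct \
    --host 127.0.0.1 --port 6006 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager
```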

@00058421

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

When I use Qwen2.5-VL-72B-Instruct, there is a significant discrepancy between the inference results obtained with PyTorch and those obtained with vLLM, even with the same generation args.

@LaoWangGB I am facing the same issue here. Were you able to solve it?

@LaoWangGB

You might have to show the whole stack trace, since the error you're showing is just the normal frontend server failure when the engine fails to start.

When I use Qwen2.5-VL-72B-Instruct, there is a significant discrepancy between the inference results obtained with PyTorch and those obtained with vLLM, even with the same generation args.

@LaoWangGB I am facing the same issue here. Were you able to solve it?

Set distributed_executor_backend="ray" when initializing the LLM. Also check the sampling args in vLLM, such as temperature, frequency_penalty, repetition_penalty, and so on. The diffs become smaller but still exist.
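
For completeness, the serve-time equivalent of the suggestion above (a sketch; --distributed-executor-backend is the CLI counterpart of the distributed_executor_backend argument used when constructing LLM(...), and Ray must be installed separately):

```bash
# Switch the tp>1 executor from multiprocessing to Ray, as suggested above.
vllm serve /home/lzh/llm_model/Qwen/Qwen2___5-VL-72B-Instruct \
    --tensor-parallel-size 4 \
    --distributed-executor-backend ray
```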
