[Bug]: RuntimeError: CUDA error: invalid device ordinal with multi node multi gpus #3722
Comments
Hi, can you try to build from source with the latest main? #3686 should resolve your problem I think. BTW, when you use two nodes, do you use ray to set up the two nodes as a cluster?
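One way to answer the cluster question is to look at what Ray itself reports. The following is a minimal sketch, not part of vLLM: `gpus_by_node` and `can_fit` are hypothetical helper names, and the sample data only mirrors the two IPs in this thread with an assumed GPU count of 2 per node. The dict keys (`Alive`, `NodeManagerAddress`, `Resources`) follow the shape of the entries returned by `ray.nodes()`.

```python
def gpus_by_node(nodes):
    """Map each alive node's address to the number of GPUs Ray sees on it.

    `nodes` is a list of dicts shaped like the entries of ray.nodes().
    """
    counts = {}
    for node in nodes:
        if not node.get("Alive", False):
            continue  # skip nodes that have left the cluster
        gpus = node.get("Resources", {}).get("GPU", 0)
        counts[node["NodeManagerAddress"]] = int(gpus)
    return counts


def can_fit(nodes, tensor_parallel_size):
    """Return True if the cluster exposes at least one GPU per tensor-parallel rank."""
    return sum(gpus_by_node(nodes).values()) >= tensor_parallel_size


# Assumed sample data: two nodes with 2 GPUs each (the counts are illustrative).
example_nodes = [
    {"Alive": True, "NodeManagerAddress": "10.4.80.151", "Resources": {"GPU": 2.0}},
    {"Alive": True, "NodeManagerAddress": "10.4.80.152", "Resources": {"GPU": 2.0}},
]

print(gpus_by_node(example_nodes))
print(can_fit(example_nodes, 4))

# On a live cluster the same helpers could be fed real data:
#   import ray
#   ray.init(address="auto")
#   print(gpus_by_node(ray.nodes()))
```

If a node is missing from the result, or the total is lower than `--tensor-parallel-size`, the two machines are probably not joined into one Ray cluster.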
The new vLLM build (0.4.0) fails with a new error:

ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.

root@ai151:/# python3 -m vllm.entrypoints.api_server --model /models/openchat-3.5-0106/ --tensor-parallel-size 4 --dtype float16 --enforce-eager
WARNING 04-01 10:49:31 config.py:748] Casting torch.bfloat16 to torch.float16.
2024-04-01 10:49:31,995 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 10.4.80.151:6379...
2024-04-01 10:49:32,005 INFO worker.py:1743 -- Connected to Ray cluster. View the dashboard at 10.4.80.151:8265
INFO 04-01 10:49:32 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model=/models/openchat-3.5-0106/, tokenizer=/models/openchat-3.5-0106/, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=306, ip=10.4.80.152) INFO 04-01 10:49:42 selector.py:34] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-01 10:49:52 selector.py:34] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-01 10:49:52 selector.py:21] Using XFormers backend.
(RayWorkerVllm pid=306, ip=10.4.80.152) INFO 04-01 10:49:42 selector.py:21] Using XFormers backend.
(RayWorkerVllm pid=11716) INFO 04-01 10:49:53 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 04-01 10:49:54 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=394, ip=10.4.80.152) Exception ignored in:
(RayWorkerVllm pid=394, ip=10.4.80.152) Traceback (most recent call last):
(RayWorkerVllm pid=394, ip=10.4.80.152) File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/model_executor/parallel_utils/pynccl.py, line 264, in __del__
(RayWorkerVllm pid=394, ip=10.4.80.152) _c_ncclCommDestroy(self.comm)
(RayWorkerVllm pid=394, ip=10.4.80.152) AttributeError: NCCLCommunicator object has no attribute comm
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/engine/ray_utils.py, line 37, in execute_method
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] return executor(*args, **kwargs)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/worker/worker.py, line 100, in init_device
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/worker/worker.py, line 287, in init_distributed_environment
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] pynccl_utils.init_process_group(
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/model_executor/parallel_utils/pynccl_utils.py, line 46, in init_process_group
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] comm = NCCLCommunicator(init_method=init_method,
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg/vllm/model_executor/parallel_utils/pynccl.py, line 236, in __init__
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] dist.broadcast(tensor, src=0)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py, line 47, in wrapper
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] return func(*args, **kwargs)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py, line 1906, in broadcast
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] work = default_pg.broadcast([tensor], opts)
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.1
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] Last error:
(RayWorkerVllm pid=394, ip=10.4.80.152) ERROR 04-01 10:49:46 ray_utils.py:44] socketProgressOpt: Call to recv from 10.4.80.152<57269> failed : Broken pipe

Here's the build log; it looks fine.
vllm master (@563c1d7ec56aa0f9fdc28720f3517bf9297f5476) build log

root@7fc7fec8839f:/tmp/vllm# python3 setup.py install
No CUDA runtime is found, using CUDA_HOME=/usr/local/cuda
running install
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
self.initialize_options()
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` and ``easy_install``.
Instead, use pypa/build, pypa/installer or other standards-based tools.
See https://github.com/pypa/setuptools/issues/917 for details.
********************************************************************************
!!
self.initialize_options()
running bdist_egg
running egg_info
writing vllm.egg-info/PKG-INFO
writing dependency_links to vllm.egg-info/dependency_links.txt
writing requirements to vllm.egg-info/requires.txt
writing top-level names to vllm.egg-info/top_level.txt
reading manifest file vllm.egg-info/SOURCES.txt
reading manifest template MANIFEST.in
adding license file LICENSE
writing manifest file vllm.egg-info/SOURCES.txt
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
-- Build type: RelWithDebInfo
-- Found python matching: /usr/bin/python3.
-- Caffe2: CUDA detected: 12.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.1
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is b51b459d
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.5;5.0;8.0;8.6;8.9;9.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90
CMake Warning at /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:64 (find_package)
-- CUDA supported arches: 7.0;7.5;8.0;8.6;8.9;9.0
-- discarding unsupported CUDA arch 3.10.
-- discarding unsupported CUDA arch 3.10.
-- CUDA target arches: 80-real;86-real;89-real;90-real
-- Punica target arches: 80-real;86-real;89-real;90-real
-- Enabling C extension.
-- Enabling moe extension.
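The configure output above lists three sets of architectures: the autodetected ones, the supported ones, and the resulting targets. A minimal sketch of how the targets appear to follow from the other two, assuming a plain intersection (the actual CMake logic in vLLM may differ, and `select_target_arches` is a hypothetical name used only for illustration):

```python
def select_target_arches(detected: str, supported: str) -> list[str]:
    """Keep the detected arches that are also supported, in CMake '<NN>-real' form."""
    supported_set = set(supported.split(";"))
    targets = []
    for arch in detected.split(";"):
        if arch in supported_set:
            # "8.0" -> "80-real", matching the format printed in the log
            targets.append(arch.replace(".", "") + "-real")
    return targets


# Values copied from the configure output in this build log.
detected = "3.5;5.0;8.0;8.6;8.9;9.0"
supported = "7.0;7.5;8.0;8.6;8.9;9.0"

print(";".join(select_target_arches(detected, supported)))
# prints: 80-real;86-real;89-real;90-real
```

Under this reading, 7.0 and 7.5 are supported but were not in the autodetected list (GPU detection failed in the build environment), so they are absent from the target arches.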
-- Configuring done (8.2s) -- Generating done (0.0s) -- Build files have been written to: /tmp/vllm/build/temp.linux-x86_64-cpython-310 [3/3] Linking CXX shared module /tmp/vllm/build/lib.linux-x86_64-cpython-310/vllm/_moe_C.cpython-310-x86_64-linux-gnu.so [6/14] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/awq/gemm_kernels.cu.o /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced int j_factors1 = 4; ^ Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced int row_stride2 = 4; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced int split_k_iters = 1; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced half B_shared_warp[32]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: variable OC was declared but never referenced int OC = 512; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced half scaling_factors_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced half zeros_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced int blockIdx_x = 0; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced __pack_half2(const half x, const half y) { ^ 
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced int j_factors1 = 4; ^ Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced int row_stride2 = 4; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced int split_k_iters = 1; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced half B_shared_warp[32]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: variable OC was declared but never referenced int OC = 512; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced half scaling_factors_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced half zeros_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced int blockIdx_x = 0; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced __pack_half2(const half x, const half y) { ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced int j_factors1 = 4; ^ Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced int row_stride2 = 4; ^ 
/tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced int split_k_iters = 1; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced half B_shared_warp[32]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: variable OC was declared but never referenced int OC = 512; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced half scaling_factors_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced half zeros_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced int blockIdx_x = 0; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced __pack_half2(const half x, const half y) { ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(282): warning #177-D: variable j_factors1 was declared but never referenced int j_factors1 = 4; ^ Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(283): warning #177-D: variable row_stride2 was declared but never referenced int row_stride2 = 4; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(284): warning #177-D: variable split_k_iters was declared but never referenced int split_k_iters = 1; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(290): warning #177-D: variable B_shared_warp was declared but never referenced half B_shared_warp[32]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(291): warning #177-D: 
variable OC was declared but never referenced int OC = 512; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(53): warning #177-D: variable scaling_factors_shared was declared but never referenced half scaling_factors_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(54): warning #177-D: variable zeros_shared was declared but never referenced half zeros_shared[N]; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(57): warning #177-D: variable blockIdx_x was declared but never referenced int blockIdx_x = 0; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(71): warning #177-D: variable ld_zero_flag was declared but never referenced bool ld_zero_flag = (threadIdx.y * 32 + threadIdx.x) * 8 < N; ^ /tmp/vllm/csrc/quantization/awq/gemm_kernels.cu(24): warning #177-D: function vllm::awq::__pack_half2 was declared but never referenced __pack_half2(const half x, const half y) { ^ [7/14] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/squeezellm/quant_cuda_kernel.cu.o /tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu: In function ‘void squeezellm_gemm(at::Tensor, at::Tensor, at::Tensor, at::Tensor)’: /tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu:206:136: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 206 | vllm::squeezellm::NUQ4MatMulKernel<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here 247 | T * data() const { | ^ ~~ /tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu:206:193: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. 
[-Wdeprecated-declarations] 206 | vllm::squeezellm::NUQ4MatMulKernel<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here 247 | T * data() const { | ^ ~~ /tmp/vllm/csrc/quantization/squeezellm/quant_cuda_kernel.cu:206:237: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations] 206 | vllm::squeezellm::NUQ4MatMulKernel<<>>( | ^ /usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:247:1: note: declared here 247 | T * data() const { | ^ ~~ [12/14] Building CUDA object CMakeFiles/_C.dir/csrc/quantization/marlin/marlin_cuda_kernel.cu.o /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, 
group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, 
group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, 
thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, 
thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) 
[with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with 
threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const 
int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void 
marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ 
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ 
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % 
(group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / 
(group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if 
(group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, 
stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, 
stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, 
thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, 
thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) 
[with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 Remark: The warnings can be suppressed with -diag-suppress /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const 
int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1033 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=1, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void 
marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=2, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=3, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void 
marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=256, thread_m_blocks=4, thread_n_blocks=16, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1034 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ 
detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=8, thread_k_blocks=4, stages=4, group_blocks=-1] at line 1035 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) { ^ detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036 /tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero (pipe / (group_blocks / thread_k_blocks))); ^ 
  detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=1, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
  if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
  ^
  detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
  (pipe / (group_blocks / thread_k_blocks)));
  ^
  detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=2, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
  if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
  ^
  detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
  (pipe / (group_blocks / thread_k_blocks)));
  ^
  detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=3, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(455): warning #179-D: right operand of % is zero
  if (group_blocks != -1 && pipe % (group_blocks / thread_k_blocks) == 0) {
  ^
  detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
/tmp/vllm/csrc/quantization/marlin/marlin_cuda_kernel.cu(487): warning #39-D: division by zero
  (pipe / (group_blocks / thread_k_blocks)));
  ^
  detected during instantiation of void marlin::Marlin(const int4 *, const int4 *, int4 *, const int4 *, int, int, int, int *) [with threads=128, thread_m_blocks=4, thread_n_blocks=4, thread_k_blocks=8, stages=4, group_blocks=-1] at line 1036
[13/14] Building CUDA object CMakeFiles/_C.dir/csrc/attention/attention_kernels.cu.o
/tmp/vllm/csrc/attention/attention_kernels.cu(625): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/attention/attention_kernels.cu(806): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
/tmp/vllm/csrc/attention/attention_kernels.cu(625): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/attention/attention_kernels.cu(806): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
/tmp/vllm/csrc/attention/attention_kernels.cu(625): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/attention/attention_kernels.cu(806): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
/tmp/vllm/csrc/attention/attention_kernels.cu(625): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
Remark: The warnings can be suppressed with -diag-suppress
/tmp/vllm/csrc/attention/attention_kernels.cu(806): warning #177-D: variable thread_group_size was declared but never referenced
  int thread_group_size = ((32 / BLOCK_SIZE) > (1) ? (32 / BLOCK_SIZE) : (1));
  ^
[14/14] Linking CXX shared module /tmp/vllm/build/lib.linux-x86_64-cpython-310/vllm/_C.cpython-310-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/test_utils.py -> build/bdist.linux-x86_64/egg/vllm
copying build/lib.linux-x86_64-cpython-310/vllm/utils.py -> build/bdist.linux-x86_64/egg/vllm
creating build/bdist.linux-x86_64/egg/vllm/attention
copying build/lib.linux-x86_64-cpython-310/vllm/attention/layer.py -> build/bdist.linux-x86_64/egg/vllm/attention
copying build/lib.linux-x86_64-cpython-310/vllm/attention/__init__.py -> build/bdist.linux-x86_64/egg/vllm/attention
creating build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/ops/paged_attn.py -> build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/ops/prefix_prefill.py -> build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/ops/__init__.py -> build/bdist.linux-x86_64/egg/vllm/attention/ops
copying build/lib.linux-x86_64-cpython-310/vllm/attention/selector.py -> 
build/bdist.linux-x86_64/egg/vllm/attention creating build/bdist.linux-x86_64/egg/vllm/attention/backends copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/flash_attn.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/xformers.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/__init__.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends copying build/lib.linux-x86_64-cpython-310/vllm/attention/backends/abstract.py -> build/bdist.linux-x86_64/egg/vllm/attention/backends creating build/bdist.linux-x86_64/egg/vllm/worker copying build/lib.linux-x86_64-cpython-310/vllm/worker/neuron_model_runner.py -> build/bdist.linux-x86_64/egg/vllm/worker copying build/lib.linux-x86_64-cpython-310/vllm/worker/worker.py -> build/bdist.linux-x86_64/egg/vllm/worker copying build/lib.linux-x86_64-cpython-310/vllm/worker/cache_engine.py -> build/bdist.linux-x86_64/egg/vllm/worker copying build/lib.linux-x86_64-cpython-310/vllm/worker/model_runner.py -> build/bdist.linux-x86_64/egg/vllm/worker copying build/lib.linux-x86_64-cpython-310/vllm/worker/__init__.py -> build/bdist.linux-x86_64/egg/vllm/worker copying build/lib.linux-x86_64-cpython-310/vllm/worker/neuron_worker.py -> build/bdist.linux-x86_64/egg/vllm/worker copying build/lib.linux-x86_64-cpython-310/vllm/py.typed -> build/bdist.linux-x86_64/egg/vllm creating build/bdist.linux-x86_64/egg/vllm/transformers_utils copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils creating build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/__init__.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/chatglm.py -> 
build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/jais.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/mpt.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/dbrx.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/configs/falcon.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/detokenizer.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/__init__.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/config.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils creating build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizers/__init__.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizers/baichuan.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers creating build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/ray_tokenizer_group.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/tokenizer_group.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/__init__.py -> 
build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group copying build/lib.linux-x86_64-cpython-310/vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py -> build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group creating build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/models.py -> build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/utils.py -> build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/worker_manager.py -> build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/request.py -> build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/__init__.py -> build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/punica.py -> build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/lora.py -> build/bdist.linux-x86_64/egg/vllm/lora copying build/lib.linux-x86_64-cpython-310/vllm/lora/layers.py -> build/bdist.linux-x86_64/egg/vllm/lora creating build/bdist.linux-x86_64/egg/vllm/core copying build/lib.linux-x86_64-cpython-310/vllm/core/block_manager_v1.py -> build/bdist.linux-x86_64/egg/vllm/core copying build/lib.linux-x86_64-cpython-310/vllm/core/evictor.py -> build/bdist.linux-x86_64/egg/vllm/core copying build/lib.linux-x86_64-cpython-310/vllm/core/block_manager_v2.py -> build/bdist.linux-x86_64/egg/vllm/core copying build/lib.linux-x86_64-cpython-310/vllm/core/scheduler.py -> build/bdist.linux-x86_64/egg/vllm/core copying build/lib.linux-x86_64-cpython-310/vllm/core/__init__.py -> build/bdist.linux-x86_64/egg/vllm/core copying build/lib.linux-x86_64-cpython-310/vllm/core/policy.py -> build/bdist.linux-x86_64/egg/vllm/core copying build/lib.linux-x86_64-cpython-310/vllm/core/interfaces.py -> build/bdist.linux-x86_64/egg/vllm/core creating 
build/bdist.linux-x86_64/egg/vllm/model_executor creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/sampler.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/layernorm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=2688,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=3584,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=1792,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=3584,device_name=NVIDIA_H100_80GB_HBM3.json -> 
build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-40GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=7168,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=1792,device_name=NVIDIA_A100-SXM4-40GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=1344,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=16,N=2688,device_name=NVIDIA_H100_80GB_HBM3.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/configs/E=8,N=3584,device_name=NVIDIA_A100-SXM4-40GB.json -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/configs copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/fused_moe/fused_moe.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/vocab_parallel_embedding.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/rotary_embedding.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers 
copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/linear.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/squeezellm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/gptq.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/marlin.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/base_config.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/quantization/awq.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers creating build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/ops/sample.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/ops/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/ops/rand.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/rejection_sampler.py -> 
build/bdist.linux-x86_64/egg/vllm/model_executor/layers copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/logits_processor.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/layers/activation.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/layers copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/sampling_metadata.py -> build/bdist.linux-x86_64/egg/vllm/model_executor copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/model_loader.py -> build/bdist.linux-x86_64/egg/vllm/model_executor copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/guided_decoding.py -> build/bdist.linux-x86_64/egg/vllm/model_executor copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/guided_logits_processors.py -> build/bdist.linux-x86_64/egg/vllm/model_executor creating build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/parallel_state.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/custom_all_reduce.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/communication_op.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/__init__.py -> 
build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/pynccl.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/parallel_utils/pynccl_utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/weight_utils.py -> build/bdist.linux-x86_64/egg/vllm/model_executor creating build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/bloom.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/orion.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/qwen2_moe.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/decilm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt_bigcode.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/qwen2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/starcoder2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/opt.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/mixtral_quant.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying 
build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/olmo.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/__init__.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gemma.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/phi.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/llama.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt_j.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/chatglm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/commandr.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/xverse.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/stablelm.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/jais.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/internlm2.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/mixtral.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/qwen.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying 
build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/deepseek.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/mpt.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/gpt_neox.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/llava.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/dbrx.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/baichuan.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/models/falcon.py -> build/bdist.linux-x86_64/egg/vllm/model_executor/models copying build/lib.linux-x86_64-cpython-310/vllm/model_executor/neuron_model_loader.py -> build/bdist.linux-x86_64/egg/vllm/model_executor creating build/bdist.linux-x86_64/egg/vllm/executor copying build/lib.linux-x86_64-cpython-310/vllm/executor/utils.py -> build/bdist.linux-x86_64/egg/vllm/executor copying build/lib.linux-x86_64-cpython-310/vllm/executor/ray_gpu_executor.py -> build/bdist.linux-x86_64/egg/vllm/executor copying build/lib.linux-x86_64-cpython-310/vllm/executor/executor_base.py -> build/bdist.linux-x86_64/egg/vllm/executor copying build/lib.linux-x86_64-cpython-310/vllm/executor/gpu_executor.py -> build/bdist.linux-x86_64/egg/vllm/executor copying build/lib.linux-x86_64-cpython-310/vllm/executor/neuron_executor.py -> build/bdist.linux-x86_64/egg/vllm/executor copying build/lib.linux-x86_64-cpython-310/vllm/executor/__init__.py -> build/bdist.linux-x86_64/egg/vllm/executor copying build/lib.linux-x86_64-cpython-310/vllm/sampling_params.py -> build/bdist.linux-x86_64/egg/vllm copying 
build/lib.linux-x86_64-cpython-310/vllm/__init__.py -> build/bdist.linux-x86_64/egg/vllm copying build/lib.linux-x86_64-cpython-310/vllm/config.py -> build/bdist.linux-x86_64/egg/vllm creating build/bdist.linux-x86_64/egg/vllm/entrypoints copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/llm.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/__init__.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/api_server.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints creating build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/serving_completion.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/serving_chat.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/__init__.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/api_server.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/protocol.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/cli_args.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/entrypoints/openai/serving_engine.py -> build/bdist.linux-x86_64/egg/vllm/entrypoints/openai copying build/lib.linux-x86_64-cpython-310/vllm/outputs.py -> build/bdist.linux-x86_64/egg/vllm creating build/bdist.linux-x86_64/egg/vllm/usage copying build/lib.linux-x86_64-cpython-310/vllm/usage/usage_lib.py -> build/bdist.linux-x86_64/egg/vllm/usage copying build/lib.linux-x86_64-cpython-310/vllm/usage/__init__.py -> build/bdist.linux-x86_64/egg/vllm/usage copying 
build/lib.linux-x86_64-cpython-310/vllm/block.py -> build/bdist.linux-x86_64/egg/vllm copying build/lib.linux-x86_64-cpython-310/vllm/logger.py -> build/bdist.linux-x86_64/egg/vllm creating build/bdist.linux-x86_64/egg/vllm/engine copying build/lib.linux-x86_64-cpython-310/vllm/engine/async_llm_engine.py -> build/bdist.linux-x86_64/egg/vllm/engine copying build/lib.linux-x86_64-cpython-310/vllm/engine/arg_utils.py -> build/bdist.linux-x86_64/egg/vllm/engine copying build/lib.linux-x86_64-cpython-310/vllm/engine/metrics.py -> build/bdist.linux-x86_64/egg/vllm/engine copying build/lib.linux-x86_64-cpython-310/vllm/engine/__init__.py -> build/bdist.linux-x86_64/egg/vllm/engine copying build/lib.linux-x86_64-cpython-310/vllm/engine/ray_utils.py -> build/bdist.linux-x86_64/egg/vllm/engine copying build/lib.linux-x86_64-cpython-310/vllm/engine/llm_engine.py -> build/bdist.linux-x86_64/egg/vllm/engine copying build/lib.linux-x86_64-cpython-310/vllm/sequence.py -> build/bdist.linux-x86_64/egg/vllm copying build/lib.linux-x86_64-cpython-310/vllm/_moe_C.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/vllm copying build/lib.linux-x86_64-cpython-310/vllm/_C.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/vllm creating build/bdist.linux-x86_64/egg/tests creating build/bdist.linux-x86_64/egg/tests/spec_decode copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_utils.py -> build/bdist.linux-x86_64/egg/tests/spec_decode copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/utils.py -> build/bdist.linux-x86_64/egg/tests/spec_decode copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_batch_expansion.py -> build/bdist.linux-x86_64/egg/tests/spec_decode copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_metrics.py -> build/bdist.linux-x86_64/egg/tests/spec_decode copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/__init__.py -> build/bdist.linux-x86_64/egg/tests/spec_decode copying 
build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_spec_decode_worker.py -> build/bdist.linux-x86_64/egg/tests/spec_decode copying build/lib.linux-x86_64-cpython-310/tests/spec_decode/test_multi_step_worker.py -> build/bdist.linux-x86_64/egg/tests/spec_decode creating build/bdist.linux-x86_64/egg/tests/worker copying build/lib.linux-x86_64-cpython-310/tests/worker/test_swap.py -> build/bdist.linux-x86_64/egg/tests/worker copying build/lib.linux-x86_64-cpython-310/tests/worker/__init__.py -> build/bdist.linux-x86_64/egg/tests/worker copying build/lib.linux-x86_64-cpython-310/tests/worker/test_model_runner.py -> build/bdist.linux-x86_64/egg/tests/worker creating build/bdist.linux-x86_64/egg/tests/tokenization copying build/lib.linux-x86_64-cpython-310/tests/tokenization/test_tokenizer_group.py -> build/bdist.linux-x86_64/egg/tests/tokenization copying build/lib.linux-x86_64-cpython-310/tests/tokenization/test_detokenize.py -> build/bdist.linux-x86_64/egg/tests/tokenization copying build/lib.linux-x86_64-cpython-310/tests/tokenization/__init__.py -> build/bdist.linux-x86_64/egg/tests/tokenization copying build/lib.linux-x86_64-cpython-310/tests/tokenization/test_cached_tokenizer.py -> build/bdist.linux-x86_64/egg/tests/tokenization creating build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_tokenizer_group.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_utils.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_lora.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/utils.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_layers.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_mixtral.py -> build/bdist.linux-x86_64/egg/tests/lora copying 
build/lib.linux-x86_64-cpython-310/tests/lora/test_chatglm3.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_lora_manager.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/conftest.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_llama.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_worker.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_punica.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/__init__.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_baichuan.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_layer_variation.py -> build/bdist.linux-x86_64/egg/tests/lora copying build/lib.linux-x86_64-cpython-310/tests/lora/test_gemma.py -> build/bdist.linux-x86_64/egg/tests/lora creating build/bdist.linux-x86_64/egg/tests/core copying build/lib.linux-x86_64-cpython-310/tests/core/utils.py -> build/bdist.linux-x86_64/egg/tests/core copying build/lib.linux-x86_64-cpython-310/tests/core/test_block_manager.py -> build/bdist.linux-x86_64/egg/tests/core copying build/lib.linux-x86_64-cpython-310/tests/core/test_scheduler.py -> build/bdist.linux-x86_64/egg/tests/core creating build/bdist.linux-x86_64/egg/tests/core/block copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_naive_block.py -> build/bdist.linux-x86_64/egg/tests/core/block copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_cpu_gpu_block_allocator.py -> build/bdist.linux-x86_64/egg/tests/core/block copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_common.py -> build/bdist.linux-x86_64/egg/tests/core/block copying 
build/lib.linux-x86_64-cpython-310/tests/core/block/test_prefix_caching_block.py -> build/bdist.linux-x86_64/egg/tests/core/block copying build/lib.linux-x86_64-cpython-310/tests/core/block/__init__.py -> build/bdist.linux-x86_64/egg/tests/core/block copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_block_table.py -> build/bdist.linux-x86_64/egg/tests/core/block copying build/lib.linux-x86_64-cpython-310/tests/core/block/test_block_space_manager.py -> build/bdist.linux-x86_64/egg/tests/core/block copying build/lib.linux-x86_64-cpython-310/tests/core/__init__.py -> build/bdist.linux-x86_64/egg/tests/core byte-compiling build/bdist.linux-x86_64/egg/vllm/test_utils.py to test_utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/layer.py to layer.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/ops/paged_attn.py to paged_attn.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/ops/prefix_prefill.py to prefix_prefill.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/ops/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/selector.py to selector.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/flash_attn.py to flash_attn.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/xformers.py to xformers.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/attention/backends/abstract.py to abstract.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/neuron_model_runner.py to neuron_model_runner.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/vllm/worker/worker.py to worker.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/cache_engine.py to cache_engine.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/model_runner.py to model_runner.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/worker/neuron_worker.py to neuron_worker.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer.py to tokenizer.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/chatglm.py to chatglm.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/jais.py to jais.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/mpt.py to mpt.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/dbrx.py to dbrx.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/configs/falcon.py to falcon.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/detokenizer.py to detokenizer.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/config.py to config.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizers/baichuan.py to baichuan.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/ray_tokenizer_group.py to ray_tokenizer_group.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/tokenizer_group.py to tokenizer_group.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/transformers_utils/tokenizer_group/base_tokenizer_group.py to base_tokenizer_group.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/models.py to models.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/worker_manager.py to worker_manager.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/request.py to request.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/punica.py to punica.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/lora.py to lora.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/lora/layers.py to layers.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/core/block_manager_v1.py to block_manager_v1.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/core/evictor.py to evictor.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/core/block_manager_v2.py to block_manager_v2.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/core/scheduler.py to scheduler.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/core/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/core/policy.py to policy.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/core/interfaces.py to interfaces.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/sampler.py to sampler.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/vllm/model_executor/layers/layernorm.py to layernorm.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/fused_moe/fused_moe.py to fused_moe.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/vocab_parallel_embedding.py to vocab_parallel_embedding.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/rotary_embedding.py to rotary_embedding.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/linear.py to linear.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/squeezellm.py to squeezellm.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/gptq.py to gptq.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/marlin.py to marlin.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/base_config.py to base_config.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/quantization/awq.py to awq.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops/sample.py to sample.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/ops/rand.py to rand.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/rejection_sampler.py to 
rejection_sampler.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/logits_processor.py to logits_processor.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/layers/activation.py to activation.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/sampling_metadata.py to sampling_metadata.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/model_loader.py to model_loader.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/guided_decoding.py to guided_decoding.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/guided_logits_processors.py to guided_logits_processors.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/parallel_state.py to parallel_state.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/custom_all_reduce.py to custom_all_reduce.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/communication_op.py to communication_op.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/pynccl.py to pynccl.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/parallel_utils/pynccl_utils.py to pynccl_utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/weight_utils.py to weight_utils.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/vllm/model_executor/models/bloom.py to bloom.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/orion.py to orion.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/qwen2_moe.py to qwen2_moe.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/decilm.py to decilm.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt2.py to gpt2.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt_bigcode.py to gpt_bigcode.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/qwen2.py to qwen2.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/starcoder2.py to starcoder2.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/opt.py to opt.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/mixtral_quant.py to mixtral_quant.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/olmo.py to olmo.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gemma.py to gemma.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/phi.py to phi.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/llama.py to llama.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt_j.py to gpt_j.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/chatglm.py to chatglm.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/commandr.py to commandr.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/vllm/model_executor/models/xverse.py to xverse.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/stablelm.py to stablelm.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/jais.py to jais.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/internlm2.py to internlm2.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/mixtral.py to mixtral.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/qwen.py to qwen.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/deepseek.py to deepseek.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/mpt.py to mpt.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/gpt_neox.py to gpt_neox.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/llava.py to llava.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/dbrx.py to dbrx.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/baichuan.py to baichuan.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/models/falcon.py to falcon.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/model_executor/neuron_model_loader.py to neuron_model_loader.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/ray_gpu_executor.py to ray_gpu_executor.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/executor_base.py to executor_base.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/gpu_executor.py to gpu_executor.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/vllm/executor/neuron_executor.py to neuron_executor.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/executor/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/sampling_params.py to sampling_params.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/config.py to config.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/llm.py to llm.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/api_server.py to api_server.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/serving_completion.py to serving_completion.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/serving_chat.py to serving_chat.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/api_server.py to api_server.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/protocol.py to protocol.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/cli_args.py to cli_args.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/entrypoints/openai/serving_engine.py to serving_engine.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/outputs.py to outputs.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/usage/usage_lib.py to usage_lib.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/usage/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/block.py to block.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/logger.py to 
logger.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/async_llm_engine.py to async_llm_engine.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/arg_utils.py to arg_utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/metrics.py to metrics.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/ray_utils.py to ray_utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/engine/llm_engine.py to llm_engine.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/sequence.py to sequence.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_utils.py to test_utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_batch_expansion.py to test_batch_expansion.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_metrics.py to test_metrics.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_spec_decode_worker.py to test_spec_decode_worker.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/spec_decode/test_multi_step_worker.py to test_multi_step_worker.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/worker/test_swap.py to test_swap.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/worker/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/worker/test_model_runner.py to test_model_runner.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/tokenization/test_tokenizer_group.py to test_tokenizer_group.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/tests/tokenization/test_detokenize.py to test_detokenize.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/tokenization/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/tokenization/test_cached_tokenizer.py to test_cached_tokenizer.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_tokenizer_group.py to test_tokenizer_group.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_utils.py to test_utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_lora.py to test_lora.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_layers.py to test_layers.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_mixtral.py to test_mixtral.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_chatglm3.py to test_chatglm3.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_lora_manager.py to test_lora_manager.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/conftest.py to conftest.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_llama.py to test_llama.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_worker.py to test_worker.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_punica.py to test_punica.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_baichuan.py to test_baichuan.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_layer_variation.py to test_layer_variation.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/lora/test_gemma.py to test_gemma.cpython-310.pyc byte-compiling 
build/bdist.linux-x86_64/egg/tests/core/utils.py to utils.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/test_block_manager.py to test_block_manager.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/test_scheduler.py to test_scheduler.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_naive_block.py to test_naive_block.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_cpu_gpu_block_allocator.py to test_cpu_gpu_block_allocator.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_common.py to test_common.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_prefix_caching_block.py to test_prefix_caching_block.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/__init__.py to __init__.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_block_table.py to test_block_table.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/block/test_block_space_manager.py to test_block_space_manager.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/tests/core/__init__.py to __init__.cpython-310.pyc creating stub loader for vllm/_moe_C.cpython-310-x86_64-linux-gnu.so creating stub loader for vllm/_C.cpython-310-x86_64-linux-gnu.so byte-compiling build/bdist.linux-x86_64/egg/vllm/_moe_C.py to _moe_C.cpython-310.pyc byte-compiling build/bdist.linux-x86_64/egg/vllm/_C.py to _C.cpython-310.pyc creating build/bdist.linux-x86_64/egg/EGG-INFO copying vllm.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO copying vllm.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO copying vllm.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO copying vllm.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO copying vllm.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO writing 
build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt zip_safe flag not set; analyzing archive contents... vllm.__pycache__._C.cpython-310: module references __file__ vllm.__pycache__._moe_C.cpython-310: module references __file__ vllm.model_executor.layers.fused_moe.__pycache__.fused_moe.cpython-310: module references __file__ creating dist creating dist/vllm-0.4.0-py3.10-linux-x86_64.egg and adding build/bdist.linux-x86_64/egg to it removing build/bdist.linux-x86_64/egg (and everything under it) Processing vllm-0.4.0-py3.10-linux-x86_64.egg creating /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg Extracting vllm-0.4.0-py3.10-linux-x86_64.egg to /usr/local/lib/python3.10/dist-packages Adding vllm 0.4.0 to easy-install.pth file Installed /usr/local/lib/python3.10/dist-packages/vllm-0.4.0-py3.10-linux-x86_64.egg Processing dependencies for vllm==0.4.0 Searching for tiktoken==0.6.0 Best match: tiktoken 0.6.0 Adding tiktoken 0.6.0 to easy-install.pth file detected new path ./vllm-0.4.0-py3.10-linux-x86_64.egg Using /usr/local/lib/python3.10/dist-packages Searching for outlines==0.0.34 Best match: outlines 0.0.34 Adding outlines 0.0.34 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for triton==2.1.0 Best match: triton 2.1.0 Adding triton 2.1.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for pynvml==11.5.0 Best match: pynvml 11.5.0 Adding pynvml 11.5.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for prometheus-client==0.20.0 Best match: prometheus-client 0.20.0 Adding prometheus-client 0.20.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for pydantic==2.6.4 Best match: pydantic 2.6.4 Adding pydantic 2.6.4 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for uvicorn==0.29.0 Best match: uvicorn 0.29.0 Adding uvicorn 0.29.0 to easy-install.pth file Installing uvicorn script 
to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for fastapi==0.110.0 Best match: fastapi 0.110.0 Adding fastapi 0.110.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for xformers==0.0.23.post1 Best match: xformers 0.0.23.post1 Adding xformers 0.0.23.post1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for transformers==4.39.2 Best match: transformers 4.39.2 Adding transformers 4.39.2 to easy-install.pth file Installing transformers-cli script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for py-cpuinfo==9.0.0 Best match: py-cpuinfo 9.0.0 Adding py-cpuinfo 9.0.0 to easy-install.pth file Installing cpuinfo script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for psutil==5.9.8 Best match: psutil 5.9.8 Adding psutil 5.9.8 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for requests==2.31.0 Best match: requests 2.31.0 Adding requests 2.31.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for torch==2.1.2 Best match: torch 2.1.2 Adding torch 2.1.2 to easy-install.pth file Installing convert-caffe2-to-onnx script to /usr/local/bin Installing convert-onnx-to-caffe2 script to /usr/local/bin Installing torchrun script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for numpy==1.26.4 Best match: numpy 1.26.4 Adding numpy 1.26.4 to easy-install.pth file Installing f2py script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for sentencepiece==0.2.0 Best match: sentencepiece 0.2.0 Adding sentencepiece 0.2.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for ray==2.10.0 Best match: ray 2.10.0 Adding ray 2.10.0 to easy-install.pth file Installing ray script to /usr/local/bin Installing rllib script to /usr/local/bin Installing serve script to /usr/local/bin Installing tune script to 
/usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for ninja==1.11.1.1 Best match: ninja 1.11.1.1 Adding ninja 1.11.1.1 to easy-install.pth file Installing ninja script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for cmake==3.29.0.1 Best match: cmake 3.29.0.1 Adding cmake 3.29.0.1 to easy-install.pth file Installing cmake script to /usr/local/bin Installing cpack script to /usr/local/bin Installing ctest script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for regex==2023.12.25 Best match: regex 2023.12.25 Adding regex 2023.12.25 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for jsonschema==4.21.1 Best match: jsonschema 4.21.1 Adding jsonschema 4.21.1 to easy-install.pth file Installing jsonschema script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for referencing==0.34.0 Best match: referencing 0.34.0 Adding referencing 0.34.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for joblib==1.3.2 Best match: joblib 1.3.2 Adding joblib 1.3.2 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for numba==0.59.1 Best match: numba 0.59.1 Adding numba 0.59.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for scipy==1.12.0 Best match: scipy 1.12.0 Adding scipy 1.12.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for diskcache==5.6.3 Best match: diskcache 5.6.3 Adding diskcache 5.6.3 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for cloudpickle==3.0.0 Best match: cloudpickle 3.0.0 Adding cloudpickle 3.0.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nest-asyncio==1.6.0 Best match: nest-asyncio 1.6.0 Adding nest-asyncio 1.6.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for lark==1.1.9 Best match: 
lark 1.1.9 Adding lark 1.1.9 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for Jinja2==3.1.3 Best match: Jinja2 3.1.3 Adding Jinja2 3.1.3 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for interegular==0.3.3 Best match: interegular 0.3.3 Adding interegular 0.3.3 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for filelock==3.13.3 Best match: filelock 3.13.3 Adding filelock 3.13.3 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for typing-extensions==4.10.0 Best match: typing-extensions 4.10.0 Adding typing-extensions 4.10.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for pydantic-core==2.16.3 Best match: pydantic-core 2.16.3 Adding pydantic-core 2.16.3 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for annotated-types==0.6.0 Best match: annotated-types 0.6.0 Adding annotated-types 0.6.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for websockets==12.0 Best match: websockets 12.0 Adding websockets 12.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for watchfiles==0.21.0 Best match: watchfiles 0.21.0 Adding watchfiles 0.21.0 to easy-install.pth file Installing watchfiles script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for uvloop==0.19.0 Best match: uvloop 0.19.0 Adding uvloop 0.19.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for PyYAML==6.0.1 Best match: PyYAML 6.0.1 Adding PyYAML 6.0.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for python-dotenv==1.0.1 Best match: python-dotenv 1.0.1 Adding python-dotenv 1.0.1 to easy-install.pth file Installing dotenv script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for httptools==0.6.1 Best match: httptools 0.6.1 Adding 
httptools 0.6.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for h11==0.14.0 Best match: h11 0.14.0 Adding h11 0.14.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for click==8.1.7 Best match: click 8.1.7 Adding click 8.1.7 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for starlette==0.36.3 Best match: starlette 0.36.3 Adding starlette 0.36.3 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for tqdm==4.66.2 Best match: tqdm 4.66.2 Adding tqdm 4.66.2 to easy-install.pth file Installing tqdm script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for safetensors==0.4.2 Best match: safetensors 0.4.2 Adding safetensors 0.4.2 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for tokenizers==0.15.2 Best match: tokenizers 0.15.2 Adding tokenizers 0.15.2 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for packaging==24.0 Best match: packaging 24.0 Adding packaging 24.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for huggingface-hub==0.21.4 Best match: huggingface-hub 0.21.4 Adding huggingface-hub 0.21.4 to easy-install.pth file Installing huggingface-cli script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for certifi==2024.2.2 Best match: certifi 2024.2.2 Adding certifi 2024.2.2 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for urllib3==2.2.1 Best match: urllib3 2.2.1 Adding urllib3 2.2.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for idna==3.6 Best match: idna 3.6 Adding idna 3.6 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for charset-normalizer==3.3.2 Best match: charset-normalizer 3.3.2 Adding charset-normalizer 3.3.2 to easy-install.pth file Installing normalizer script to 
/usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-nvtx-cu12==12.1.105 Best match: nvidia-nvtx-cu12 12.1.105 Adding nvidia-nvtx-cu12 12.1.105 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-nccl-cu12==2.18.1 Best match: nvidia-nccl-cu12 2.18.1 Adding nvidia-nccl-cu12 2.18.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-cusparse-cu12==12.1.0.106 Best match: nvidia-cusparse-cu12 12.1.0.106 Adding nvidia-cusparse-cu12 12.1.0.106 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-cusolver-cu12==11.4.5.107 Best match: nvidia-cusolver-cu12 11.4.5.107 Adding nvidia-cusolver-cu12 11.4.5.107 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-curand-cu12==10.3.2.106 Best match: nvidia-curand-cu12 10.3.2.106 Adding nvidia-curand-cu12 10.3.2.106 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-cufft-cu12==11.0.2.54 Best match: nvidia-cufft-cu12 11.0.2.54 Adding nvidia-cufft-cu12 11.0.2.54 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-cublas-cu12==12.1.3.1 Best match: nvidia-cublas-cu12 12.1.3.1 Adding nvidia-cublas-cu12 12.1.3.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-cudnn-cu12==8.9.2.26 Best match: nvidia-cudnn-cu12 8.9.2.26 Adding nvidia-cudnn-cu12 8.9.2.26 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-cuda-cupti-cu12==12.1.105 Best match: nvidia-cuda-cupti-cu12 12.1.105 Adding nvidia-cuda-cupti-cu12 12.1.105 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-cuda-runtime-cu12==12.1.105 Best match: nvidia-cuda-runtime-cu12 12.1.105 Adding nvidia-cuda-runtime-cu12 12.1.105 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching 
for nvidia-cuda-nvrtc-cu12==12.1.105 Best match: nvidia-cuda-nvrtc-cu12 12.1.105 Adding nvidia-cuda-nvrtc-cu12 12.1.105 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for fsspec==2024.3.1 Best match: fsspec 2024.3.1 Adding fsspec 2024.3.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for networkx==3.2.1 Best match: networkx 3.2.1 Adding networkx 3.2.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for sympy==1.12 Best match: sympy 1.12 Adding sympy 1.12 to easy-install.pth file Installing isympy script to /usr/local/bin Using /usr/local/lib/python3.10/dist-packages Searching for frozenlist==1.4.1 Best match: frozenlist 1.4.1 Adding frozenlist 1.4.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for aiosignal==1.3.1 Best match: aiosignal 1.3.1 Adding aiosignal 1.3.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for protobuf==4.25.3 Best match: protobuf 4.25.3 Adding protobuf 4.25.3 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for msgpack==1.0.8 Best match: msgpack 1.0.8 Adding msgpack 1.0.8 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for rpds-py==0.18.0 Best match: rpds-py 0.18.0 Adding rpds-py 0.18.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for jsonschema-specifications==2023.12.1 Best match: jsonschema-specifications 2023.12.1 Adding jsonschema-specifications 2023.12.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for attrs==23.2.0 Best match: attrs 23.2.0 Adding attrs 23.2.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for llvmlite==0.42.0 Best match: llvmlite 0.42.0 Adding llvmlite 0.42.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for MarkupSafe==2.1.5 Best match: MarkupSafe 
2.1.5 Adding MarkupSafe 2.1.5 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for anyio==4.3.0 Best match: anyio 4.3.0 Adding anyio 4.3.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for nvidia-nvjitlink-cu12==12.4.99 Best match: nvidia-nvjitlink-cu12 12.4.99 Adding nvidia-nvjitlink-cu12 12.4.99 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for mpmath==1.3.0 Best match: mpmath 1.3.0 Adding mpmath 1.3.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for exceptiongroup==1.2.0 Best match: exceptiongroup 1.2.0 Adding exceptiongroup 1.2.0 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Searching for sniffio==1.3.1 Best match: sniffio 1.3.1 Adding sniffio 1.3.1 to easy-install.pth file Using /usr/local/lib/python3.10/dist-packages Finished processing dependencies for vllm==0.4.0
The problem is weird; I couldn't figure it out. @youkaichao |
Looks like a network problem. |
Hi I'm still running into issues with v0.4.0.post1, though it includes the fix #3770 so I'm not running into
|
Looks like your program is killed by |
Oh I see. It's first SIGTERM, then we are calling |
Hi @youkaichao, I've run into the same problem with an NCCL error. I'm working with vLLM version 0.4.0 and CUDA 12.1. Here are some details about my setup:
Error message:
|
@JasmondL I was able to resolve the error with different This seems to be related to pytorch/pytorch#113245 (comment) |
same |
same issue on Tesla T4 GPU with v0.4.0.post1 |
@njhill, @youkaichao, I have the same issue with v0.4.0.post1, built from the latest mainline source (04/14/2024) with CUDA 11.8 (Nvidia); it reports the same error as in #3770. Basically, each Ray process sees only one GPU in total instead of the true GPU count (torch.cuda.device_count is always 1). Pinning Ray to 2.9.3 instead of 2.10.0, as in #3770, does not work for me (#3699). Reverting to 0ce0539 (04/07/2024) resolves this device initialization issue. |
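For anyone trying to confirm the symptom described in the comment above (each Ray worker seeing a single GPU), here is a rough sketch of one way to inspect visibility without touching CUDA. Ray normally restricts GPU-requesting workers by setting CUDA_VISIBLE_DEVICES, so counting its entries shows how many devices a process is allowed to see. The helper name `visible_gpu_ids` is hypothetical and is not part of vLLM or Ray.

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction applied),
    and an empty list when it is set but empty (all GPUs hidden).
    `env` defaults to os.environ; passing a dict makes this easy to test.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries are comma-separated; ignore blanks and surrounding spaces.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset; no restriction on visible GPUs")
    else:
        print(f"{len(ids)} GPU(s) visible to this process: {ids}")
```

Running this inside each worker process (rather than on the driver) is what matters here, since the restriction is applied per worker. If a worker reports exactly one ID, a device count of 1 in that process would be expected rather than a bug in itself.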
@kn1011 are you still experiencing this error? |
I got it working now!
|
Do you know if the latest mainline has fixed this issue?
|
@nilichen what did you do to get it working? |
Note though I'm doing something different and initializing the generator myself.
|
Your current environment
vLLM 0.3.3 on a Ray 2.10.0 cluster deployed with Docker on 2 nodes, each with 2 GPUs (Tesla T4).
linux environment
root@ai151:/vllm-workspace# env
NV_LIBCUBLAS_VERSION=12.1.0.26-1
NVIDIA_VISIBLE_DEVICES=all
NV_NVML_DEV_VERSION=12.1.55-1
NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1
HOSTNAME=ai151
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-12-1=12.1.0.26-1
NV_NVTX_VERSION=12.1.66-1
NV_CUDA_CUDART_DEV_VERSION=12.1.55-1
NV_LIBCUSPARSE_VERSION=12.0.2.55-1
NV_LIBNPP_VERSION=12.0.2.50-1
NCCL_VERSION=2.17.1-1
PWD=/vllm-workspace
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-1=12.1.55-1
NV_LIBNPP_PACKAGE=libnpp-12-1=12.0.2.50-1
NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
NV_LIBCUBLAS_DEV_VERSION=12.1.0.26-1
NVIDIA_PRODUCT_NAME=CUDA
NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-12-1
NV_CUDA_CUDART_VERSION=12.1.55-1
HOME=/root
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
NVIDIA_CUDA_END_OF_LIFE=1
CUDA_VERSION=12.1.0
NV_LIBCUBLAS_PACKAGE=libcublas-12-1=12.1.0.26-1
NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-1=12.1.0-1
NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-1=12.0.2.50-1
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-1
NV_LIBNPP_DEV_VERSION=12.0.2.50-1
LESSCLOSE=/usr/bin/lesspipe %s %s
TERM=xterm
NV_LIBCUSPARSE_DEV_VERSION=12.0.2.55-1
LESSOPEN=| /usr/bin/lesspipe %s
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
SHLVL=1
NV_CUDA_LIB_VERSION=12.1.0-1
NVARCH=x86_64
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1
NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NV_CUDA_NSIGHT_COMPUTE_VERSION=12.1.0-1
NV_NVPROF_VERSION=12.1.55-1
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NV_LIBNCCL_PACKAGE_NAME=libnccl2
NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1
_=/usr/bin/env
pip list
root@ai151:/vllm-workspace# pip list
Package Version
------------------------- ---------------
accelerate 0.28.0
aiofiles 23.2.1
aiohttp 3.9.3
aiohttp-cors 0.7.0
aiosignal 1.3.1
altair 5.2.0
annotated-types 0.6.0
anyio 4.3.0
async-timeout 4.0.3
attrs 23.2.0
awscli 1.32.70
botocore 1.34.70
cachetools 5.3.3
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.28.4
codespell 2.2.6
colorama 0.4.4
colorful 0.5.6
contourpy 1.2.0
cycler 0.12.1
deepspeed 0.14.0
diskcache 5.6.3
distlib 0.3.8
distro 1.9.0
docutils 0.16
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.110.0
ffmpy 0.3.2
filelock 3.13.3
flash-attn 2.5.6
fonttools 4.50.0
frozenlist 1.4.1
fsspec 2024.3.1
google-api-core 2.18.0
google-auth 2.29.0
googleapis-common-protos 1.63.0
gradio 4.24.0
gradio_client 0.14.0
grpcio 1.62.1
h11 0.14.0
hjson 3.1.0
httpcore 1.0.4
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.22.1
idna 3.6
importlib_resources 6.4.0
iniconfig 2.0.0
interegular 0.3.3
isort 5.13.2
Jinja2 3.1.3
jmespath 1.0.1
joblib 1.3.2
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lark 1.1.9
llvmlite 0.42.0
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.3
mdurl 0.1.2
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
mypy 0.991
mypy-extensions 1.0.0
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.1
numba 0.59.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
openai 1.14.3
opencensus 0.11.4
opencensus-context 0.1.3
orjson 3.10.0
outlines 0.0.34
packaging 24.0
pandas 2.2.1
peft 0.10.0
pillow 10.2.0
pip 22.0.2
platformdirs 4.2.0
pluggy 1.4.0
prometheus_client 0.20.0
proto-plus 1.23.0
protobuf 4.25.3
psutil 5.9.8
py 1.11.0
py-cpuinfo 9.0.0
py-spy 0.3.14
pyasn1 0.5.1
pyasn1_modules 0.4.0
pydantic 2.6.4
pydantic_core 2.16.3
pydub 0.25.1
Pygments 2.17.2
pynvml 11.5.0
pyparsing 3.1.2
pytest 8.1.1
pytest-asyncio 0.23.6
pytest-forked 1.6.0
pytest-rerunfailures 14.0
pytest-shard 0.1.2
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
ray 2.10.0
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
rich 13.7.1
rpds-py 0.18.0
rsa 4.7.2
ruff 0.3.4
s3transfer 0.10.1
safetensors 0.4.2
scipy 1.12.0
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 59.6.0
shellingham 1.5.4
six 1.16.0
smart-open 7.0.4
sniffio 1.3.1
starlette 0.36.3
sympy 1.12
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
torch 2.1.2
tqdm 4.66.2
transformers 4.39.1
triton 2.1.0
typer 0.11.0
types-PyYAML 6.0.12.20240311
types-requests 2.31.0.20240311
types-setuptools 69.2.0.20240317
typing_extensions 4.10.0
tzdata 2024.1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
virtualenv 20.25.1
vllm 0.3.3
watchfiles 0.21.0
websockets 11.0.3
wheel 0.37.1
wrapt 1.16.0
xformers 0.0.23.post1
yapf 0.32.0
yarl 1.9.4
🐛 Describe the bug
vLLM works fine with --tensor-parallel-size 2 but fails with --tensor-parallel-size 4:
RuntimeError: CUDA error: invalid device ordinal
root@ai151:/vllm-workspace# python3 -m vllm.entrypoints.api_server --model /models/openchat-3.5-0106/ --tensor-parallel-size 4 --dtype float16 --enforce-eager
WARNING 03-29 13:57:06 config.py:732] Casting torch.bfloat16 to torch.float16.
2024-03-29 13:57:06,969 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 10.4.80.151:6379...
2024-03-29 13:57:06,980 INFO worker.py:1743 -- Connected to Ray cluster. View the dashboard at 10.4.80.151:8265
INFO 03-29 13:57:09 llm_engine.py:70] Initializing an LLM engine (v0.3.3) with config: model=/models/openchat-3.5-0106/, tokenizer=/models/openchat-3.5-0106/, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-29 13:57:22 pynccl.py:49] Loading nccl from library libnccl.so
INFO 03-29 13:57:22 pynccl_utils.py:13] vLLM is using nccl==2.17.1
INFO 03-29 13:57:23 selector.py:33] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 03-29 13:57:23 selector.py:20] Using XFormers backend.
(RayWorkerVllm pid=392, ip=10.4.80.152) INFO 03-29 13:57:16 pynccl.py:49] Loading nccl from library libnccl.so
(RayWorkerVllm pid=392, ip=10.4.80.152) INFO 03-29 13:57:16 pynccl_utils.py:13] vLLM is using nccl==2.17.1
(RayWorkerVllm pid=11442) INFO 03-29 13:57:25 selector.py:33] Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerVllm pid=11442) INFO 03-29 13:57:25 selector.py:20] Using XFormers backend.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py, line 37, in execute_method
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] return executor(*args, **kwargs)
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py, line 100, in init_device
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py, line 286, in init_distributed_environment
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] pynccl_utils.init_process_group(
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/pynccl_utils.py, line 42, in init_process_group
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] comm = NCCLCommunicator(init_method=init_method,
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/pynccl.py, line 226, in __init__
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] torch.cuda.set_device(self.rank)
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] File /usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py, line 404, in set_device
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] torch._C._cuda_setDevice(device)
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] RuntimeError: CUDA error: invalid device ordinal
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerVllm pid=309, ip=10.4.80.152) ERROR 03-29 13:57:18 ray_utils.py:44]
(RayWorkerVllm pid=309, ip=10.4.80.152) Exception ignored in:
(RayWorkerVllm pid=309, ip=10.4.80.152) Traceback (most recent call last):
(RayWorkerVllm pid=309, ip=10.4.80.152) File /usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/pynccl.py, line 260, in __del__
(RayWorkerVllm pid=309, ip=10.4.80.152) _c_ncclCommDestroy(self.comm)
(RayWorkerVllm pid=309, ip=10.4.80.152) AttributeError: NCCLCommunicator object has no attribute comm
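Reading the traceback above, the failing call is torch.cuda.set_device(self.rank). With --tensor-parallel-size 4 spread over 2 nodes of 2 GPUs each, ranks run from 0 to 3, but each node only has device ordinals 0 and 1. Ranks 2 and 3 would therefore be out of range on their node, which is consistent with "invalid device ordinal". Below is a minimal sketch of the difference between a global rank and a per-node device index. It assumes ranks are assigned node by node and that every node has the same GPU count. `local_device_index` and `is_valid_ordinal` are hypothetical helpers for illustration, not vLLM's actual fix.

```python
def local_device_index(global_rank: int, gpus_per_node: int) -> int:
    """Map a global rank to a device ordinal that exists on its own node.

    Assumes ranks are laid out contiguously per node and that every node
    exposes the same number of GPUs (illustrative assumptions only).
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if global_rank < 0:
        raise ValueError("global_rank must be non-negative")
    return global_rank % gpus_per_node


def is_valid_ordinal(device: int, gpus_per_node: int) -> bool:
    """Return True if `device` is an ordinal present on a node."""
    return 0 <= device < gpus_per_node


if __name__ == "__main__":
    gpus_per_node = 2  # matches the 2 x Tesla T4 per node setup in this issue
    for rank in range(4):
        raw_ok = is_valid_ordinal(rank, gpus_per_node)
        local = local_device_index(rank, gpus_per_node)
        print(f"rank={rank} raw_ordinal_valid={raw_ok} local_index={local}")
```

Note that when a launcher such as Ray restricts each worker to a single GPU via CUDA_VISIBLE_DEVICES, the only valid ordinal inside that worker is 0, so the modulo mapping is just one possible scheme and may not be what a given setup needs.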