[Bug]: [V1][SpecDec] RuntimeError: CUDA error: an illegal memory access was encountered #13673

Open · Labels: bug
lbeisteiner opened this issue Feb 21, 2025 · 1 comment
Your current environment

I'm using the vllm/vllm-openai:v0.7.3 Docker image.

🐛 Describe the bug

I'm trying to run [ngram] speculative decoding on vLLM V1 with a fine-tuned Llama-3.2-3B, using the following parameters:

args: ["--model", "/models", "--disable-log-requests", "--max-model-len", "350", "--tensor-parallel-size", "1", "--port", "8080", "--speculative-model", "[ngram]", "--num-speculative-tokens", "3", "--ngram-prompt-lookup-max", "4", "--ngram-prompt-lookup-min", "3"]

The server starts up correctly, but after a few concurrent requests (~5 RPS) I receive the following error:

ERROR 02-21 06:30:25 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 284, in run_engine_core
ERROR 02-21 06:30:25 core.py:291]     engine_core.run_busy_loop()
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 327, in run_busy_loop
ERROR 02-21 06:30:25 core.py:291]     outputs = step_fn()
ERROR 02-21 06:30:25 core.py:291]               ^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 154, in step
ERROR 02-21 06:30:25 core.py:291]     output = self.model_executor.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.collective_rpc("execute_model",
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 02-21 06:30:25 core.py:291]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.model_runner.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 997, in execute_model
ERROR 02-21 06:30:25 core.py:291]     gen_lens = valid_mask.sum(dim=1).tolist()
ERROR 02-21 06:30:25 core.py:291]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 02-21 06:30:25 core.py:291] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-21 06:30:25 core.py:291] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-21 06:30:25 core.py:291] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 02-21 06:30:25 core.py:291]
ERROR 02-21 06:30:25 core.py:291]
CRITICAL 02-21 06:30:25 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
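The load pattern that triggers this is just a handful of concurrent completion requests. A rough sketch of the kind of client traffic I send (the prompt text, request count, and `max_tokens` are illustrative, not my exact payloads; the endpoint and port come from the server args above):

```python
import concurrent.futures
import requests

URL = "http://localhost:8080/v1/completions"  # port from the server args above

def send_one(i: int) -> int:
    # Payload follows the OpenAI completions schema served by vLLM;
    # the prompt and max_tokens are placeholders.
    resp = requests.post(
        URL,
        json={
            "model": "/models",
            "prompt": f"Request {i}: summarize the following text ...",
            "max_tokens": 64,
        },
        timeout=60,
    )
    return resp.status_code

# Keep roughly five requests in flight at once, similar to the ~5 RPS
# that reproduces the crash for me.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(send_one, range(20))))
```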

Full Logs
The logs show the initial successful requests and the error that follows. The worker process terminates, and no further requests can be processed.

INFO 02-21 06:26:51 __init__.py:207] Automatically detected platform cuda.
INFO 02-21 06:26:54 api_server.py:912] vLLM API server version 0.7.3
INFO 02-21 06:26:54 api_server.py:913] args: Namespace(host=None, port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/models', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=350, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model='[ngram]', speculative_model_quantization=None, num_speculative_tokens=3, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=4, ngram_prompt_lookup_min=3, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False)
WARNING 02-21 06:26:54 arg_utils.py:1385] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
INFO 02-21 06:26:59 config.py:549] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 02-21 06:26:59 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 02-21 06:26:59 config.py:705] Async output processing is not supported with speculative decoding currently.
INFO 02-21 06:26:59 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/models', speculative_config=SpeculativeConfig(draft_model='[ngram]', num_spec_tokens=3), tokenizer='/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=350, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/models, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 02-21 06:27:00 utils.py:2262] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fe24b25be30>
INFO 02-21 06:27:01 gpu_model_runner.py:1049] Starting to load model /models...
INFO 02-21 06:27:01 cuda.py:157] Using Flash Attention backend on V1 engine.
INFO 02-21 06:27:01 topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
INFO 02-21 06:27:01 rejection_sampler.py:37] Using FlashInfer for rejection sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:12<00:12, 12.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:15<00:00,  7.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:15<00:00,  7.98s/it]
INFO 02-21 06:27:18 gpu_model_runner.py:1060] Loading model weights took 6.0160 GB
INFO 02-21 06:27:23 backends.py:408] Using cache directory: /root/.cache/vllm/torch_compile_cache/d82fe9868f/rank_0 for vLLM's torch.compile
INFO 02-21 06:27:23 backends.py:418] Dynamo bytecode transform time: 5.39 s
INFO 02-21 06:27:26 backends.py:132] Cache the graph of shape None for later use
INFO 02-21 06:27:42 backends.py:144] Compiling a graph for general shape takes 18.69 s
INFO 02-21 06:27:44 monitor.py:33] torch.compile takes 24.08 s in total
INFO 02-21 06:27:45 kv_cache_utils.py:522] # GPU blocks: 19259
INFO 02-21 06:27:45 kv_cache_utils.py:525] Maximum concurrency for 350 tokens per request: 880.41x
INFO 02-21 06:28:07 gpu_model_runner.py:1339] Graph capturing finished in 22 secs, took 0.45 GiB
INFO 02-21 06:28:07 core.py:116] init engine (profile, create kv cache, warmup model) took 49.50 seconds
INFO 02-21 06:28:07 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 02-21 06:28:07 launcher.py:23] Available routes are:
INFO 02-21 06:28:07 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /health, Methods: GET
INFO 02-21 06:28:07 launcher.py:31] Route: /ping, Methods: GET, POST
INFO 02-21 06:28:07 launcher.py:31] Route: /tokenize, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /detokenize, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/models, Methods: GET
INFO 02-21 06:28:07 launcher.py:31] Route: /version, Methods: GET
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /pooling, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /score, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/score, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /rerank, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:46900 - "GET /metrics HTTP/1.1" 200 OK
INFO:     100.74.18.189:39520 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:39520 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:01 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:38246 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:38262 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:08 loggers.py:78] Avg prompt throughput: 5.0 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     127.0.0.1:48804 - "GET /metrics HTTP/1.1" 200 OK
INFO:     100.74.18.189:42376 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42382 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:13 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:29:18 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:50932 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:50936 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:23 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:30 loggers.py:78] Avg prompt throughput: 7.2 tokens/s, Avg generation throughput: 5.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42004 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42020 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:35 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:40 loggers.py:78] Avg prompt throughput: 6.5 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43174 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:43180 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:47 loggers.py:78] Avg prompt throughput: 4.1 tokens/s, Avg generation throughput: 4.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:29:52 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:37328 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:37332 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:57 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:30:02 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:44616 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:44616 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:30:07 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     127.0.0.1:56662 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-21 06:30:12 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:54116 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:54118 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:30:17 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:30:22 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:42812 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42818 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42836 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42838 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42842 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42836 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42812 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42818 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42846 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42846 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42836 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42818 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42812 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR 02-21 06:30:25 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 284, in run_engine_core
ERROR 02-21 06:30:25 core.py:291]     engine_core.run_busy_loop()
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 327, in run_busy_loop
ERROR 02-21 06:30:25 core.py:291]     outputs = step_fn()
ERROR 02-21 06:30:25 core.py:291]               ^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 154, in step
ERROR 02-21 06:30:25 core.py:291]     output = self.model_executor.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.collective_rpc("execute_model",
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 02-21 06:30:25 core.py:291]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.model_runner.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 997, in execute_model
ERROR 02-21 06:30:25 core.py:291]     gen_lens = valid_mask.sum(dim=1).tolist()
ERROR 02-21 06:30:25 core.py:291]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 02-21 06:30:25 core.py:291] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-21 06:30:25 core.py:291] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-21 06:30:25 core.py:291] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 02-21 06:30:25 core.py:291]
ERROR 02-21 06:30:25 core.py:291]
CRITICAL 02-21 06:30:25 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
INFO:     100.74.18.189:37356 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:37364 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:57270 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:57278 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:39742 - "GET /health HTTP/1.1" 200 OK

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
WoosukKwon (Collaborator) commented Feb 22, 2025

Hi @lbeisteiner, thanks for reporting the bug!
I experienced the same error. The error disappears when I disable the FlashInfer sampling kernel.
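For anyone who wants to try the same workaround, a minimal sketch, assuming the `VLLM_USE_FLASHINFER_SAMPLER` environment variable is what gates the FlashInfer top-p/top-k kernel in v0.7.3 (verify against your version; it must be set before vLLM starts its workers):

```python
import os

# Assumption: setting VLLM_USE_FLASHINFER_SAMPLER=0 makes vLLM fall back to
# the PyTorch-native top-p/top-k sampler instead of the FlashInfer kernel.
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"
os.environ["VLLM_USE_V1"] = "1"  # assumption: V1 engine toggle in v0.7.3

from vllm import LLM

# Same engine configuration as in the bug report above.
llm = LLM(
    model="/models",
    max_model_len=350,
    speculative_model="[ngram]",
    num_speculative_tokens=3,
    ngram_prompt_lookup_max=4,
    ngram_prompt_lookup_min=3,
)
```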
