[Bug]: [V1][SpecDec] RuntimeError: CUDA error: an illegal memory access was encountered #13673

Open · Labels: bug
lbeisteiner opened this issue Feb 21, 2025 · 1 comment
Your current environment

I'm using the vllm/vllm-openai:v0.7.3 Docker image.

🐛 Describe the bug

I'm trying to run [ngram] speculative decoding on vLLM V1 with a fine-tuned Llama-3.2-3B, using the following parameters:

args: ["--model", "/models", "--disable-log-requests", "--max-model-len", "350", "--tensor-parallel-size", "1", "--port", "8080", "--speculative-model", "[ngram]", "--num-speculative-tokens", "3", "--ngram-prompt-lookup-max", "4", "--ngram-prompt-lookup-min", "3"]

The server starts up correctly, but after a few concurrent requests (~5 RPS) I receive the following error:

ERROR 02-21 06:30:25 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 284, in run_engine_core
ERROR 02-21 06:30:25 core.py:291]     engine_core.run_busy_loop()
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 327, in run_busy_loop
ERROR 02-21 06:30:25 core.py:291]     outputs = step_fn()
ERROR 02-21 06:30:25 core.py:291]               ^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 154, in step
ERROR 02-21 06:30:25 core.py:291]     output = self.model_executor.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.collective_rpc("execute_model",
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 02-21 06:30:25 core.py:291]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.model_runner.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 997, in execute_model
ERROR 02-21 06:30:25 core.py:291]     gen_lens = valid_mask.sum(dim=1).tolist()
ERROR 02-21 06:30:25 core.py:291]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 02-21 06:30:25 core.py:291] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-21 06:30:25 core.py:291] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-21 06:30:25 core.py:291] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 02-21 06:30:25 core.py:291]
ERROR 02-21 06:30:25 core.py:291]
CRITICAL 02-21 06:30:25 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
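The load pattern that triggers this is just a handful of concurrent completion requests. A rough sketch of the kind of client traffic I send (the prompt text, request count, and `max_tokens` are illustrative, not my exact payloads; the endpoint and port come from the server args above):

```python
import concurrent.futures
import requests

URL = "http://localhost:8080/v1/completions"  # port from the server args above

def send_one(i: int) -> int:
    # Payload follows the OpenAI completions schema served by vLLM;
    # the prompt and max_tokens are placeholders.
    resp = requests.post(
        URL,
        json={
            "model": "/models",
            "prompt": f"Request {i}: summarize the following text ...",
            "max_tokens": 64,
        },
        timeout=60,
    )
    return resp.status_code

# Keep roughly five requests in flight at once, similar to the ~5 RPS
# that reproduces the crash for me.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    print(list(pool.map(send_one, range(20))))
```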

Full Logs
The logs show the initial successful requests and the error that follows. The worker process terminates, and no further requests can be processed.

INFO 02-21 06:26:51 __init__.py:207] Automatically detected platform cuda.
INFO 02-21 06:26:54 api_server.py:912] vLLM API server version 0.7.3
INFO 02-21 06:26:54 api_server.py:913] args: Namespace(host=None, port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/models', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=350, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model='[ngram]', speculative_model_quantization=None, num_speculative_tokens=3, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=4, ngram_prompt_lookup_min=3, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False)
WARNING 02-21 06:26:54 arg_utils.py:1385] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
INFO 02-21 06:26:59 config.py:549] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 02-21 06:26:59 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 02-21 06:26:59 config.py:705] Async output processing is not supported with speculative decoding currently.
INFO 02-21 06:26:59 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/models', speculative_config=SpeculativeConfig(draft_model='[ngram]', num_spec_tokens=3), tokenizer='/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=350, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/models, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 02-21 06:27:00 utils.py:2262] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fe24b25be30>
INFO 02-21 06:27:01 gpu_model_runner.py:1049] Starting to load model /models...
INFO 02-21 06:27:01 cuda.py:157] Using Flash Attention backend on V1 engine.
INFO 02-21 06:27:01 topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
INFO 02-21 06:27:01 rejection_sampler.py:37] Using FlashInfer for rejection sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:12<00:12, 12.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:15<00:00,  7.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:15<00:00,  7.98s/it]
INFO 02-21 06:27:18 gpu_model_runner.py:1060] Loading model weights took 6.0160 GB
INFO 02-21 06:27:23 backends.py:408] Using cache directory: /root/.cache/vllm/torch_compile_cache/d82fe9868f/rank_0 for vLLM's torch.compile
INFO 02-21 06:27:23 backends.py:418] Dynamo bytecode transform time: 5.39 s
INFO 02-21 06:27:26 backends.py:132] Cache the graph of shape None for later use
INFO 02-21 06:27:42 backends.py:144] Compiling a graph for general shape takes 18.69 s
INFO 02-21 06:27:44 monitor.py:33] torch.compile takes 24.08 s in total
INFO 02-21 06:27:45 kv_cache_utils.py:522] # GPU blocks: 19259
INFO 02-21 06:27:45 kv_cache_utils.py:525] Maximum concurrency for 350 tokens per request: 880.41x
INFO 02-21 06:28:07 gpu_model_runner.py:1339] Graph capturing finished in 22 secs, took 0.45 GiB
INFO 02-21 06:28:07 core.py:116] init engine (profile, create kv cache, warmup model) took 49.50 seconds
INFO 02-21 06:28:07 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8080
INFO 02-21 06:28:07 launcher.py:23] Available routes are:
INFO 02-21 06:28:07 launcher.py:31] Route: /openapi.json, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /docs, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /redoc, Methods: GET, HEAD
INFO 02-21 06:28:07 launcher.py:31] Route: /health, Methods: GET
INFO 02-21 06:28:07 launcher.py:31] Route: /ping, Methods: GET, POST
INFO 02-21 06:28:07 launcher.py:31] Route: /tokenize, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /detokenize, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/models, Methods: GET
INFO 02-21 06:28:07 launcher.py:31] Route: /version, Methods: GET
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /pooling, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /score, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/score, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /rerank, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 02-21 06:28:07 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:46900 - "GET /metrics HTTP/1.1" 200 OK
INFO:     100.74.18.189:39520 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:39520 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:01 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:38246 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:38262 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:38232 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:08 loggers.py:78] Avg prompt throughput: 5.0 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     127.0.0.1:48804 - "GET /metrics HTTP/1.1" 200 OK
INFO:     100.74.18.189:42376 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42382 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:13 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:29:18 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:50932 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:50936 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:23 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:30 loggers.py:78] Avg prompt throughput: 7.2 tokens/s, Avg generation throughput: 5.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:41998 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42004 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42020 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:35 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:40 loggers.py:78] Avg prompt throughput: 6.5 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43174 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:43180 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:43166 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-21 06:29:47 loggers.py:78] Avg prompt throughput: 4.1 tokens/s, Avg generation throughput: 4.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:29:52 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:37328 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:37332 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:29:57 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:30:02 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:44616 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:44616 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:30:07 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     127.0.0.1:56662 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-21 06:30:12 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:54116 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:54118 - "GET /health HTTP/1.1" 200 OK
INFO 02-21 06:30:17 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 02-21 06:30:22 loggers.py:78] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     100.74.18.189:42812 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42818 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42836 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42838 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42842 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:42836 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42812 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42818 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42846 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42824 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42846 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42836 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42818 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     100.74.18.189:42812 - "POST /v1/completions HTTP/1.1" 200 OK
ERROR 02-21 06:30:25 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 284, in run_engine_core
ERROR 02-21 06:30:25 core.py:291]     engine_core.run_busy_loop()
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 327, in run_busy_loop
ERROR 02-21 06:30:25 core.py:291]     outputs = step_fn()
ERROR 02-21 06:30:25 core.py:291]               ^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 154, in step
ERROR 02-21 06:30:25 core.py:291]     output = self.model_executor.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 75, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.collective_rpc("execute_model",
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 02-21 06:30:25 core.py:291]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
ERROR 02-21 06:30:25 core.py:291]     output = self.model_runner.execute_model(scheduler_output)
ERROR 02-21 06:30:25 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-21 06:30:25 core.py:291]     return func(*args, **kwargs)
ERROR 02-21 06:30:25 core.py:291]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 997, in execute_model
ERROR 02-21 06:30:25 core.py:291]     gen_lens = valid_mask.sum(dim=1).tolist()
ERROR 02-21 06:30:25 core.py:291]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 06:30:25 core.py:291] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 02-21 06:30:25 core.py:291] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-21 06:30:25 core.py:291] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-21 06:30:25 core.py:291] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 02-21 06:30:25 core.py:291]
ERROR 02-21 06:30:25 core.py:291]
CRITICAL 02-21 06:30:25 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
INFO:     100.74.18.189:37356 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:37364 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:57270 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:57278 - "GET /health HTTP/1.1" 200 OK
INFO:     100.74.18.189:39742 - "GET /health HTTP/1.1" 200 OK

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
WoosukKwon (Collaborator) commented Feb 22, 2025

Hi @lbeisteiner, thanks for reporting the bug!
I experienced the same error. The error disappears when I disable the FlashInfer sampling kernel.
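For anyone who wants to try the same workaround, a minimal sketch, assuming the `VLLM_USE_FLASHINFER_SAMPLER` environment variable is what gates the FlashInfer top-p/top-k kernel in v0.7.3 (verify against your version; it must be set before vLLM starts its workers):

```python
import os

# Assumption: setting VLLM_USE_FLASHINFER_SAMPLER=0 makes vLLM fall back to
# the PyTorch-native top-p/top-k sampler instead of the FlashInfer kernel.
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"
os.environ["VLLM_USE_V1"] = "1"  # assumption: V1 engine toggle in v0.7.3

from vllm import LLM

# Same engine configuration as in the bug report above.
llm = LLM(
    model="/models",
    max_model_len=350,
    speculative_model="[ngram]",
    num_speculative_tokens=3,
    ngram_prompt_lookup_max=4,
    ngram_prompt_lookup_min=3,
)
```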
