[Bug]: Chunked prefill doesn't seem to work when --kv-cache-dtype fp8 #4381

Closed
rkooo567 opened this issue Apr 26, 2024 · 11 comments · Fixed by #7208

Labels: bug (Something isn't working)

Comments

rkooo567 (Collaborator) commented Apr 26, 2024

Your current environment

H100 (but I believe it happens on any machine)

🐛 Describe the bug

--enable-chunked-prefill --max-num-batched-tokens 2048 --kv-cache-dtype "fp8"

This combination seems to be broken with a dtype-incompatibility error.
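
For anyone who wants to reproduce this without the OpenAI server, here is a minimal sketch of the failing configuration using the offline LLM API. The model name is just a placeholder, and the keyword arguments are assumed to mirror the CLI flags above (verify against your vLLM version):

```python
from vllm import LLM, SamplingParams

# Sketch of the failing configuration: chunked prefill + fp8 KV cache.
llm = LLM(
    model="facebook/opt-125m",          # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
    kv_cache_dtype="fp8",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```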

rkooo567 added the bug label on Apr 26, 2024
josephrocca commented:

Yep, I can confirm. As far as I can tell, it isn't documented that using the two together is unsupported. I get this error on a dual-4090 machine:

2024-06-03T14:15:05.332567820Z     raise CompilationError(fn.src, node, repr(e)) from e
2024-06-03T14:15:05.332573240Z triton.compiler.errors.CompilationError: at 114:24:        off_v = (
2024-06-03T14:15:05.332578110Z             bn[:, None] * stride_v_cache_bs +
2024-06-03T14:15:05.332588389Z             cur_kv_head * stride_v_cache_h +
2024-06-03T14:15:05.332593288Z             offs_d[None, :] * stride_v_cache_d +
2024-06-03T14:15:05.332598147Z             (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
2024-06-03T14:15:05.332602996Z         k = tl.load(K_cache + off_k,
2024-06-03T14:15:05.332607825Z                     mask=dim_mask[:, None] &
2024-06-03T14:15:05.332612695Z                     ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
2024-06-03T14:15:05.332617564Z                     other=0.0)  # [D,N]
2024-06-03T14:15:05.332622383Z 
2024-06-03T14:15:05.332627142Z         qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
2024-06-03T14:15:05.332632532Z         qk += tl.dot(q, k)
2024-06-03T14:15:05.332637411Z                         ^
2024-06-03T14:15:05.332642260Z AssertionError('Both operands must be same type. First operand (fp16) and second operand (uint8)')

Some other engine args that I used, in case they're relevant:

--quantization gptq --dtype float16 --enforce-eager --tensor-parallel-size 2

rkooo567 (Collaborator, Author) commented Jun 3, 2024

Let me make a PR to raise an error for now. cc @comaniac, I believe you made this work before. Did you use kv-cache dtype fp8?
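
A hypothetical sketch of the kind of guard described above (function name and wording are made up here, not the actual vLLM code or PR):

```python
# Hypothetical guard: fail fast at config time instead of inside the
# Triton prefix-prefill kernel.
def verify_chunked_prefill_with_kv_cache_dtype(enable_chunked_prefill: bool,
                                               kv_cache_dtype: str) -> None:
    if enable_chunked_prefill and kv_cache_dtype.startswith("fp8"):
        raise ValueError(
            "Chunked prefill is currently not supported with "
            f"kv_cache_dtype={kv_cache_dtype!r}. Disable chunked prefill "
            "(--enable-chunked-prefill False) or use kv_cache_dtype='auto'.")
```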

comaniac (Collaborator) commented Jun 3, 2024

It should work with the xformers backend with paged attention, but I'm not sure whether that works with GPTQ.
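
For anyone who wants to try that path, the attention backend can be pinned with the VLLM_ATTENTION_BACKEND environment variable. A hedged sketch (placeholder model; whether this combination also works with GPTQ is exactly the open question here):

```python
import os

# Must be set before vLLM selects an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")  # placeholder model
```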

K-Mistele (Contributor) commented:

Same issue here. I am using Llama 3.1 8B, which has a context length of 128k. Chunked prefill is automatically enabled for models above a certain sequence length (128k exceeds it), and I found that I had to set --enable-chunked-prefill False in order to use --kv-cache-dtype fp8.
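
For reference, a sketch of that workaround via the offline API; the CLI equivalent is passing --enable-chunked-prefill False alongside --kv-cache-dtype fp8 (model name taken from the trace below, other arguments omitted):

```python
from vllm import LLM

# Workaround sketch: fp8 KV cache with chunked prefill explicitly disabled.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="half",
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,
)
```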

comaniac (Collaborator) commented Aug 6, 2024

That's not expected. For now, I'll file a PR to automatically disable chunked prefill when an fp8 KV cache is enabled.
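
A rough, illustrative sketch of that fallback behavior (names and placement invented for illustration, not the actual change):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative fallback: warn and disable instead of raising.
def resolve_chunked_prefill(enable_chunked_prefill: bool,
                            kv_cache_dtype: str) -> bool:
    if enable_chunked_prefill and kv_cache_dtype.startswith("fp8"):
        logger.warning("Chunked prefill does not work with fp8 KV cache yet; "
                       "disabling chunked prefill.")
        return False
    return enable_chunked_prefill
```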

K-Mistele (Contributor) commented:

I know it's super long, but here's the full trace:
(constellate-vllm) (venv) constellate@1-ai-appserver-staging:/mnt/disk/AI/constellate-vllm$ python -m vllm.entrypoints.openai.api_server --dtype half --kv-cache-dtype fp8 --model meta-llama/Meta-Llama-3.1-8B-Instruct --chat-template examples/tool_chat_template_llama_3_1.jinja --enable-auto-tool-choice --tool-call-parser llama3.1
INFO 08-05 23:12:08 api_server.py:370] vLLM API server version 0.5.3.post1
INFO 08-05 23:12:08 api_server.py:371] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='examples/tool_chat_template_llama_3_1.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=True, tool_call_parser='llama3.1', model='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='fp8', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-05 23:12:10 config.py:1439] Casting torch.bfloat16 to torch.float16.
WARNING 08-05 23:12:11 config.py:1439] Casting torch.bfloat16 to torch.float16.
INFO 08-05 23:12:11 config.py:483] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 08-05 23:12:11 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-05 23:12:11 config.py:819] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-05 23:12:11 llm_engine.py:174] Initializing an LLM engine (v0.5.3.post1) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-8B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-05 23:12:11 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-05 23:12:11 selector.py:54] Using XFormers backend.
/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-05 23:12:12 model_runner.py:720] Starting to load model meta-llama/Meta-Llama-3.1-8B-Instruct...
INFO 08-05 23:12:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-05 23:12:13 selector.py:54] Using XFormers backend.
INFO 08-05 23:12:13 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.84it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:02,  1.01s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.40s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.25s/it]

INFO 08-05 23:12:18 model_runner.py:732] Loading model weights took 14.9888 GB
INFO 08-05 23:12:19 gpu_executor.py:102] # GPU blocks: 12723, # CPU blocks: 4096
INFO 08-05 23:12:22 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-05 23:12:22 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-05 23:12:39 model_runner.py:1225] Graph capturing finished in 18 secs.
INFO 08-05 23:12:40 chat_utils.py:53] Using supplied chat template:
INFO 08-05 23:12:40 chat_utils.py:53] {{- bos_token }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if custom_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = custom_tools %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools_in_user_message is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools_in_user_message = true %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not date_string is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set date_string = "26 Jul 2024" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = none %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- This block extracts the system message, so we can slot it into the right place. #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if messages[0]['role'] == 'system' %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = "" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- System message + builtin tools #}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined or tools is not none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Environment: ipython\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- REMOVED BUILTIN TOOLS - NOT USED OR NEEDED FOR OPENAI COMPATIBILITY
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] #}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Cutting Knowledge Date: December 2023\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Today Date: " + date_string if date_string else '5 Aug 2024' + "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools is not none and not tools_in_user_message %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {{- system_message }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- Custom tools are passed in a user message with some extra guidance #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools_in_user_message and not tools is none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {#- Extract the first user message so we can plug it in here #}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if messages | length != 0 %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set first_user_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Given the following functions, please respond with a JSON for a function call " }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "with its proper arguments that best answers the given prompt.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- first_user_message + "<|eot_id|>"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {%- for message in messages %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif 'tool_calls' in message %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if not message.tool_calls|length == 1 %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- raise_exception("This model only supports single tool-calls at once!") }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set tool_call = message.tool_calls[0].function %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined and tool_call.name in builtin_tools %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|python_tag|>" + tool_call.name + ".call(" }}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- for arg_name, arg_val in tool_call.arguments | items %}
INFO 08-05 23:12:40 chat_utils.py:53]                 {{- arg_name + '="' + arg_val + '"' }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- if not loop.last %}
INFO 08-05 23:12:40 chat_utils.py:53]                     {{- ", " }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- ")" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else  %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '{"name": "' + tool_call.name + '", ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '"parameters": ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- tool_call.arguments | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "}" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]             {#- This means we're in ipython mode #}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eom_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif message.role == "tool" or message.role == "ipython" %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if message.content is mapping or message.content is iterable %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if add_generation_prompt %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 serving_chat.py:80] "Auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
WARNING 08-05 23:12:40 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-05 23:12:40 chat_utils.py:53] Using supplied chat template:
INFO 08-05 23:12:40 chat_utils.py:53] {{- bos_token }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if custom_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = custom_tools %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools_in_user_message is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools_in_user_message = true %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not date_string is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set date_string = "26 Jul 2024" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = none %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- This block extracts the system message, so we can slot it into the right place. #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if messages[0]['role'] == 'system' %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = "" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- System message + builtin tools #}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined or tools is not none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Environment: ipython\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- REMOVED BUILTIN TOOLS - NOT USED OR NEEDED FOR OPENAI COMPATIBILITY
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] #}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Cutting Knowledge Date: December 2023\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Today Date: " + date_string if date_string else '5 Aug 2024' + "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools is not none and not tools_in_user_message %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {{- system_message }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- Custom tools are passed in a user message with some extra guidance #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools_in_user_message and not tools is none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {#- Extract the first user message so we can plug it in here #}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if messages | length != 0 %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set first_user_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Given the following functions, please respond with a JSON for a function call " }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "with its proper arguments that best answers the given prompt.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- first_user_message + "<|eot_id|>"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {%- for message in messages %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif 'tool_calls' in message %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if not message.tool_calls|length == 1 %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- raise_exception("This model only supports single tool-calls at once!") }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set tool_call = message.tool_calls[0].function %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined and tool_call.name in builtin_tools %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|python_tag|>" + tool_call.name + ".call(" }}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- for arg_name, arg_val in tool_call.arguments | items %}
INFO 08-05 23:12:40 chat_utils.py:53]                 {{- arg_name + '="' + arg_val + '"' }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- if not loop.last %}
INFO 08-05 23:12:40 chat_utils.py:53]                     {{- ", " }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- ")" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else  %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '{"name": "' + tool_call.name + '", ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '"parameters": ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- tool_call.arguments | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "}" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]             {#- This means we're in ipython mode #}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eom_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif message.role == "tool" or message.role == "ipython" %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if message.content is mapping or message.content is iterable %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if add_generation_prompt %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 api_server.py:346] Available routes are:
INFO 08-05 23:12:40 api_server.py:351] Route: /openapi.json, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /docs, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /redoc, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /health, Methods: GET
INFO 08-05 23:12:40 api_server.py:351] Route: /tokenize, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /detokenize, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/models, Methods: GET
INFO 08-05 23:12:40 api_server.py:351] Route: /version, Methods: GET
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/chat/completions, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/completions, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [3431585]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 08-05 23:12:43 serving_chat.py:149] Created full prompt
INFO 08-05 23:12:43 serving_chat.py:150] <|begin_of_text|><|start_header_id|>system<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Environment: ipython
INFO 08-05 23:12:43 serving_chat.py:150] Cutting Knowledge Date: December 2023
INFO 08-05 23:12:43 serving_chat.py:150] Today Date: 26 Jul 2024<|eot_id|><|start_header_id|>user<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] {
INFO 08-05 23:12:43 serving_chat.py:150]     "type": "function",
INFO 08-05 23:12:43 serving_chat.py:150]     "function": {
INFO 08-05 23:12:43 serving_chat.py:150]         "name": "get_current_weather",
INFO 08-05 23:12:43 serving_chat.py:150]         "description": "Get the current weather in a given location",
INFO 08-05 23:12:43 serving_chat.py:150]         "parameters": {
INFO 08-05 23:12:43 serving_chat.py:150]             "type": "object",
INFO 08-05 23:12:43 serving_chat.py:150]             "properties": {
INFO 08-05 23:12:43 serving_chat.py:150]                 "city": {
INFO 08-05 23:12:43 serving_chat.py:150]                     "type": "string",
INFO 08-05 23:12:43 serving_chat.py:150]                     "description": "The city to find the weather for, e.g. 'San Francisco'"
INFO 08-05 23:12:43 serving_chat.py:150]                 },
INFO 08-05 23:12:43 serving_chat.py:150]                 "state": {
INFO 08-05 23:12:43 serving_chat.py:150]                     "type": "string",
INFO 08-05 23:12:43 serving_chat.py:150]                     "description": "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
INFO 08-05 23:12:43 serving_chat.py:150]                 },
INFO 08-05 23:12:43 serving_chat.py:150]                 "unit": {
INFO 08-05 23:12:43 serving_chat.py:150]                     "type": "string",
INFO 08-05 23:12:43 serving_chat.py:150]                     "description": "The unit to fetch the temperature in",
INFO 08-05 23:12:43 serving_chat.py:150]                     "enum": [
INFO 08-05 23:12:43 serving_chat.py:150]                         "celsius",
INFO 08-05 23:12:43 serving_chat.py:150]                         "fahrenheit"
INFO 08-05 23:12:43 serving_chat.py:150]                     ]
INFO 08-05 23:12:43 serving_chat.py:150]                 }
INFO 08-05 23:12:43 serving_chat.py:150]             }
INFO 08-05 23:12:43 serving_chat.py:150]         }
INFO 08-05 23:12:43 serving_chat.py:150]     }
INFO 08-05 23:12:43 serving_chat.py:150] }
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Hi! How are you doing today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] I'm doing well! How can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Can you tell me what the weather will be in Dallas and San Francisco? I like fahrenheit.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 logger.py:36] Received request chat-30088efd6e3645e2b07ea083bb9d7446: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nEnvironment: ipython\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGiven the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.\n\nRespond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.\n\n{\n    "type": "function",\n    "function": {\n        "name": "get_current_weather",\n        "description": "Get the current weather in a given location",\n        "parameters": {\n            "type": "object",\n            "properties": {\n                "city": {\n                    "type": "string",\n                    "description": "The city to find the weather for, e.g. \'San Francisco\'"\n                },\n                "state": {\n                    "type": "string",\n                    "description": "the two-letter abbreviation for the state that the city is in, e.g. \'CA\' which would mean \'California\'"\n                },\n                "unit": {\n                    "type": "string",\n                    "description": "The unit to fetch the temperature in",\n                    "enum": [\n                        "celsius",\n                        "fahrenheit"\n                    ]\n                }\n            }\n        }\n    }\n}\n\nHi! How are you doing today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI\'m doing well! How can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you tell me what the weather will be in Dallas and San Francisco? 
I like fahrenheit.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=130751, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 13013, 25, 6125, 27993, 198, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 128009, 128006, 882, 128007, 271, 22818, 279, 2768, 5865, 11, 4587, 6013, 449, 264, 4823, 369, 264, 734, 1650, 449, 1202, 6300, 6105, 430, 1888, 11503, 279, 2728, 10137, 382, 66454, 304, 279, 3645, 5324, 609, 794, 734, 836, 11, 330, 14105, 794, 11240, 315, 5811, 836, 323, 1202, 907, 7966, 5519, 539, 1005, 7482, 382, 517, 262, 330, 1337, 794, 330, 1723, 761, 262, 330, 1723, 794, 341, 286, 330, 609, 794, 330, 456, 11327, 70464, 761, 286, 330, 4789, 794, 330, 1991, 279, 1510, 9282, 304, 264, 2728, 3813, 761, 286, 330, 14105, 794, 341, 310, 330, 1337, 794, 330, 1735, 761, 310, 330, 13495, 794, 341, 394, 330, 9103, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 791, 3363, 311, 1505, 279, 9282, 369, 11, 384, 1326, 13, 364, 24661, 13175, 42265, 394, 1173, 394, 330, 2513, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 1820, 1403, 80468, 72578, 369, 279, 1614, 430, 279, 3363, 374, 304, 11, 384, 1326, 13, 364, 5158, 6, 902, 1053, 3152, 364, 46510, 42265, 394, 1173, 394, 330, 3928, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 791, 5089, 311, 7963, 279, 9499, 304, 761, 504, 330, 9195, 794, 2330, 667, 330, 66, 41347, 761, 667, 330, 69, 49010, 702, 504, 5243, 394, 457, 310, 457, 286, 457, 262, 457, 633, 13347, 0, 2650, 527, 499, 3815, 3432, 30, 128009, 128006, 78191, 128007, 271, 40, 2846, 3815, 1664, 0, 2650, 649, 358, 1520, 499, 30, 128009, 128006, 882, 128007, 271, 6854, 499, 3371, 757, 1148, 279, 9282, 690, 387, 304, 19051, 323, 5960, 13175, 30, 358, 1093, 282, 49010, 13, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 08-05 23:12:43 async_llm_engine.py:174] Added request chat-30088efd6e3645e2b07ea083bb9d7446.
ERROR 08-05 23:12:45 async_llm_engine.py:57] Engine background task failed
ERROR 08-05 23:12:45 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return fn(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 1534, in dot
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return semantic.dot(input, other, acc, input_precision, max_num_imprecise_acc, out_dtype, _builder)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1355, in dot
ERROR 08-05 23:12:45 async_llm_engine.py:57]     assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1328, in assert_dtypes_valid
ERROR 08-05 23:12:45 async_llm_engine.py:57]     assert lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!"
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57] AssertionError: First input (fp16) and second input (uint8) must have the same dtype!
ERROR 08-05 23:12:45 async_llm_engine.py:57] 
ERROR 08-05 23:12:45 async_llm_engine.py:57] The above exception was the direct cause of the following exception:
ERROR 08-05 23:12:45 async_llm_engine.py:57] 
ERROR 08-05 23:12:45 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-05 23:12:45 async_llm_engine.py:57]                    ^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
ERROR 08-05 23:12:45 async_llm_engine.py:57]     result = task.result()
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 585, in engine_step
ERROR 08-05 23:12:45 async_llm_engine.py:57]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-05 23:12:45 async_llm_engine.py:57]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 254, in step_async
ERROR 08-05 23:12:45 async_llm_engine.py:57]     output = await self.model_executor.execute_model_async(
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/executor/gpu_executor.py", line 159, in execute_model_async
ERROR 08-05 23:12:45 async_llm_engine.py:57]     output = await make_async(self.driver_worker.execute_model
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 08-05 23:12:45 async_llm_engine.py:57]     result = self.fn(*self.args, **self.kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 08-05 23:12:45 async_llm_engine.py:57]     output = self.model_runner.execute_model(
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return func(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/worker/model_runner.py", line 1363, in execute_model
ERROR 08-05 23:12:45 async_llm_engine.py:57]     hidden_or_intermediate_states = model_executable(
ERROR 08-05 23:12:45 async_llm_engine.py:57]                                     ^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 422, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 08-05 23:12:45 async_llm_engine.py:57]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     hidden_states, residual = layer(
ERROR 08-05 23:12:45 async_llm_engine.py:57]                               ^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     hidden_states = self.self_attn(
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     ^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 175, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 08-05 23:12:45 async_llm_engine.py:57]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/layer.py", line 98, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self.impl.forward(query,
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/backends/xformers.py", line 603, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     out = PagedAttention.forward_prefix(
ERROR 08-05 23:12:45 async_llm_engine.py:57]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
ERROR 08-05 23:12:45 async_llm_engine.py:57]     context_attention_fwd(
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return func(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
ERROR 08-05 23:12:45 async_llm_engine.py:57]     _fwd_kernel[grid](
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 662, in run
ERROR 08-05 23:12:45 async_llm_engine.py:57]     kernel = self.compile(
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276, in compile
ERROR 08-05 23:12:45 async_llm_engine.py:57]     module = src.make_ir(options, codegen_fns, context)
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113, in make_ir
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57] triton.compiler.errors.CompilationError: at 114:14:
ERROR 08-05 23:12:45 async_llm_engine.py:57]         off_v = (
ERROR 08-05 23:12:45 async_llm_engine.py:57]             bn[:, None] * stride_v_cache_bs +
ERROR 08-05 23:12:45 async_llm_engine.py:57]             cur_kv_head * stride_v_cache_h +
ERROR 08-05 23:12:45 async_llm_engine.py:57]             offs_d[None, :] * stride_v_cache_d +
ERROR 08-05 23:12:45 async_llm_engine.py:57]             (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
ERROR 08-05 23:12:45 async_llm_engine.py:57]         k = tl.load(K_cache + off_k,
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     mask=dim_mask[:, None] &
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     other=0.0)  # [D,N]
ERROR 08-05 23:12:45 async_llm_engine.py:57] 
ERROR 08-05 23:12:45 async_llm_engine.py:57]         qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
ERROR 08-05 23:12:45 async_llm_engine.py:57]         qk += tl.dot(q, k)
ERROR 08-05 23:12:45 async_llm_engine.py:57]               ^
Exception in callback _log_task_completion(error_callback=<bound method...7f999479a510>>)(<Task finishe...de320>, None)>) at /mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f999479a510>>)(<Task finishe...de320>, None)>) at /mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 1534, in dot
    return semantic.dot(input, other, acc, input_precision, max_num_imprecise_acc, out_dtype, _builder)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1355, in dot
    assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1328, in assert_dtypes_valid
    assert lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!"
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: First input (fp16) and second input (uint8) must have the same dtype!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 585, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 254, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/executor/gpu_executor.py", line 159, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/worker/worker_base.py", line 273, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 422, in forward
    model_output = self.model(input_ids, positions, kv_caches,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 322, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 245, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 175, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/backends/xformers.py", line 603, in forward
    out = PagedAttention.forward_prefix(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
    context_attention_fwd(
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
    _fwd_kernel[grid](
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 662, in run
    kernel = self.compile(
             ^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276, in compile
    module = src.make_ir(options, codegen_fns, context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
triton.compiler.errors.CompilationError: at 114:14:
        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k = tl.load(K_cache + off_k,
                    mask=dim_mask[:, None] &
                    ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                    other=0.0)  # [D,N]

        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
        qk += tl.dot(q, k)
              ^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-05 23:12:45 async_llm_engine.py:181] Aborted request chat-30088efd6e3645e2b07ea083bb9d7446.
INFO:     10.3.10.164:65191 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/api_server.py", line 191, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/serving_chat.py", line 241, in create_chat_completion
    generator = await self.chat_completion_full_generator(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/serving_chat.py", line 569, in chat_completion_full_generator
    async for res in result_generator:
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
triton.compiler.errors.CompilationError: at 114:14:
        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k = tl.load(K_cache + off_k,
                    mask=dim_mask[:, None] &
                    ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                    other=0.0)  # [D,N]

        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
        qk += tl.dot(q, k)
              ^
INFO 08-05 23:12:50 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 23:13:00 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
^CINFO 08-05 23:13:06 server.py:207] vLLM ZMQ RPC Server was interrupted.
INFO 08-05 23:13:06 api_server.py:396] Gracefully stopping http server

@K-Mistele
Contributor

(on a Tesla V100, hence fp16 instead of bf16)

@jon-chuang
Contributor

jon-chuang commented Aug 6, 2024

I know it's super long but here's the full trace:

Seems like a Triton kernel issue; looks fixable. Let me take a look.
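
For reference, the assertion comes from the prefix-prefill kernel loading the fp8 KV cache as raw uint8 and feeding it straight into tl.dot against an fp16 query tile. Below is a minimal sketch of the kind of in-kernel conversion that would be needed — this is not the actual vLLM kernel; the pointer layout, block sizes, and the E4M3 (tl.float8e4nv) choice are assumptions, and a real fix would also apply the kv-cache scaling factor:

import triton
import triton.language as tl

@triton.jit
def _qk_from_fp8_cache_sketch(Q, K_cache, Out,
                              BLOCK_M: tl.constexpr,
                              BLOCK_N: tl.constexpr,
                              BLOCK_D: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_D)

    # fp16 query tile, shape [M, D]
    q = tl.load(Q + offs_m[:, None] * BLOCK_D + offs_d[None, :])
    # fp8 KV cache is stored as raw uint8 bytes, shape [D, N]
    k_u8 = tl.load(K_cache + offs_d[:, None] * BLOCK_N + offs_n[None, :])

    # Sketch: reinterpret the bytes as fp8 (E4M3 assumed here), then upcast to
    # fp16 so both tl.dot operands have the same element type.
    k = k_u8.to(tl.float8e4nv, bitcast=True).to(tl.float16)

    qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M, N]
    qk += tl.dot(q, k)

    tl.store(Out + offs_m[:, None] * BLOCK_N + offs_n[None, :], qk)

The same conversion would be needed for the V tile before the second tl.dot in the kernel.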

Also:

# TODO(Hai) this triton kernel has regression issue (broke) to

Also, is this comment still relevant?

# Currently only ROCm accepts kv-cache scaling factors

@jon-chuang
Contributor

jon-chuang commented Aug 6, 2024

Related: #3880, #3156

@jon-chuang
Contributor

jon-chuang commented Aug 6, 2024

This PR is directly relevant as the solution: #3234 (although that PR is incomplete and incorrect)

Additional context on the FP8 KV cache: #4532, with a solution in #4893

@jon-chuang
Contributor

Btw, why is this not on the testing path? Where should a regression test for this be added?
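
A minimal sketch of what such a regression test could look like — the model choice, prompt construction, and placement are assumptions, not vLLM's actual test layout:

# Hypothetical regression test: run chunked prefill together with an fp8 KV
# cache and make sure the context-attention (prefix prefill) path does not crash.
import pytest
from vllm import LLM, SamplingParams


@pytest.mark.parametrize("kv_cache_dtype", ["fp8"])
def test_chunked_prefill_with_fp8_kv_cache(kv_cache_dtype):
    llm = LLM(
        model="facebook/opt-125m",       # small model keeps the test cheap
        kv_cache_dtype=kv_cache_dtype,
        enable_chunked_prefill=True,
        max_num_batched_tokens=512,      # force the prompt to be split into chunks
    )
    # Prompt longer than max_num_batched_tokens so later chunks attend over
    # already-cached (fp8) context.
    prompt = "Hello " * 1500
    outputs = llm.generate([prompt], SamplingParams(max_tokens=8))
    assert outputs[0].outputs[0].text is not None

Parametrizing it over the attention backends would also cover the backend-specific kernels.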
