[Bug]: Chunked prefill doesn't seem to work when --kv-cache-dtype fp8 #4381

Closed
rkooo567 opened this issue Apr 26, 2024 · 11 comments · Fixed by #7208

Labels: bug (Something isn't working)

Comments

rkooo567 (Collaborator) commented Apr 26, 2024

Your current environment

H100 (but I believe it happens on any machine)

🐛 Describe the bug

--enable-chunked-prefill --max-num-batched-tokens 2048 --kv-cache-dtype "fp8"

This combination seems to be broken with a dtype-incompatibility error.
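
For anyone who wants to reproduce this without the OpenAI server, here is a minimal sketch of the failing configuration using the offline LLM API. The model name is just a placeholder, and the keyword arguments are assumed to mirror the CLI flags above (verify against your vLLM version):

```python
from vllm import LLM, SamplingParams

# Sketch of the failing configuration: chunked prefill + fp8 KV cache.
llm = LLM(
    model="facebook/opt-125m",          # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
    kv_cache_dtype="fp8",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```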

rkooo567 added the bug label on Apr 26, 2024
josephrocca commented:

Yep, I can confirm. As far as I can tell, it isn't documented that using the two together is unsupported. I get this error on a dual-4090 machine:

2024-06-03T14:15:05.332567820Z     raise CompilationError(fn.src, node, repr(e)) from e
2024-06-03T14:15:05.332573240Z triton.compiler.errors.CompilationError: at 114:24:        off_v = (
2024-06-03T14:15:05.332578110Z             bn[:, None] * stride_v_cache_bs +
2024-06-03T14:15:05.332588389Z             cur_kv_head * stride_v_cache_h +
2024-06-03T14:15:05.332593288Z             offs_d[None, :] * stride_v_cache_d +
2024-06-03T14:15:05.332598147Z             (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
2024-06-03T14:15:05.332602996Z         k = tl.load(K_cache + off_k,
2024-06-03T14:15:05.332607825Z                     mask=dim_mask[:, None] &
2024-06-03T14:15:05.332612695Z                     ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
2024-06-03T14:15:05.332617564Z                     other=0.0)  # [D,N]
2024-06-03T14:15:05.332622383Z 
2024-06-03T14:15:05.332627142Z         qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
2024-06-03T14:15:05.332632532Z         qk += tl.dot(q, k)
2024-06-03T14:15:05.332637411Z                         ^
2024-06-03T14:15:05.332642260Z AssertionError('Both operands must be same type. First operand (fp16) and second operand (uint8)')

Some other engine args that I used, in case they're relevant:

--quantization gptq --dtype float16 --enforce-eager --tensor-parallel-size 2

rkooo567 (Collaborator, Author) commented Jun 3, 2024

Let me make a PR to raise an error for now. cc @comaniac, I believe you made this work before. Did you use kv-cache dtype fp8?
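
A hypothetical sketch of the kind of guard described above (function name and wording are made up here, not the actual vLLM code or PR):

```python
# Hypothetical guard: fail fast at config time instead of inside the
# Triton prefix-prefill kernel.
def verify_chunked_prefill_with_kv_cache_dtype(enable_chunked_prefill: bool,
                                               kv_cache_dtype: str) -> None:
    if enable_chunked_prefill and kv_cache_dtype.startswith("fp8"):
        raise ValueError(
            "Chunked prefill is currently not supported with "
            f"kv_cache_dtype={kv_cache_dtype!r}. Disable chunked prefill "
            "(--enable-chunked-prefill False) or use kv_cache_dtype='auto'.")
```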

comaniac (Collaborator) commented Jun 3, 2024

It should work with the xformers backend with paged attention, but I'm not sure whether that works with GPTQ.
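
For anyone who wants to try that path, the attention backend can be pinned with the VLLM_ATTENTION_BACKEND environment variable. A hedged sketch (placeholder model; whether this combination also works with GPTQ is exactly the open question here):

```python
import os

# Must be set before vLLM selects an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")  # placeholder model
```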

K-Mistele (Contributor) commented:

Same issue here. I am using Llama 3.1 8B, which has a context length of 128k. Chunked prefill is automatically enabled for models above a certain sequence length (128k exceeds it), and I found that I had to set --enable-chunked-prefill False in order to use --kv-cache-dtype fp8.
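
For reference, a sketch of that workaround via the offline API; the CLI equivalent is passing --enable-chunked-prefill False alongside --kv-cache-dtype fp8 (model name taken from the trace below, other arguments omitted):

```python
from vllm import LLM

# Workaround sketch: fp8 KV cache with chunked prefill explicitly disabled.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="half",
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,
)
```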

comaniac (Collaborator) commented Aug 6, 2024

That's not expected. For now, I'll file a PR to automatically disable chunked prefill when an fp8 KV cache is enabled.
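
A rough, illustrative sketch of that fallback behavior (names and placement invented for illustration, not the actual change):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative fallback: warn and disable instead of raising.
def resolve_chunked_prefill(enable_chunked_prefill: bool,
                            kv_cache_dtype: str) -> bool:
    if enable_chunked_prefill and kv_cache_dtype.startswith("fp8"):
        logger.warning("Chunked prefill does not work with fp8 KV cache yet; "
                       "disabling chunked prefill.")
        return False
    return enable_chunked_prefill
```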

K-Mistele (Contributor) commented:

I know it's super long, but here's the full trace:
(constellate-vllm) (venv) constellate@1-ai-appserver-staging:/mnt/disk/AI/constellate-vllm$ python -m vllm.entrypoints.openai.api_server --dtype half --kv-cache-dtype fp8 --model meta-llama/Meta-Llama-3.1-8B-Instruct --chat-template examples/tool_chat_template_llama_3_1.jinja --enable-auto-tool-choice --tool-call-parser llama3.1
INFO 08-05 23:12:08 api_server.py:370] vLLM API server version 0.5.3.post1
INFO 08-05 23:12:08 api_server.py:371] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template='examples/tool_chat_template_llama_3_1.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=True, tool_call_parser='llama3.1', model='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='fp8', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-05 23:12:10 config.py:1439] Casting torch.bfloat16 to torch.float16.
WARNING 08-05 23:12:11 config.py:1439] Casting torch.bfloat16 to torch.float16.
INFO 08-05 23:12:11 config.py:483] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
WARNING 08-05 23:12:11 arg_utils.py:766] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-05 23:12:11 config.py:819] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-05 23:12:11 llm_engine.py:174] Initializing an LLM engine (v0.5.3.post1) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-8B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-05 23:12:11 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-05 23:12:11 selector.py:54] Using XFormers backend.
/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-05 23:12:12 model_runner.py:720] Starting to load model meta-llama/Meta-Llama-3.1-8B-Instruct...
INFO 08-05 23:12:13 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-05 23:12:13 selector.py:54] Using XFormers backend.
INFO 08-05 23:12:13 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.84it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:02,  1.01s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.40s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.25s/it]

INFO 08-05 23:12:18 model_runner.py:732] Loading model weights took 14.9888 GB
INFO 08-05 23:12:19 gpu_executor.py:102] # GPU blocks: 12723, # CPU blocks: 4096
INFO 08-05 23:12:22 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-05 23:12:22 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-05 23:12:39 model_runner.py:1225] Graph capturing finished in 18 secs.
INFO 08-05 23:12:40 chat_utils.py:53] Using supplied chat template:
INFO 08-05 23:12:40 chat_utils.py:53] {{- bos_token }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if custom_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = custom_tools %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools_in_user_message is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools_in_user_message = true %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not date_string is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set date_string = "26 Jul 2024" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = none %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- This block extracts the system message, so we can slot it into the right place. #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if messages[0]['role'] == 'system' %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = "" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- System message + builtin tools #}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined or tools is not none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Environment: ipython\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- REMOVED BUILTIN TOOLS - NOT USED OR NEEDED FOR OPENAI COMPATIBILITY
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] #}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Cutting Knowledge Date: December 2023\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Today Date: " + date_string if date_string else '5 Aug 2024' + "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools is not none and not tools_in_user_message %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {{- system_message }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- Custom tools are passed in a user message with some extra guidance #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools_in_user_message and not tools is none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {#- Extract the first user message so we can plug it in here #}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if messages | length != 0 %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set first_user_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Given the following functions, please respond with a JSON for a function call " }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "with its proper arguments that best answers the given prompt.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- first_user_message + "<|eot_id|>"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {%- for message in messages %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif 'tool_calls' in message %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if not message.tool_calls|length == 1 %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- raise_exception("This model only supports single tool-calls at once!") }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set tool_call = message.tool_calls[0].function %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined and tool_call.name in builtin_tools %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|python_tag|>" + tool_call.name + ".call(" }}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- for arg_name, arg_val in tool_call.arguments | items %}
INFO 08-05 23:12:40 chat_utils.py:53]                 {{- arg_name + '="' + arg_val + '"' }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- if not loop.last %}
INFO 08-05 23:12:40 chat_utils.py:53]                     {{- ", " }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- ")" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else  %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '{"name": "' + tool_call.name + '", ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '"parameters": ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- tool_call.arguments | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "}" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]             {#- This means we're in ipython mode #}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eom_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif message.role == "tool" or message.role == "ipython" %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if message.content is mapping or message.content is iterable %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if add_generation_prompt %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 serving_chat.py:80] "Auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
WARNING 08-05 23:12:40 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-05 23:12:40 chat_utils.py:53] Using supplied chat template:
INFO 08-05 23:12:40 chat_utils.py:53] {{- bos_token }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if custom_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = custom_tools %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools_in_user_message is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools_in_user_message = true %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not date_string is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set date_string = "26 Jul 2024" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if not tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set tools = none %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- This block extracts the system message, so we can slot it into the right place. #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if messages[0]['role'] == 'system' %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- set system_message = "" %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- System message + builtin tools #}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined or tools is not none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Environment: ipython\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {#- REMOVED BUILTIN TOOLS - NOT USED OR NEEDED FOR OPENAI COMPATIBILITY
INFO 08-05 23:12:40 chat_utils.py:53] {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] #}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Cutting Knowledge Date: December 2023\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "Today Date: " + date_string if date_string else '5 Aug 2024' + "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools is not none and not tools_in_user_message %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {{- system_message }}
INFO 08-05 23:12:40 chat_utils.py:53] {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {#- Custom tools are passed in a user message with some extra guidance #}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if tools_in_user_message and not tools is none %}
INFO 08-05 23:12:40 chat_utils.py:53]     {#- Extract the first user message so we can plug it in here #}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if messages | length != 0 %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set first_user_message = messages[0]['content']|trim %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set messages = messages[1:] %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Given the following functions, please respond with a JSON for a function call " }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "with its proper arguments that best answers the given prompt.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- "Do not use variables.\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- for t in tools %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- t | tojson(indent=4) }}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- first_user_message + "<|eot_id|>"}}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] 
INFO 08-05 23:12:40 chat_utils.py:53] {%- for message in messages %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif 'tool_calls' in message %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if not message.tool_calls|length == 1 %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- raise_exception("This model only supports single tool-calls at once!") }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- set tool_call = message.tool_calls[0].function %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined and tool_call.name in builtin_tools %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|python_tag|>" + tool_call.name + ".call(" }}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- for arg_name, arg_val in tool_call.arguments | items %}
INFO 08-05 23:12:40 chat_utils.py:53]                 {{- arg_name + '="' + arg_val + '"' }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- if not loop.last %}
INFO 08-05 23:12:40 chat_utils.py:53]                     {{- ", " }}
INFO 08-05 23:12:40 chat_utils.py:53]                 {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]             {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- ")" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else  %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '{"name": "' + tool_call.name + '", ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- '"parameters": ' }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- tool_call.arguments | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "}" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if builtin_tools is defined %}
INFO 08-05 23:12:40 chat_utils.py:53]             {#- This means we're in ipython mode #}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eom_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- elif message.role == "tool" or message.role == "ipython" %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- if message.content is mapping or message.content is iterable %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content | tojson }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- else %}
INFO 08-05 23:12:40 chat_utils.py:53]             {{- message.content }}
INFO 08-05 23:12:40 chat_utils.py:53]         {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53]         {{- "<|eot_id|>" }}
INFO 08-05 23:12:40 chat_utils.py:53]     {%- endif %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endfor %}
INFO 08-05 23:12:40 chat_utils.py:53] {%- if add_generation_prompt %}
INFO 08-05 23:12:40 chat_utils.py:53]     {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
INFO 08-05 23:12:40 chat_utils.py:53] {%- endif %}
INFO 08-05 23:12:40 api_server.py:346] Available routes are:
INFO 08-05 23:12:40 api_server.py:351] Route: /openapi.json, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /docs, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /redoc, Methods: HEAD, GET
INFO 08-05 23:12:40 api_server.py:351] Route: /health, Methods: GET
INFO 08-05 23:12:40 api_server.py:351] Route: /tokenize, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /detokenize, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/models, Methods: GET
INFO 08-05 23:12:40 api_server.py:351] Route: /version, Methods: GET
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/chat/completions, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/completions, Methods: POST
INFO 08-05 23:12:40 api_server.py:351] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [3431585]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 08-05 23:12:43 serving_chat.py:149] Created full prompt
INFO 08-05 23:12:43 serving_chat.py:150] <|begin_of_text|><|start_header_id|>system<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Environment: ipython
INFO 08-05 23:12:43 serving_chat.py:150] Cutting Knowledge Date: December 2023
INFO 08-05 23:12:43 serving_chat.py:150] Today Date: 26 Jul 2024<|eot_id|><|start_header_id|>user<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] {
INFO 08-05 23:12:43 serving_chat.py:150]     "type": "function",
INFO 08-05 23:12:43 serving_chat.py:150]     "function": {
INFO 08-05 23:12:43 serving_chat.py:150]         "name": "get_current_weather",
INFO 08-05 23:12:43 serving_chat.py:150]         "description": "Get the current weather in a given location",
INFO 08-05 23:12:43 serving_chat.py:150]         "parameters": {
INFO 08-05 23:12:43 serving_chat.py:150]             "type": "object",
INFO 08-05 23:12:43 serving_chat.py:150]             "properties": {
INFO 08-05 23:12:43 serving_chat.py:150]                 "city": {
INFO 08-05 23:12:43 serving_chat.py:150]                     "type": "string",
INFO 08-05 23:12:43 serving_chat.py:150]                     "description": "The city to find the weather for, e.g. 'San Francisco'"
INFO 08-05 23:12:43 serving_chat.py:150]                 },
INFO 08-05 23:12:43 serving_chat.py:150]                 "state": {
INFO 08-05 23:12:43 serving_chat.py:150]                     "type": "string",
INFO 08-05 23:12:43 serving_chat.py:150]                     "description": "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
INFO 08-05 23:12:43 serving_chat.py:150]                 },
INFO 08-05 23:12:43 serving_chat.py:150]                 "unit": {
INFO 08-05 23:12:43 serving_chat.py:150]                     "type": "string",
INFO 08-05 23:12:43 serving_chat.py:150]                     "description": "The unit to fetch the temperature in",
INFO 08-05 23:12:43 serving_chat.py:150]                     "enum": [
INFO 08-05 23:12:43 serving_chat.py:150]                         "celsius",
INFO 08-05 23:12:43 serving_chat.py:150]                         "fahrenheit"
INFO 08-05 23:12:43 serving_chat.py:150]                     ]
INFO 08-05 23:12:43 serving_chat.py:150]                 }
INFO 08-05 23:12:43 serving_chat.py:150]             }
INFO 08-05 23:12:43 serving_chat.py:150]         }
INFO 08-05 23:12:43 serving_chat.py:150]     }
INFO 08-05 23:12:43 serving_chat.py:150] }
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Hi! How are you doing today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] I'm doing well! How can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] Can you tell me what the weather will be in Dallas and San Francisco? I like fahrenheit.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 serving_chat.py:150] 
INFO 08-05 23:12:43 logger.py:36] Received request chat-30088efd6e3645e2b07ea083bb9d7446: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nEnvironment: ipython\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGiven the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.\n\nRespond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.\n\n{\n    "type": "function",\n    "function": {\n        "name": "get_current_weather",\n        "description": "Get the current weather in a given location",\n        "parameters": {\n            "type": "object",\n            "properties": {\n                "city": {\n                    "type": "string",\n                    "description": "The city to find the weather for, e.g. \'San Francisco\'"\n                },\n                "state": {\n                    "type": "string",\n                    "description": "the two-letter abbreviation for the state that the city is in, e.g. \'CA\' which would mean \'California\'"\n                },\n                "unit": {\n                    "type": "string",\n                    "description": "The unit to fetch the temperature in",\n                    "enum": [\n                        "celsius",\n                        "fahrenheit"\n                    ]\n                }\n            }\n        }\n    }\n}\n\nHi! How are you doing today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI\'m doing well! How can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you tell me what the weather will be in Dallas and San Francisco? 
I like fahrenheit.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=130751, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 13013, 25, 6125, 27993, 198, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 128009, 128006, 882, 128007, 271, 22818, 279, 2768, 5865, 11, 4587, 6013, 449, 264, 4823, 369, 264, 734, 1650, 449, 1202, 6300, 6105, 430, 1888, 11503, 279, 2728, 10137, 382, 66454, 304, 279, 3645, 5324, 609, 794, 734, 836, 11, 330, 14105, 794, 11240, 315, 5811, 836, 323, 1202, 907, 7966, 5519, 539, 1005, 7482, 382, 517, 262, 330, 1337, 794, 330, 1723, 761, 262, 330, 1723, 794, 341, 286, 330, 609, 794, 330, 456, 11327, 70464, 761, 286, 330, 4789, 794, 330, 1991, 279, 1510, 9282, 304, 264, 2728, 3813, 761, 286, 330, 14105, 794, 341, 310, 330, 1337, 794, 330, 1735, 761, 310, 330, 13495, 794, 341, 394, 330, 9103, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 791, 3363, 311, 1505, 279, 9282, 369, 11, 384, 1326, 13, 364, 24661, 13175, 42265, 394, 1173, 394, 330, 2513, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 1820, 1403, 80468, 72578, 369, 279, 1614, 430, 279, 3363, 374, 304, 11, 384, 1326, 13, 364, 5158, 6, 902, 1053, 3152, 364, 46510, 42265, 394, 1173, 394, 330, 3928, 794, 341, 504, 330, 1337, 794, 330, 928, 761, 504, 330, 4789, 794, 330, 791, 5089, 311, 7963, 279, 9499, 304, 761, 504, 330, 9195, 794, 2330, 667, 330, 66, 41347, 761, 667, 330, 69, 49010, 702, 504, 5243, 394, 457, 310, 457, 286, 457, 262, 457, 633, 13347, 0, 2650, 527, 499, 3815, 3432, 30, 128009, 128006, 78191, 128007, 271, 40, 2846, 3815, 1664, 0, 2650, 649, 358, 1520, 499, 30, 128009, 128006, 882, 128007, 271, 6854, 499, 3371, 757, 1148, 279, 9282, 690, 387, 304, 19051, 323, 5960, 13175, 30, 358, 1093, 282, 49010, 13, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 08-05 23:12:43 async_llm_engine.py:174] Added request chat-30088efd6e3645e2b07ea083bb9d7446.
ERROR 08-05 23:12:45 async_llm_engine.py:57] Engine background task failed
ERROR 08-05 23:12:45 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return fn(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 1534, in dot
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return semantic.dot(input, other, acc, input_precision, max_num_imprecise_acc, out_dtype, _builder)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1355, in dot
ERROR 08-05 23:12:45 async_llm_engine.py:57]     assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1328, in assert_dtypes_valid
ERROR 08-05 23:12:45 async_llm_engine.py:57]     assert lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!"
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57] AssertionError: First input (fp16) and second input (uint8) must have the same dtype!
ERROR 08-05 23:12:45 async_llm_engine.py:57] 
ERROR 08-05 23:12:45 async_llm_engine.py:57] The above exception was the direct cause of the following exception:
ERROR 08-05 23:12:45 async_llm_engine.py:57] 
ERROR 08-05 23:12:45 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-05 23:12:45 async_llm_engine.py:57]                    ^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
ERROR 08-05 23:12:45 async_llm_engine.py:57]     result = task.result()
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 585, in engine_step
ERROR 08-05 23:12:45 async_llm_engine.py:57]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-05 23:12:45 async_llm_engine.py:57]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 254, in step_async
ERROR 08-05 23:12:45 async_llm_engine.py:57]     output = await self.model_executor.execute_model_async(
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/executor/gpu_executor.py", line 159, in execute_model_async
ERROR 08-05 23:12:45 async_llm_engine.py:57]     output = await make_async(self.driver_worker.execute_model
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 08-05 23:12:45 async_llm_engine.py:57]     result = self.fn(*self.args, **self.kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 08-05 23:12:45 async_llm_engine.py:57]     output = self.model_runner.execute_model(
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return func(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/worker/model_runner.py", line 1363, in execute_model
ERROR 08-05 23:12:45 async_llm_engine.py:57]     hidden_or_intermediate_states = model_executable(
ERROR 08-05 23:12:45 async_llm_engine.py:57]                                     ^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 422, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 08-05 23:12:45 async_llm_engine.py:57]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     hidden_states, residual = layer(
ERROR 08-05 23:12:45 async_llm_engine.py:57]                               ^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     hidden_states = self.self_attn(
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     ^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 175, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 08-05 23:12:45 async_llm_engine.py:57]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/layer.py", line 98, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return self.impl.forward(query,
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/backends/xformers.py", line 603, in forward
ERROR 08-05 23:12:45 async_llm_engine.py:57]     out = PagedAttention.forward_prefix(
ERROR 08-05 23:12:45 async_llm_engine.py:57]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
ERROR 08-05 23:12:45 async_llm_engine.py:57]     context_attention_fwd(
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return func(*args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
ERROR 08-05 23:12:45 async_llm_engine.py:57]     _fwd_kernel[grid](
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
ERROR 08-05 23:12:45 async_llm_engine.py:57]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 662, in run
ERROR 08-05 23:12:45 async_llm_engine.py:57]     kernel = self.compile(
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276, in compile
ERROR 08-05 23:12:45 async_llm_engine.py:57]     module = src.make_ir(options, codegen_fns, context)
ERROR 08-05 23:12:45 async_llm_engine.py:57]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57]   File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113, in make_ir
ERROR 08-05 23:12:45 async_llm_engine.py:57]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
ERROR 08-05 23:12:45 async_llm_engine.py:57]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-05 23:12:45 async_llm_engine.py:57] triton.compiler.errors.CompilationError: at 114:14:
ERROR 08-05 23:12:45 async_llm_engine.py:57]         off_v = (
ERROR 08-05 23:12:45 async_llm_engine.py:57]             bn[:, None] * stride_v_cache_bs +
ERROR 08-05 23:12:45 async_llm_engine.py:57]             cur_kv_head * stride_v_cache_h +
ERROR 08-05 23:12:45 async_llm_engine.py:57]             offs_d[None, :] * stride_v_cache_d +
ERROR 08-05 23:12:45 async_llm_engine.py:57]             (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
ERROR 08-05 23:12:45 async_llm_engine.py:57]         k = tl.load(K_cache + off_k,
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     mask=dim_mask[:, None] &
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
ERROR 08-05 23:12:45 async_llm_engine.py:57]                     other=0.0)  # [D,N]
ERROR 08-05 23:12:45 async_llm_engine.py:57] 
ERROR 08-05 23:12:45 async_llm_engine.py:57]         qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
ERROR 08-05 23:12:45 async_llm_engine.py:57]         qk += tl.dot(q, k)
ERROR 08-05 23:12:45 async_llm_engine.py:57]               ^
Exception in callback _log_task_completion(error_callback=<bound method...7f999479a510>>)(<Task finishe...de320>, None)>) at /mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f999479a510>>)(<Task finishe...de320>, None)>) at /mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 35, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/core.py", line 1534, in dot
    return semantic.dot(input, other, acc, input_precision, max_num_imprecise_acc, out_dtype, _builder)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1355, in dot
    assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/language/semantic.py", line 1328, in assert_dtypes_valid
    assert lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!"
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: First input (fp16) and second input (uint8) must have the same dtype!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 585, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 254, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/executor/gpu_executor.py", line 159, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/worker/worker_base.py", line 273, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 422, in forward
    model_output = self.model(input_ids, positions, kv_caches,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 322, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 245, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/model_executor/models/llama.py", line 175, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/backends/xformers.py", line 603, in forward
    out = PagedAttention.forward_prefix(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/paged_attn.py", line 208, in forward_prefix
    context_attention_fwd(
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/attention/ops/prefix_prefill.py", line 765, in context_attention_fwd
    _fwd_kernel[grid](
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/runtime/jit.py", line 662, in run
    kernel = self.compile(
             ^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 276, in compile
    module = src.make_ir(options, codegen_fns, context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/triton/compiler/compiler.py", line 113, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
triton.compiler.errors.CompilationError: at 114:14:
        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k = tl.load(K_cache + off_k,
                    mask=dim_mask[:, None] &
                    ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                    other=0.0)  # [D,N]

        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
        qk += tl.dot(q, k)
              ^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/mnt/disk/AI/constellate-vllm/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-05 23:12:45 async_llm_engine.py:181] Aborted request chat-30088efd6e3645e2b07ea083bb9d7446.
INFO:     10.3.10.164:65191 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/venv/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/api_server.py", line 191, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/serving_chat.py", line 241, in create_chat_completion
    generator = await self.chat_completion_full_generator(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/serving_chat.py", line 569, in chat_completion_full_generator
    async for res in result_generator:
  File "/mnt/disk/AI/constellate-vllm/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
triton.compiler.errors.CompilationError: at 114:14:
        off_v = (
            bn[:, None] * stride_v_cache_bs +
            cur_kv_head * stride_v_cache_h +
            offs_d[None, :] * stride_v_cache_d +
            (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
        k = tl.load(K_cache + off_k,
                    mask=dim_mask[:, None] &
                    ((start_n + offs_n[None, :]) < cur_batch_ctx_len),
                    other=0.0)  # [D,N]

        qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
        qk += tl.dot(q, k)
              ^
INFO 08-05 23:12:50 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 23:13:00 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
^CINFO 08-05 23:13:06 server.py:207] vLLM ZMQ RPC Server was interrupted.
INFO 08-05 23:13:06 api_server.py:396] Gracefully stopping http server

@K-Mistele
Contributor

(on a Tesla V100, hence fp16 instead of bf16)

@jon-chuang
Contributor

jon-chuang commented Aug 6, 2024

I know it's super long but here's the full trace:

Seems like a Triton kernel issue; looks fixable. Let me take a look.
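
For reference, the assertion comes from the prefix-prefill kernel loading the fp8 KV cache as raw uint8 and feeding it straight into tl.dot against an fp16 query tile. Below is a minimal sketch of the kind of in-kernel conversion that would be needed — this is not the actual vLLM kernel; the pointer layout, block sizes, and the E4M3 (tl.float8e4nv) choice are assumptions, and a real fix would also apply the kv-cache scaling factor:

import triton
import triton.language as tl

@triton.jit
def _qk_from_fp8_cache_sketch(Q, K_cache, Out,
                              BLOCK_M: tl.constexpr,
                              BLOCK_N: tl.constexpr,
                              BLOCK_D: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_D)

    # fp16 query tile, shape [M, D]
    q = tl.load(Q + offs_m[:, None] * BLOCK_D + offs_d[None, :])
    # fp8 KV cache is stored as raw uint8 bytes, shape [D, N]
    k_u8 = tl.load(K_cache + offs_d[:, None] * BLOCK_N + offs_n[None, :])

    # Sketch: reinterpret the bytes as fp8 (E4M3 assumed here), then upcast to
    # fp16 so both tl.dot operands have the same element type.
    k = k_u8.to(tl.float8e4nv, bitcast=True).to(tl.float16)

    qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M, N]
    qk += tl.dot(q, k)

    tl.store(Out + offs_m[:, None] * BLOCK_N + offs_n[None, :], qk)

The same conversion would be needed for the V tile before the second tl.dot in the kernel.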

Also:

# TODO(Hai) this triton kernel has regression issue (broke) to

Also, is this comment still relevant?

# Currently only ROCm accepts kv-cache scaling factors

@jon-chuang
Contributor

jon-chuang commented Aug 6, 2024

Related: #3880, #3156

@jon-chuang
Contributor

jon-chuang commented Aug 6, 2024

This PR is directly relevant as the solution: #3234 (although that PR is incomplete and incorrect)

Additional context on the FP8 KV cache: #4532, with a solution in #4893

@jon-chuang
Contributor

Btw, why is this not on the testing path? Where should a regression test for this be added?
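
A minimal sketch of what such a regression test could look like — the model choice, prompt construction, and placement are assumptions, not vLLM's actual test layout:

# Hypothetical regression test: run chunked prefill together with an fp8 KV
# cache and make sure the context-attention (prefix prefill) path does not crash.
import pytest
from vllm import LLM, SamplingParams


@pytest.mark.parametrize("kv_cache_dtype", ["fp8"])
def test_chunked_prefill_with_fp8_kv_cache(kv_cache_dtype):
    llm = LLM(
        model="facebook/opt-125m",       # small model keeps the test cheap
        kv_cache_dtype=kv_cache_dtype,
        enable_chunked_prefill=True,
        max_num_batched_tokens=512,      # force the prompt to be split into chunks
    )
    # Prompt longer than max_num_batched_tokens so later chunks attend over
    # already-cached (fp8) context.
    prompt = "Hello " * 1500
    outputs = llm.generate([prompt], SamplingParams(max_tokens=8))
    assert outputs[0].outputs[0].text is not None

Parametrizing it over the attention backends would also cover the backend-specific kernels.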
