
[Bug]: Internal Server Error when hosting Alibaba-NLP/gte-Qwen2-7B-instruct #5827

Closed
markkofler opened this issue Jun 25, 2024 · 5 comments · Fixed by #10184
Labels: bug (Something isn't working), keep-open (Prevents stale label being applied)

Comments

markkofler commented Jun 25, 2024

Your current environment

Using the latest available Docker image: vllm/vllm-openai:v0.5.0.post1

🐛 Describe the bug

I am getting an "Internal Server Error" response when calling the /v1/embeddings endpoint of the Kubernetes-deployed Alibaba-NLP/gte-Qwen2-7B-instruct model. I am using the following JSON request body:

{
  "model": "/mnt/models/",
  "input": [
    "test"
  ],
  "user": "user"
}
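
For completeness, the request is sent roughly like this (a minimal sketch using the Python requests library; the URL below is a placeholder for our in-cluster service address):

import requests

# Placeholder for the Kubernetes service address; adjust to your deployment.
url = "http://localhost:8080/v1/embeddings"

payload = {
    "model": "/mnt/models/",
    "input": ["test"],
    "user": "user",
}

resp = requests.post(url, json=payload, timeout=60)
print(resp.status_code)  # 500 when the bug is hit
print(resp.text)         # body is just "Internal Server Error"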

For reference, here is the log of the vLLM container:

INFO 06-25 14:21:47 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-25 14:21:47 api_server.py:178] args: Namespace(host=None, port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/mnt/models/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir='/models-cache', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 06-25 14:21:47 config.py:1197] Casting torch.float32 to torch.float16.
INFO 06-25 14:21:47 config.py:1218] Downcasting torch.float32 to torch.float16.
INFO 06-25 14:21:47 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/mnt/models/', speculative_config=None, tokenizer='/mnt/models/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8000, download_dir='/models-cache', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-25 14:22:53 model_runner.py:160] Loading model weights took 14.2655 GB
INFO 06-25 14:22:54 gpu_executor.py:83] # GPU blocks: 27913, # CPU blocks: 4681
INFO 06-25 14:22:56 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-25 14:22:56 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-25 14:23:03 model_runner.py:965] Graph capturing finished in 7 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-25 14:23:03 serving_chat.py:92] Using default chat template:
INFO 06-25 14:23:03 serving_chat.py:92] {% for message in messages %}{{'<|im_start|>' + message['role'] + '
INFO 06-25 14:23:03 serving_chat.py:92] ' + message['content'] + '<|im_end|>' + '
INFO 06-25 14:23:03 serving_chat.py:92] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 06-25 14:23:03 serving_chat.py:92] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-25 14:23:04 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO 06-25 14:23:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "GET / HTTP/1.1" 404 Not Found
INFO:     127.0.0.6:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     127.0.0.6:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.6:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO 06-25 14:23:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:23:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:23:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:23:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:24:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:24:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:25:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:25:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:26:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 404 Not Found
INFO 06-25 14:26:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:27:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:27:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:27:21 async_llm_engine.py:564] Received request cmpl-9ccf6a32602244ee80e83071838e6693-0: prompt: '!', params: PoolingParams(additional_metadata=string), prompt_token_ids: [0], lora_request: None.
ERROR 06-25 14:27:21 async_llm_engine.py:52] Engine background task failed
ERROR 06-25 14:27:21 async_llm_engine.py:52] Traceback (most recent call last):
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return_value = task.result()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
ERROR 06-25 14:27:21 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return fut.result()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
ERROR 06-25 14:27:21 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-25 14:27:21 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 735, in execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]     ) = self.prepare_input_tensors(seq_group_metadata_list)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in prepare_input_tensors
ERROR 06-25 14:27:21 async_llm_engine.py:52]     sampling_metadata = SamplingMetadata.prepare(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 116, in prepare
ERROR 06-25 14:27:21 async_llm_engine.py:52]     ) = _prepare_seq_groups(seq_group_metadata_list, seq_lens, query_lens,
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 208, in _prepare_seq_groups
ERROR 06-25 14:27:21 async_llm_engine.py:52]     if sampling_params.seed is not None:
ERROR 06-25 14:27:21 async_llm_engine.py:52] AttributeError: 'NoneType' object has no attribute 'seed'
Exception in callback functools.partial(<function _log_task_completion at 0x7f9a51e90b80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9a3a136560>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f9a51e90b80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9a3a136560>>)>

It would be great if somebody could help me get the model running as an embedding model for our colleagues. Any idea what could be wrong?

Thanks in advance!

markkofler added the bug (Something isn't working) label on Jun 25, 2024
mgoin (Member) commented Jun 25, 2024

@markkofler vLLM doesn't have support for Qwen2 embedding models yet. I have a WIP PR here for this: #5611

From the logs I can see the model is not running as an embedding model: WARNING 06-25 14:23:04 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.

Looking at the config file for the model, it seems that it is registered to run as Qwen2ForCausalLM, which is not valid for embedding models: https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct/blob/a3b5d14542d49d8e202dcf1c382a692b1607cee5/config.json#L3

You can try changing this to be Qwen2Model, which is what vLLM expects for embedding models, and run it with my PR (which you would have to build from source).
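
For illustration, a minimal sketch of that config edit (assuming the model files live under /mnt/models/ as in the logs above; the path is an assumption, adjust it to wherever your local copy of the model is stored):

import json
from pathlib import Path

# Path assumed from the deployment logs above; point this at your local model copy.
config_path = Path("/mnt/models/config.json")

config = json.loads(config_path.read_text())
# Replace the registered architecture so vLLM can load the model as an embedding model.
config["architectures"] = ["Qwen2Model"]
config_path.write_text(json.dumps(config, indent=2))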

github-actions bot commented Oct 25, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale (Over 90 days of inactivity) label on Oct 25, 2024
markkofler (Author) commented Oct 26, 2024

@mgoin Thanks for the feedback! I would leave this issue open until the model is supported, as I see there are open pull requests adding that support.

github-actions bot added the unstale (Received activity after being labelled stale) label and removed the stale (Over 90 days of inactivity) label on Oct 26, 2024
hmellor (Member) commented Oct 26, 2024

Ok @markkofler I'll mark this as keep-open.

Could you link the PRs here?

hmellor added the keep-open (Prevents stale label being applied) label and removed the unstale (Received activity after being labelled stale) label on Oct 26, 2024
markkofler (Author) commented:

Thanks @hmellor! The PRs are already linked above.
