
[Bug]: Internal Server Error when hosting Alibaba-NLP/gte-Qwen2-7B-instruct #5827

Closed
markkofler opened this issue Jun 25, 2024 · 5 comments · Fixed by #10184
Labels: bug (Something isn't working), keep-open (Prevents stale label being applied)

Comments

markkofler commented Jun 25, 2024

Your current environment

Using the latest available Docker image: vllm/vllm-openai:v0.5.0.post1

🐛 Describe the bug

I am getting an "Internal Server Error" response when calling the /v1/embeddings endpoint of the Kubernetes-deployed Alibaba-NLP/gte-Qwen2-7B-instruct model. I am using the following JSON request body:

{
  "model": "/mnt/models/",
  "input": [
    "test"
  ],
  "user": "user"
}
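
For completeness, the request is sent roughly like this (a minimal sketch using the Python requests library; the URL below is a placeholder for our in-cluster service address):

import requests

# Placeholder for the Kubernetes service address; adjust to your deployment.
url = "http://localhost:8080/v1/embeddings"

payload = {
    "model": "/mnt/models/",
    "input": ["test"],
    "user": "user",
}

resp = requests.post(url, json=payload, timeout=60)
print(resp.status_code)  # 500 when the bug is hit
print(resp.text)         # body is just "Internal Server Error"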

For reference, here is the log of the vLLM container:

INFO 06-25 14:21:47 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-25 14:21:47 api_server.py:178] args: Namespace(host=None, port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/mnt/models/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir='/models-cache', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 06-25 14:21:47 config.py:1197] Casting torch.float32 to torch.float16.
INFO 06-25 14:21:47 config.py:1218] Downcasting torch.float32 to torch.float16.
INFO 06-25 14:21:47 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/mnt/models/', speculative_config=None, tokenizer='/mnt/models/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8000, download_dir='/models-cache', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-25 14:22:53 model_runner.py:160] Loading model weights took 14.2655 GB
INFO 06-25 14:22:54 gpu_executor.py:83] # GPU blocks: 27913, # CPU blocks: 4681
INFO 06-25 14:22:56 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-25 14:22:56 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-25 14:23:03 model_runner.py:965] Graph capturing finished in 7 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-25 14:23:03 serving_chat.py:92] Using default chat template:
INFO 06-25 14:23:03 serving_chat.py:92] {% for message in messages %}{{'<|im_start|>' + message['role'] + '
INFO 06-25 14:23:03 serving_chat.py:92] ' + message['content'] + '<|im_end|>' + '
INFO 06-25 14:23:03 serving_chat.py:92] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 06-25 14:23:03 serving_chat.py:92] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-25 14:23:04 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO 06-25 14:23:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "GET / HTTP/1.1" 404 Not Found
INFO:     127.0.0.6:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     127.0.0.6:0 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.6:0 - "GET /openapi.json HTTP/1.1" 200 OK
INFO 06-25 14:23:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:23:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:23:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:23:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:24:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:24:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:24:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:25:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:25:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:25:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:24 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:26:34 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:26:44 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 404 Not Found
INFO 06-25 14:26:54 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:27:04 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.6:0 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
INFO 06-25 14:27:14 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-25 14:27:21 async_llm_engine.py:564] Received request cmpl-9ccf6a32602244ee80e83071838e6693-0: prompt: '!', params: PoolingParams(additional_metadata=string), prompt_token_ids: [0], lora_request: None.
ERROR 06-25 14:27:21 async_llm_engine.py:52] Engine background task failed
ERROR 06-25 14:27:21 async_llm_engine.py:52] Traceback (most recent call last):
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return_value = task.result()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
ERROR 06-25 14:27:21 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return fut.result()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
ERROR 06-25 14:27:21 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-25 14:27:21 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-25 14:27:21 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 735, in execute_model
ERROR 06-25 14:27:21 async_llm_engine.py:52]     ) = self.prepare_input_tensors(seq_group_metadata_list)
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in prepare_input_tensors
ERROR 06-25 14:27:21 async_llm_engine.py:52]     sampling_metadata = SamplingMetadata.prepare(
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 116, in prepare
ERROR 06-25 14:27:21 async_llm_engine.py:52]     ) = _prepare_seq_groups(seq_group_metadata_list, seq_lens, query_lens,
ERROR 06-25 14:27:21 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 208, in _prepare_seq_groups
ERROR 06-25 14:27:21 async_llm_engine.py:52]     if sampling_params.seed is not None:
ERROR 06-25 14:27:21 async_llm_engine.py:52] AttributeError: 'NoneType' object has no attribute 'seed'
Exception in callback functools.partial(<function _log_task_completion at 0x7f9a51e90b80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9a3a136560>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f9a51e90b80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9a3a136560>>)>

It would be great if somebody could help me get the model running as an embedding model for our colleagues. Any idea what could be wrong?

Thanks in advance!

markkofler added the bug (Something isn't working) label on Jun 25, 2024
mgoin (Member) commented Jun 25, 2024

@markkofler vLLM doesn't have support for Qwen2 embedding models yet. I have a WIP PR here for this: #5611

From the logs I can see the model is not running as an embedding model: WARNING 06-25 14:23:04 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.

Looking at the config file for the model, it seems that it is registered to run as Qwen2ForCausalLM, which is not valid for embedding models: https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct/blob/a3b5d14542d49d8e202dcf1c382a692b1607cee5/config.json#L3

You can try changing this to be Qwen2Model, which is what vLLM expects for embedding models, and run it with my PR (which you would have to build from source).
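
For illustration, a minimal sketch of that config edit (assuming the model files live under /mnt/models/ as in the logs above; the path is an assumption, adjust it to wherever your local copy of the model is stored):

import json
from pathlib import Path

# Path assumed from the deployment logs above; point this at your local model copy.
config_path = Path("/mnt/models/config.json")

config = json.loads(config_path.read_text())
# Replace the registered architecture so vLLM can load the model as an embedding model.
config["architectures"] = ["Qwen2Model"]
config_path.write_text(json.dumps(config, indent=2))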

github-actions bot commented Oct 25, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale (Over 90 days of inactivity) label on Oct 25, 2024
markkofler (Author) commented Oct 26, 2024

@mgoin Thanks for the feedback! I would leave this issue open until the model is supported, as I see there are open pull requests adding that support.

github-actions bot added the unstale (Received activity after being labelled stale) label and removed the stale (Over 90 days of inactivity) label on Oct 26, 2024
hmellor (Member) commented Oct 26, 2024

Ok @markkofler I'll mark this as keep-open.

Could you link the PRs here?

hmellor added the keep-open (Prevents stale label being applied) label and removed the unstale (Received activity after being labelled stale) label on Oct 26, 2024
markkofler (Author) commented:

Thanks @hmellor! The PRs are already linked above.
