Your current environment

Docker image: vllm/vllm-openai:v0.5.0.post1

Running as part of a Docker Compose stack. Relevant sections of my docker-compose.yaml are below. This is part of a multi-model deployment with other vLLM-based text generation/chat models running successfully behind a Traefik reverse proxy. I split the instance running LLaVA 1.6 out into its own service in the docker-compose.yaml (it is the third service in the file) to test the different command arguments it requires on startup. I have included the .env file entries as well. VLLM_IMAGE_MODEL_ID points to a cloned Hugging Face directory from https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf, with template_llava.jinja added.
### docker-compose.yaml ###
services:
  reverseproxy:
    image: ${PROXY_IMAGE}
    container_name: reverseproxy
    # Enables the web UI and tells Traefik to listen to docker
    command: --api.insecure=true --providers.docker --api.dashboard=true
    ports:
      # The HTTP port
      - "80:80"
      # The Web UI (enabled by --api.insecure=true)
      - "8080:8080"
    volumes:
      # So that Traefik can listen to the Docker events
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - llm-net

  ## Current best solution for chat/text generation models
  ## Change GPU device_ids if necessary
  vllm-server:
    depends_on:
      - reverseproxy
    image: ${VLLM_IMAGE}
    container_name: vllm-server
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    volumes:
      - ${MODEL_VOL}/${VLLM_MODEL_ID}:/vllm-workspace/${VLLM_MODEL_ID}
    command: ["--model", "${VLLM_MODEL_ID}", "--gpu-memory-utilization", "0.75", "--host", "0.0.0.0", "--root-path", "/vllm-server"]
    labels:
      - traefik.enable=true
      - traefik.http.routers.vllm-server.rule=PathPrefix(`/vllm-server`)
      - traefik.http.routers.vllm-server.middlewares=vllm-server-stripprefix
      - traefik.http.middlewares.vllm-server-stripprefix.stripprefix.prefixes=/vllm-server
      - traefik.http.services.vllm-server.loadbalancer.server.port=8000
    networks:
      - llm-net
    # ports:
    #   - 8000:8000

  ## Testing llava serving with vllm
  ## Change GPU device_ids if necessary
  vllm-llava-server:
    depends_on:
      - reverseproxy
    image: ${VLLM_IMAGE}
    container_name: vllm-llava-server
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    volumes:
      - ${MODEL_VOL}/${VLLM_IMAGE_MODEL_ID}:/vllm-workspace/${VLLM_IMAGE_MODEL_ID}
    command: ["--model", "${VLLM_IMAGE_MODEL_ID}", "--gpu-memory-utilization", "0.75", "--host", "0.0.0.0", "--root-path", "/vllm-llava-server",
              "--image-input-type", "pixel_values", "--image-token-id", "32000", "--image-input-shape", "1,3,336,336", "--image-feature-size", "576",
              "--chat-template", "template_llava.jinja"]
    labels:
      - traefik.enable=true
      - traefik.http.routers.vllm-llava-server.rule=PathPrefix(`/vllm-llava-server`)
      - traefik.http.routers.vllm-llava-server.middlewares=vllm-llava-server-stripprefix
      - traefik.http.middlewares.vllm-llava-server-stripprefix.stripprefix.prefixes=/vllm-llava-server
      - traefik.http.services.vllm-llava-server.loadbalancer.server.port=8000
    networks:
      - llm-net
    # ports:
    #   - 8000:8000
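For reference, here is a minimal sketch of the kind of entries .env.llava contains, using the image tag noted above and otherwise placeholder values (the real entries are not reproduced here):

### .env.llava (illustrative sketch; values other than VLLM_IMAGE are placeholders) ###
# Traefik image used by the reverseproxy service (version is a placeholder)
PROXY_IMAGE=traefik:v2.11
# vLLM OpenAI-compatible server image
VLLM_IMAGE=vllm/vllm-openai:v0.5.0.post1
# Host directory holding the cloned Hugging Face model repos (placeholder path)
MODEL_VOL=/opt/models
# Text/chat model directory served by vllm-server (placeholder name)
VLLM_MODEL_ID=<chat-model-dir>
# LLaVA 1.6 model directory served by vllm-llava-server
VLLM_IMAGE_MODEL_ID=llava-v1.6-mistral-7b-hf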
🐛 Describe the bug

On starting the service with docker compose --env-file .env.llava up reverseproxy vllm-llava-server, it appears to do the usual startup but then throws a ValueError; the full text and STDOUT are below. I have included all of the startup values that appear to be required when instantiating a new LLM object, per https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py. Am I missing something in my command entry in the docker-compose.yaml?
vllm-llava-server | INFO 06-26 18:28:25 api_server.py:177] vLLM API server version 0.5.0.post1
vllm-llava-server | INFO 06-26 18:28:25 api_server.py:178] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template='template_llava.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path='/vllm-llava-server', middleware=[], model='llava-v1.6-mistral-7b-hf', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.75, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type='pixel_values', image_token_id=32000, image_input_shape='1,3,336,336', image_feature_size=576, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
vllm-llava-server | INFO 06-26 18:28:25 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='llava-v1.6-mistral-7b-hf', speculative_config=None, tokenizer='llava-v1.6-mistral-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=llava-v1.6-mistral-7b-hf)
vllm-llava-server | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vllm-llava-server | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
vllm-llava-server | INFO 06-26 18:29:15 model_runner.py:160] Loading model weights took 14.1020 GB
vllm-llava-server | [rank0]: Traceback (most recent call last):
vllm-llava-server | [rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
vllm-llava-server | [rank0]: return _run_code(code, main_globals, None,
vllm-llava-server | [rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
vllm-llava-server | [rank0]: exec(code, run_globals)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
vllm-llava-server | [rank0]: engine = AsyncLLMEngine.from_engine_args(
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
vllm-llava-server | [rank0]: engine = cls(
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
vllm-llava-server | [rank0]: self.engine = self._init_engine(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
vllm-llava-server | [rank0]: return engine_class(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236, in __init__
vllm-llava-server | [rank0]: self._initialize_kv_caches()
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
vllm-llava-server | [rank0]: self.model_executor.determine_num_available_blocks())
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
vllm-llava-server | [rank0]: return self.driver_worker.determine_num_available_blocks()
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server | [rank0]: return func(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
vllm-llava-server | [rank0]: self.model_runner.profile_run()
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server | [rank0]: return func(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844, in profile_run
vllm-llava-server | [rank0]: self.execute_model(seqs, kv_caches)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
vllm-llava-server | [rank0]: return func(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
vllm-llava-server | [rank0]: hidden_states = model_executable(
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
vllm-llava-server | [rank0]: return self._call_impl(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
vllm-llava-server | [rank0]: return forward_call(*args, **kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava_next.py", line 383, in forward
vllm-llava-server | [rank0]: image_input = self._parse_and_validate_image_input(**kwargs)
vllm-llava-server | [rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava_next.py", line 196, in _parse_and_validate_image_input
vllm-llava-server | [rank0]: raise ValueError("Incorrect type of image sizes. "
vllm-llava-server | [rank0]: ValueError: Incorrect type of image sizes. Got type: <class 'NoneType'>
vllm-llava-server exited with code 0
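For context, the Traefik labels above route requests and strip the /vllm-llava-server prefix, so once the server starts a request would go through the proxy roughly like this (a sketch assuming this vLLM version accepts OpenAI Vision-style image_url content parts; the image URL is a placeholder):

# Hypothetical request through the Traefik route once the server is up
curl http://localhost/vllm-llava-server/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llava-v1.6-mistral-7b-hf",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
          ]
        }]
      }'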
Sure, do you have a recommended way to build the container? Just the usual clone and Docker build on the branch, or does your team have any build magic happening that I need to know about? Right now I'm just pulling straight from Docker Hub.
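By "the usual clone and Docker build" I mean roughly the following, a sketch assuming the Dockerfile at the root of the vLLM repo and a placeholder branch name:

# Clone the repo, check out the branch in question, and build the OpenAI server image
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout <branch-to-test>
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai:dev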