[V1] Feedback Thread #12568
Comments
👍 I have not done a proper benchmark, but V1 feels superior, i.e. higher throughput plus lower latency and TTFT. I have encountered a possible higher-memory-consumption issue, but I am overall very pleased with the vLLM community's hard work on V1. |
Does anyone know about this bug with n>1? Thanks |
Logging is in progress. Current main has a lot more and we will maintain compatibility with V0. Thanks! |
Quick feedback [VLLM_USE_V1=1]:
|
Thanks, both are in progress |
Are logprobs outputs (and specifically prompt logprobs with echo=True) expected to be working with the current V1 (0.7.0)? |
Maybe there is a better place to discuss this, but the implementation for models that use more than one extra modality is quite non-intuitive. |
Still in progress |
Thanks for fixing metrics logs in 0.7.1! |
Maybe I'm going insane, but with V1, Qwen 8B Instruct just breaks in fp8 and around 25% of generations are pure gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour that needs some specific sampling-params setup to work in V1? |
The V1 engine doesn't seem to support logits processors or min-p filtering. Issue #12678 |
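For reference, a minimal sketch of how both features are used on the V0 engine via SamplingParams (the model name and the banned token id are arbitrary examples, not from the report above):

```python
# Sketch: min-p plus a custom logits processor on the V0 engine
# (both reported as unsupported on V1 above).
import torch
from vllm import LLM, SamplingParams

def ban_token_42(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Example processor: mask out an arbitrary token id at every step.
    logits[42] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")  # example model
params = SamplingParams(
    temperature=0.8,
    min_p=0.05,                        # min-p filtering
    logits_processors=[ban_token_42],  # per-step logits hook
)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```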
Something is weird with memory calculation in V1 and tensor parallel. Here are two cases I tested recently. vLLM 0.7.0 on 2x A6000: starting a 32b-awq model normally, everything works as previously and both GPUs get to ~44-46 GB usage; with V1, both GPUs load only ~24-25 GB and it slowly goes up as inference runs (I've seen it go up to 32 GB on each GPU). Updating to vLLM 0.7.1 and running a 7b-awq model this time, I also noticed that running the above command "normally" the logs show Maximum concurrency at 44x; using V1 I get:
And finally, with vLLM 0.7.0 and 4x L4, loading a 32b-awq model with tp 4 works in "normal mode" but OOMs with V1. |
I did a little experiment with DeepSeek-R1 on 8x H200 GPUs. vLLM 0.7.0 showed the following results with
In general, vLLM without VLLM_USE_V1 looked more performant. I also tried V0 with
Throughput was still 2x lower than SGLang in the same benchmark. Today I updated vLLM to the new version (0.7.1) and decided to repeat the experiment, and the results in V0 have become much better!
But running vLLM with
|
V1 does not support the T4; do you plan to support it? |
Hi @bao231, V1 does not support T4 or older-generation GPUs since the kernel libraries used in V1 (e.g., flash-attn) do not support them. |
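A small sketch of how one might fall back to V0 on pre-Ampere cards such as the T4 (the compute-capability threshold assumes flash-attn's Ampere requirement; the model is a placeholder):

```python
# Sketch: detect pre-Ampere GPUs (e.g. T4, SM 7.5) and stay on the V0 engine,
# since V1's flash-attn kernels need Ampere (SM 8.0) or newer.
import os
import torch

major, _minor = torch.cuda.get_device_capability(0)
if major < 8:
    os.environ["VLLM_USE_V1"] = "0"  # must be set before vllm is imported

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model
```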
Does V1 support other attention libraries? Do you have a plan for that? @WoosukKwon |
Thanks!
|
Can you provide a more detailed reproduction instruction? cc @WoosukKwon |
Thanks. We are actively working on PP |
Check out #sig-multi-modality in our slack! This is the best place for a discussion like this |
It's pretty hard to follow what you are seeing. Please attach:
Thanks! |
Hi, please see Launch command
|
I ran the following code after upgrading to the V1 version of vLLM and encountered an error. However, if --tensor_parallel_size is set to 1, it works fine. Is there a compatibility issue between V1 and multi-GPU (tensor-parallel) deployment? |
With dual RTX 3090s in V1: CUDA out of memory. Tried to allocate 594.00 MiB. GPU 0 has a total capacity of 23.48 GiB of which 587.38 MiB is free. Including non-PyTorch memory, this process has 22.89 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 815.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. With V0 it works; something changed about memory in V1. |
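A hedged workaround sketch for the report above: the allocator hint comes straight from the OOM message, and the lower gpu_memory_utilization value is only an example, not a recommended setting:

```python
# Sketch: reduce fragmentation and leave more headroom when V1 OOMs on 24 GB cards.
# These are workarounds, not a fix for whatever changed in V1's memory accounting.
import os

# Suggested by the CUDA OOM message itself; must be set before CUDA memory is allocated.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model, not the reporter's
    tensor_parallel_size=2,            # dual RTX 3090 as in the report
    gpu_memory_utilization=0.85,       # example: leave more headroom than the 0.9 default
)
```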
Will V1 support flashinfer in the future? |
Does V1 support FP8 (W8A8) quantization? I tried nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic on v0.7.1 V1 arch; no error was thrown, but I got gibberish results. The same code and model work properly on v0.7.1 V0 arch. UPDATE: it works on v0.7.1 V1 arch in eager mode, but is broken on v0.7.1 V1 arch in torch.compile mode. I'm figuring out whether this problem is model-dependent or not. UPDATE: I tried another model, nm-testing/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic, and the same bug is present on v0.7.1 V1 arch in torch.compile mode. UPDATE: it works after I turned custom_ops on (change Lines 3237 to 3249 in 3ee696a
|
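Based on the comment above (works in eager mode, gibberish under torch.compile), a minimal way to compare the two paths is toggling enforce_eager; the prompt is arbitrary:

```python
# Sketch: run the FP8 model from the report above in eager mode (reported working)
# vs. the default compiled mode (where gibberish was observed).
from vllm import LLM, SamplingParams

USE_EAGER = True  # flip to False to exercise the torch.compile / CUDA-graph path

llm = LLM(
    model="nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic",
    enforce_eager=USE_EAGER,
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Describe the sky in one sentence."], params)
print(out[0].outputs[0].text)
```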
When I tested the fine-tuned Qwen2.5_VL_3B model service in V1 mode (by setting the environment variable VLLM_USE_V1=1) and in the default mode, both through the OpenAI-compatible server, I found inconsistencies in the output results. I tested two samples. I conducted the same comparative experiment on Qwen2VL, and both V1 and default modes produced correct outputs. Has anyone else encountered a similar issue? If so, could this indicate a compatibility issue between V1 mode and Qwen2.5_VL_3B? |
cc @ywang96 |
@lyhh123 can you open a separate issue for this and share some examples? There are multiple layers, so I want to take a look at where the issue might be.
This is also an interesting observation since the V1 re-arch for multimodal models should be model-agnostic, so I'm curious to see where the problem comes from. |
Thank you for paying attention to my issue. Two days ago I encountered this problem during testing. Over the past two days I have made a series of attempts to adjust the sampling parameters, mainly modifying top_p or other parameters to keep the output as stable as possible. I have now re-tested using the V1/default modes and the Qwen2.5-VL-3B model, and apart from content related to coordinates the outputs have remained largely consistent. I attempted to adjust the parameters but was unable to reproduce the issue from two days ago. I still remember that, with fixed parameters at that time, there were unexpected differences across multiple outputs between the V1 and default modes; however, I cannot rule out other potential variables affecting the results at that time. I will do my best to identify the root cause of the issue, and if I make any relevant discoveries I will update you promptly. |
@robertgshaw2-redhat Hi, can we now use V1 to get higher token-generation throughput than V0 on DeepSeek-R1? |
@imkero Is the bug fixed now (without the change you suggested)? I wasn't able to reproduce the bug with the latest main. |
I have made a mistake. I found that it's |
Is it possible to update Command R with V1 support? |
Hello, in our production environment we run Qwen2.5 with a tokenizer adapted for our country's language. We are currently scaling across the entire company and trying to analyze the performance of vLLM/SGLang. Unfortunately, V1 does not support Qwen 2.5. Moreover, our production environment actively uses structured JSON output and speculative decoding, and we really need these features together. |
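For the structured-JSON part, a hedged sketch against the OpenAI-compatible server using guided_json (server URL, model name, and schema are placeholders; this relies on the V0 guided-decoding path):

```python
# Sketch: structured JSON output via the OpenAI-compatible server's guided_json option.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "confidence": {"type": "number"}},
    "required": ["answer", "confidence"],
}

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder served model name
    messages=[{"role": "user", "content": "Answer in JSON: what is the capital of France?"}],
    extra_body={"guided_json": schema},  # constrains the output to the schema
)
print(resp.choices[0].message.content)
```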
Will the changes made in vllm-project/flash-attention be eventually sync'd with Dao-AILab/flash-attention? |
On ce77eb9, in my case the CUDA graph recorded through torch.compile using
For example, launching vLLM with

lm_eval \
  --model local-completions \
  --model_args model=meta-llama/Llama-2-7b-chat-hf,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=9999999,timeout=3600,tokenized_requests=False \
  --tasks gsm8k \
  --limit 32 \
  --gen_kwargs "do_sample=False,top_p=1,temperature=1"

works well (even though later on some sequences in the batch hit EOS and we'd get, e.g., to 31 tokens that will get padded). However, using:

lm_eval \
  --model local-completions \
  --model_args model=meta-llama/Llama-2-7b-chat-hf,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=9999999,timeout=3600,tokenized_requests=False \
  --tasks gsm8k \
  --limit 30 \
  --gen_kwargs "do_sample=False,top_p=1,temperature=1"

which first hits a CUDA graph with some padding (precisely 2 padding tokens), gets stuck as soon as one tries to sync the model outputs from the device (e.g. printing the
vllm/vllm/v1/worker/gpu_model_runner.py Line 948 in ce77eb9

GPU utilization is shown in rocm-smi. Hitting the issue both on
It might only be a ROCm issue though. |
@SageMoore FYI |
I use 2 * 8 A100 (40G) and it starts successfully. For me, the key is --gpu-memory-utilization 0.8 --quantization moe_wna16. Speed is 18.6~18.8 TPS; note my cards are without NVLink. |
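A hedged sketch of the equivalent offline-API settings (the checkpoint path is a placeholder for an AWQ/GPTQ-quantized MoE model; the parallel layout mirrors the 2 x 8 setup and assumes a Ray cluster for the second node):

```python
# Sketch: the key flags from the comment above, expressed via the offline LLM API.
from vllm import LLM

llm = LLM(
    model="/path/to/awq-quantized-moe-checkpoint",  # placeholder; moe_wna16 targets AWQ/GPTQ MoE weights
    quantization="moe_wna16",     # key flag per the comment above
    gpu_memory_utilization=0.8,   # key flag per the comment above
    tensor_parallel_size=8,
    pipeline_parallel_size=2,     # 2 nodes x 8 GPUs; multi-node needs a Ray cluster
)
```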
root@ecm-bb90:/var/model# pip show vllm
root@ecm-bb90:/var/model# pip show transformers

I use:

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8212 --tensor-parallel-size 4 --model /var/model/Qwen2.5-VL-72B-Instruct --gpu-memory-utilization 0.9 --max-model-len 8192 --served-model-name "Qwen2.5-VL-72B-Instruct"

root@ecm-bb90:/var/model# tail -f vllmoutput2.log

What should I do? |
I am trying to run inference on Qwen2.5-VL-72B for video processing using 4xA800 GPUs. However, I encountered errors when executing the code with VLLM V1, whereas it works correctly with VLLM V0 by setting VLLM_USE_V1=0. llm = LLM(
model=MODEL_PATH,
limit_mm_per_prompt={"image": 10, "video": 10},
tensor_parallel_size=4,
gpu_memory_utilization=0.7
)
sampling_params = SamplingParams(
temperature=0.1,
top_p=0.001,
repetition_penalty=1.05,
max_tokens=256,
stop_token_ids=[],
)
question = ''
messages = [
{"role": "system", "content": "You are a good video analyst"},
{
"role": "user",
"content": [
{
"type": "video",
"video": file,
},
{"type": "text", "text": question},
],
}
]
prompt = self.processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
mm_data = {}
if image_inputs is not None:
mm_data["image"] = image_inputs
if video_inputs is not None:
mm_data["video"] = video_inputs
llm_inputs = {
"prompt": prompt,
"multi_modal_data": mm_data,
# FPS will be returned in video_kwargs
#"mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
|
Cannot work with Qwen 2 #13284 |
CUDA error when using |
Somehow GGUF isn't working great: it will crash the V1 engine with weird CUDA errors or OOM errors, but no such errors are present with V0. |
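For reference, a hedged sketch of loading a GGUF checkpoint offline, with the V0 fallback mentioned here (file path and tokenizer are placeholders):

```python
# Sketch: load a single-file GGUF checkpoint and stay on V0 as a workaround
# for the V1 crashes reported above. Paths and names are placeholders.
import os

os.environ["VLLM_USE_V1"] = "0"  # must be set before vllm is imported

from vllm import LLM

llm = LLM(
    model="/models/llama-3.1-8b-instruct-Q4_K_M.gguf",  # path to the GGUF file
    tokenizer="meta-llama/Llama-3.1-8B-Instruct",       # GGUF loading needs the original tokenizer
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```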
On the 2 * 8 H20 machines, I start Docker with the following compose file.

Node1

services:
v3-1-vllm:
container_name: v3-1-vllm
image: vllm/vllm-openai:latest
privileged: true
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
shm_size: "1024g"
ipc: "host"
network_mode: "host"
volumes:
- /data/deepseek-v3:/root/.cache/huggingface
- /data/torchcache-v3:/root/torchcache
environment:
- VLLM_HOST_IP=10.0.251.33
- GLOO_SOCKET_IFNAME=ens12f0np0
- NCCL_SOCKET_IFNAME=ibs1
- NCCL_IB_ALLOW=1
- NCCL_IB_DISABLE=0
- NCCL_IB_CUDA_SUPPORT=1
- NCCL_IB_HCA=ibp1
- NCCL_IB_RETRY_CNT=13
- NCCL_IB_GID_INDEX=3
- NCCL_NET_GDR_LEVEL=2
- NCCL_IB_TIMEOUT=22
- NCCL_DEBUG=INFO
- NCCL_P2P_LEVEL=NVL
- NCCL_CROSS_NIC=1
- NCCL_NET_GDR_LEVEL=SYS
entrypoint:
- /bin/bash
- -c
- |
(nohup ray start --disable-usage-stats --block --head --port=6379 > /init.log 2>&1 &)
sleep 10 && python3 -m vllm.entrypoints.openai.api_server \
--served-model-name "deepseek/deepseek-v3" \
--model /root/.cache/huggingface \
--host 0.0.0.0 \
--port 30000 \
--enable-prefix-caching \
--enable-chunked-prefill \
--pipeline-parallel-size 2 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--max-model-len 64128 \
--max-num-batched-tokens 8192 \
--scheduling-policy priority \
--trust-remote-code \
--max-num-seqs 12 \
--swap-space 16 \
--block_size 32 \
--disable_log_requests
    restart: always

Node2
But I have a problem
How should I specify the model name and model path? |
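For what it's worth, in the compose file above --model points at the weights directory mounted inside the container, while --served-model-name is the alias clients pass in requests; a minimal client-side sketch using the values from that file:

```python
# Sketch: query the server defined by the compose file above.
# "deepseek/deepseek-v3" is the --served-model-name; the weights live in the
# directory passed to --model inside the container.
from openai import OpenAI

client = OpenAI(base_url="http://10.0.251.33:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek/deepseek-v3",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```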
Please leave comments here about your usage of V1: does it work? Does it not work? Which features do you need in order to adopt it? Any bugs?
For bug reports, please file them separately and link the issues here.
For in-depth discussion, please feel free to join #sig-v1 in the vLLM Slack workspace.
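For anyone trying it out, a minimal sketch of opting into V1 (the environment variable must be set before vLLM is imported; the model is a placeholder):

```python
# Sketch: enable the V1 engine for a quick smoke test.
import os

os.environ["VLLM_USE_V1"] = "1"  # must be set before vllm is imported

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=32)
for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```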