
[V1] Feedback Thread #12568

Open
simon-mo opened this issue Jan 30, 2025 · 48 comments

@simon-mo
Collaborator

simon-mo commented Jan 30, 2025

Please leave comments here about your usage of V1, does it work? does it not work? which feature do you need in order to adopt it? any bugs?

For bug report, please file it separately and link the issue here.

For in depth discussion, please feel free to join #sig-v1 in the vLLM Slack workspace.

@simon-mo simon-mo added the misc label Jan 30, 2025
@simon-mo simon-mo changed the title [V1] Feedback Threads [V1] Feedback Thread Jan 30, 2025
@simon-mo simon-mo removed the misc label Jan 30, 2025
@simon-mo simon-mo pinned this issue Jan 30, 2025
@wedobetter

wedobetter commented Jan 30, 2025

👍 I have not done a proper benchmark, but V1 feels superior, i.e. higher throughput plus lower latency and TTFT.
The other thing I have noticed is that the logging has changed to Running: 1 reqs, Waiting: 0 reqs; it used to print stats such as tokens/s.

I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1.
#12529

@m-harmonic

Does anyone know about this bug with n>1? Thanks
#12584

@robertgshaw2-redhat
Collaborator

Does anyone know about this bug with n>1? Thanks #12584

Thanks, we are aware and have some ongoing PRs for it.

#10980

@robertgshaw2-redhat
Collaborator

I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1.

Logging is in progress. Current main has a lot more and we will maintain compatibility with V0. Thanks!

@dchichkov

Quick feedback [VLLM_USE_V1=1]:

  • n > 1 would be nice

  • guided_grammar (or anything guided really) would be nice
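
For illustration, a minimal sketch of what these two requested features look like when called against vLLM's OpenAI-compatible server; the server URL and model name are placeholders, and guided_choice stands in for "anything guided" (guided_grammar being the specific ask above):

# Illustrative sketch only: n > 1 and guided decoding via the OpenAI-compatible API.
# Assumes a vLLM server already running on localhost:8000 serving "my-model".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# n > 1: several candidate completions per prompt.
multi = client.completions.create(
    model="my-model",
    prompt="Write a haiku about GPUs.",
    n=2,
    max_tokens=64,
)

# Guided decoding: vLLM's server accepts guided_* extra parameters
# (guided_grammar is the variant requested above; guided_choice shown here).
guided = client.completions.create(
    model="my-model",
    prompt="Answer yes or no: is 7 prime?",
    max_tokens=8,
    extra_body={"guided_choice": ["yes", "no"]},
)
print([c.text for c in multi.choices], guided.choices[0].text)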

@robertgshaw2-redhat
Collaborator

Quick feedback [VLLM_USE_V1=1]:

  • n > 1 would be nice
  • guided_grammar (or anything guided really) would be nice

Thanks, both are in progress

@hibukipanim

are logprobs output (and specifically prompt logprobs with echo=True) expected to be working with current V1 (0.7.0)?
checking here before opening an issue to reproduce
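
For reference, the kind of request being asked about, sketched against the OpenAI-compatible completions endpoint (server URL and model name are placeholders; whether V1 returns these fields is exactly the open question here):

# Sketch of a prompt-logprobs request with echo=True (placeholder model/server).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="my-model",
    prompt="The capital of France is",
    max_tokens=1,
    echo=True,    # return the prompt itself in the completion
    logprobs=1,   # request logprobs, including for the echoed prompt tokens
)
print(resp.choices[0].logprobs)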

@akshay-loci

Maybe there is a better place to discuss this, but the implementation for models that use more than one extra modality is quite non-intuitive. get_multimodal_embeddings() expects us to return a list or tensor whose length equals the number of multimodal items provided in the batch, and we then have to make unintuitive assumptions about what the output passed into get_input_embeddings will look like, because the batching used when calling the two functions is not the same. It would be much nicer if, for example, the input and output of get_multimodal_embeddings were dicts keyed by modality.
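
A hypothetical sketch of the interface change being suggested; the method names mirror the existing get_multimodal_embeddings / get_input_embeddings hooks, but the dict-keyed signatures below are the commenter's proposal, not vLLM's actual API:

# Hypothetical, illustrative interface only: embeddings keyed by modality, so
# callers never have to infer how items of different modalities were ordered
# inside a flat list.
from typing import Mapping, Optional

import torch


class MultiModalModelSketch:
    def get_multimodal_embeddings(self, **kwargs) -> Mapping[str, torch.Tensor]:
        # e.g. {"image": (num_images, hidden_size), "video": (num_videos, hidden_size)}
        raise NotImplementedError

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[Mapping[str, torch.Tensor]] = None,
    ) -> torch.Tensor:
        # merge the per-modality embeddings into the text embedding sequence
        raise NotImplementedError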

@robertgshaw2-redhat
Collaborator

are logprobs output (and specifically prompt logprobs with echo=True) expected to be working with current V1 (0.7.0)? checking here before opening an issue to reproduce

Still in progress

@wedobetter

👍 I have not done a proper benchmark but V1 feels superior, i.e. higher throughput + lower latency, TTFT. The other thing that I have noticed is that logging has changed Running: 1 reqs, Waiting: 0 reqs, it used to print stats such token/s.

I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1. #12529

Thanks for fixing metrics logs in 0.7.1!
Lack of pipeline parallelism in V1 is a show stopper for production deployments #11945

@Ouna-the-Dataweaver

Either I'm going insane, or with V1 the Qwen 8B instruct LLM just breaks in fp8: around 25% of generations are just gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and I need some specific setup of sampling params for it to work in V1?

@FrederickVu

The V1 engine doesn't seem to support logits processors or min-p filtering. Issue #12678

@gmonair

gmonair commented Feb 3, 2025

Something is weird with memory calculation in V1 and tensor parallel. Here are 2 cases that I tested recently:

vllm 0.7.0 on 2x A6000:

Starting normally a 32b-awq model and using --max-model-len 32768 --gpu-memory-utilization 0.98 --tensor-parallel 2 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768

Everything works as previously, GPUs both get to ~44-46GB usage

Using VLLM_USE_V1=1 and the exact same parameters as above:

GPUs both load up to ~24-25GB and it slowly goes up as inference runs. I've seen it go up to 32GB on each GPU.

After updating to vllm 0.7.1 and running a 7b-awq model this time, I also noticed that when running the above command "normally", the logs show Maximum concurrency at 44x

Using V1 I get:

INFO 02-02 23:26:19 kv_cache_utils.py:400] Maximum concurrency for 32768 tokens per request: **22.25x**

And finally, with vllm 0.7.0 and 4x L4 loading a 32b-awq model with tp 4 works in "normal mode", but OOMs with V1.

@Xarbirus

Xarbirus commented Feb 3, 2025

I did a little experiment with DeepSeek-R1 on 8xH200 GPU.

vLLM 0.7.0 showed the following results with benchmark_serving.py --backend openai --base-url http://0.0.0.0:8000 --dataset-name=random --model deepseek-ai/DeepSeek-R1

  • with VLLM_USE_V1=1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [07:53<00:00,  2.11it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  473.62    
Total input tokens:                      1024000   
Total generated tokens:                  119550    
Request throughput (req/s):              2.11      
Output token throughput (tok/s):         252.42    
Total Token throughput (tok/s):          2414.51   
---------------Time to First Token----------------
Mean TTFT (ms):                          100636.33 
Median TTFT (ms):                        103588.53 
P99 TTFT (ms):                           197277.97 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          177.82    
Median TPOT (ms):                        172.14    
P99 TPOT (ms):                           363.05    
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.08    
Median ITL (ms):                         136.46    
P99 ITL (ms):                            575.30    
==================================================
  • without VLLM_USE_V1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [05:24<00:00,  3.08it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  324.29    
Total input tokens:                      1024000   
Total generated tokens:                  119163    
Request throughput (req/s):              3.08      
Output token throughput (tok/s):         367.46    
Total Token throughput (tok/s):          3525.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          29022.37  
Median TTFT (ms):                        32492.50  
P99 TTFT (ms):                           54457.59  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          125.16    
Median TPOT (ms):                        119.91    
P99 TPOT (ms):                           411.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           120.20    
Median ITL (ms):                         76.78     
P99 ITL (ms):                            656.11    
==================================================

In general, vLLM without VLLM_USE_V1 looked more performant. I also tried V0 with --request-rate 10 and got

Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [05:16<00:00,  3.16it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  316.20    
Total input tokens:                      1024000   
Total generated tokens:                  119448    
Request throughput (req/s):              3.16      
Output token throughput (tok/s):         377.76    
Total Token throughput (tok/s):          3616.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          100122.09 
Median TTFT (ms):                        98699.05  
P99 TTFT (ms):                           201732.11 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          139.61    
Median TPOT (ms):                        104.30    
P99 TPOT (ms):                           1276.91   
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.90    
Median ITL (ms):                         76.35     
P99 ITL (ms):                            648.36    
==================================================

Throughput was still 2 times lower than SGLang in the same benchmark. Today I updated vLLM to the new version (0.7.1) and decided to repeat the experiment, and the V0 results became much better!

  • without VLLM_USE_V1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [04:29<00:00,  3.71it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  269.74    
Total input tokens:                      1024000   
Total generated tokens:                  119805    
Request throughput (req/s):              3.71      
Output token throughput (tok/s):         444.14    
Total Token throughput (tok/s):          4240.35   
---------------Time to First Token----------------
Mean TTFT (ms):                          368.78    
Median TTFT (ms):                        269.07    
P99 TTFT (ms):                           3826.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          124.95    
Median TPOT (ms):                        122.03    
P99 TPOT (ms):                           214.93    
---------------Inter-token Latency----------------
Mean ITL (ms):                           123.32    
Median ITL (ms):                         75.30     
P99 ITL (ms):                            583.77    
==================================================
  • without VLLM_USE_V1 (with --request-rate 10)
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [02:26<00:00,  6.83it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  146.43    
Total input tokens:                      1024000   
Total generated tokens:                  119701    
Request throughput (req/s):              6.83      
Output token throughput (tok/s):         817.48    
Total Token throughput (tok/s):          7810.75   
---------------Time to First Token----------------
Mean TTFT (ms):                          14575.11  
Median TTFT (ms):                        13606.50  
P99 TTFT (ms):                           29954.96  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          297.01    
Median TPOT (ms):                        282.46    
P99 TPOT (ms):                           1393.69   
---------------Inter-token Latency----------------
Mean ITL (ms):                           262.67    
Median ITL (ms):                         132.89    
P99 ITL (ms):                            2840.40   
==================================================

But running vLLM with VLLM_USE_V1=1 I got an error TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'q_lora_rank', with previous warnings like

`torch.compile` is turned on, but the model deepseek-ai/DeepSeek-R1 does not support it. Please open an issue on GitHubif you want it to be supported.

@bao231

bao231 commented Feb 4, 2025

V1 does not support T4; are you going to support it?

@bao231

bao231 commented Feb 4, 2025

@simon-mo

@WoosukKwon
Collaborator

Hi @bao231, V1 does not support T4 or older-generation GPUs since the kernel libraries used in V1 (e.g., flash-attn) do not support them.

@bao231

bao231 commented Feb 4, 2025

Will V1 support other attention libs? Do you have a plan? @WoosukKwon

@robertgshaw2-redhat
Collaborator

I did a little experiment with DeepSeek-R1 on 8xH200 GPU. [...] But running vLLM with VLLM_USE_V1=1 I got an error TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'q_lora_rank' [...] (full benchmark results quoted from the comment above)

Thanks!

  • We are aware of the performance gap for DeepSeekV3 and are actively working on it. See [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) #12676, which will resolve the gap. We will hopefully do a release with this change today.
  • DeepSeekV3 is not yet supported on V1 since it requires chunked prefill. We are actively working on chunked prefill for MLA and hope to have it complete this week!

@robertgshaw2-redhat
Collaborator

Either I'm going insane, or with V1 the Qwen 8B instruct LLM just breaks in fp8: around 25% of generations are just gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and I need some specific setup of sampling params for it to work in V1?

Can you provide more detailed reproduction instructions?

cc @WoosukKwon

@robertgshaw2-redhat
Collaborator

👍 I have not done a proper benchmark but V1 feels superior, i.e. higher throughput + lower latency, TTFT. The other thing that I have noticed is that logging has changed Running: 1 reqs, Waiting: 0 reqs, it used to print stats such token/s.
I have encountered a possible higher memory consumption issue, but am overall very pleased with the vllm community's hard work on V1. #12529

Thanks for fixing metrics logs in 0.7.1! Lack of pipeline parallelism in V1 is a show stopper for production deployments #11945

Thanks. We are actively working on PP

@robertgshaw2-redhat
Collaborator

Maybe there is a better place to discuss this but the implementation for models that use more than one extra modality is quite non-intuitive. get_multimodal_embeddings() expects that we return a list or tensor of length equal to the number of multimodal items provided in the batch and we then have to make unintuitive assumptions on how the output passed into get_input_embeddings would look like because the batching being used while calling both functions is not the same. It would be much nicer if for example the input and output of get_multimodal_embeddings are dicts with the keys being the different modalities.

Check out #sig-multi-modality in our slack! This is the best place for a discussion like this

@robertgshaw2-redhat
Collaborator

Something is weird with memory calculation in V1 and tensor parallel. Here are 2 cases that I tested recently:

vllm 0.7.0 on 2x A6000:

Starting normally a 32b-awq model and using --max-model-len 32768 --gpu-memory-utilization 0.98 --tensor-parallel 2 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768

Everything works as previously, GPUs both get to ~44-46GB usage

Using VLLM_USE_V1=1 and the exact same parameters as above:

GPUs both load up to ~24-25GB and it slowly goes up as inference runs. I've seen it go up to 32GB on each GPU.

Updating to vllm 0.7.1 and running a 7b-awq model this time, I also noticed that running the above command "normally" the logs show Maximum concurrency at 44x

Using V1 I get:

INFO 02-02 23:26:19 kv_cache_utils.py:400] Maximum concurrency for 32768 tokens per request: **22.25x**

And finally, with vllm 0.7.0 and 4x L4 loading a 32b-awq model with tp 4 works in "normal mode", but OOMs with V1.

It's pretty hard to follow what you are seeing. Please attach:

  • launch command
  • logs

Thanks!

@gmonair

gmonair commented Feb 4, 2025

Its pretty hard to follow what you are seeing. Please attach:

* launch command

* logs

Hi, please see vllm_output(27)-OOM.log for OOM on 4x L4 and vllm_output(28)-WORKS.log to compare. The only difference between them is the V1 flag.

Launch command

my_env = os.environ.copy()
my_env["VLLM_USE_V1"] = "0"

# background task
command = [
    "python", 
    "-m", 
    "vllm.scripts", 
    "serve",
    "/kaggle/input/qwen25/transformers/r1-32b-awq/1",
    "--served-model-name", "model",
    "--tensor_parallel_size", "4",
    "--gpu_memory_utilization", "0.95",
    "--port", "9901",
    "--max-num-batched-tokens", "32768",
    "--max-seq-len-to-capture", "32768",
    "--max-model-len", "32768",
    "--enable_prefix_caching",
]

process = subprocess.Popen(command, stdout=log_file, stderr=log_file, env=my_env)

vllm_output(28)-WORKS.log
vllm_output(27)-OOM.log

@njhill njhill added the v1 label Feb 4, 2025
@caoyang-lqp

caoyang-lqp commented Feb 5, 2025

I ran the following code after upgrading to the V1 version of vLLM and encountered an error:

import subprocess
import os

my_env = os.environ.copy()
my_env["VLLM_USE_V1"] = "1"
command = [
    "python",
    "-m",
    "vllm.scripts",
    "serve",
    "./pretrained/intervl2-8B",
    "--served-model-name", "intervl2-8B",
    "--tensor_parallel_size", "2",
    "--limit-mm-per-prompt", "image=10",
    "--pipeline-parallel-size", "1",
    "--gpu_memory_utilization", "0.9",
    "--port", "40004",
    "--max-num-batched-tokens", "10000",
    "--max-seq-len-to-capture", "10000",
    "--max-model-len", "10000",
    "--enable_prefix_caching",
    "--trust_remote_code",
]
process = subprocess.Popen(command, env=my_env)

(error screenshot attached to the original comment)

However, if --tensor_parallel_size is set to 1, it works fine. Is there a compatibility issue between the V1 version and multi-GPU model deployment?

@rstanislav

rstanislav commented Feb 5, 2025

With dual rtx3090 in V1:
VLLM_USE_V1=1 REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt CUDA_DEVICE_ORDER=PCI_BUS_ID OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=0,1 vllm serve kosbu/QVQ-72B-Preview-AWQ --tensor-parallel-size 2 --gpu-memory-utilization 0.99 --api-key aaaaa --max-model-len 7000 --quantization=awq_marlin --enforce-eager

CUDA out of memory. Tried to allocate 594.00 MiB. GPU 0 has a total capacity of 23.48 GiB of which 587.38 MiB is free. Including non-PyTorch memory, this process has 22.89 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 815.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation

With V0 it works; something has changed about memory in V1.
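
If fragmentation is the culprit here, one thing worth trying is the allocator setting that the error message itself suggests; a minimal sketch, mirroring the reporter's command in the subprocess style used elsewhere in this thread (this is a workaround attempt, not a confirmed fix):

# Sketch: retry the same V1 launch with the allocator option suggested by the OOM
# message. Everything else mirrors the command above.
import os
import subprocess

env = os.environ.copy()
env["VLLM_USE_V1"] = "1"
env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # from the error text

command = [
    "vllm", "serve", "kosbu/QVQ-72B-Preview-AWQ",
    "--tensor-parallel-size", "2",
    "--gpu-memory-utilization", "0.99",
    "--max-model-len", "7000",
    "--quantization", "awq_marlin",
    "--enforce-eager",
]
subprocess.Popen(command, env=env)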

@JaheimLee

Will V1 support flashinfer in the future?

@imkero
Contributor

imkero commented Feb 6, 2025

Does V1 support FP8 (W8A8) quantization?

I tried nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic on v0.7.1 V1 arch, no error thrown but got gibberish result. Same code and model works properly on v0.7.1 V0 arch.


UPDATE: it works on v0.7.1 V1 arch in eager mode, but is broken on v0.7.1 V1 arch in torch.compile mode. I'm figuring out whether this problem is model-dependent or not.

UPDATE: tried another model, nm-testing/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic, and the same bug presents on v0.7.1 V1 arch in torch.compile mode.


UPDATE: it works after I turned custom_ops on (changed "none" to "all"):

vllm/vllm/config.py

Lines 3237 to 3249 in 3ee696a

if envs.VLLM_USE_V1 and self.model_config is not None and \
    not self.model_config.enforce_eager:
    # NOTE(woosuk): Currently, we use inductor because the piecewise
    # CUDA graphs do not work properly with the custom CUDA kernels.
    # FIXME(woosuk): Disable inductor to reduce the compilation time
    # and avoid any potential issues with the inductor.
    self.compilation_config.custom_ops = ["none"]
    self.compilation_config.use_cudagraph = True
    self.compilation_config.use_inductor = True
    self.compilation_config.cudagraph_num_of_warmups = 1
    self.compilation_config.pass_config.enable_fusion = False
    self.compilation_config.pass_config.enable_reshape = False
    self.compilation_config.level = CompilationLevel.PIECEWISE

@lyhh123

lyhh123 commented Feb 8, 2025

When I tested the fine-tuned Qwen2.5_VL_3B model service using v1 mode (by setting the environment variable VLLM_USE_V1=1) and the default mode in OpenAI-compatible mode, I found inconsistencies in the output results.

I tested two samples:
• First sample: In v1 mode, the output was less than half of the expected result, while the default mode produced the complete output.
• Second sample: In v1 mode, the output was mostly complete but contained many obvious errors, whereas the default mode was correct and complete.

I conducted the same comparative experiment on Qwen2VL, and both v1 and default modes produced correct outputs.

Has anyone else encountered a similar issue? If so, could this indicate a compatibility issue between v1 mode and Qwen2.5_VL_3B?

@robertgshaw2-redhat
Collaborator

When I tested the fine-tuned Qwen2.5_VL_3B model service using v1 mode (by setting the environment variable VLLM_USE_V1=1) and the default mode in OpenAI-compatible mode, I found inconsistencies in the output results.

I tested two samples: • First sample: In v1 mode, the output was less than half of the expected result, while the default mode produced the complete output. • Second sample: In v1 mode, the output was mostly complete but contained many obvious errors, whereas the default mode was correct and complete.

I conducted the same comparative experiment on Qwen2VL, and both v1 and default modes produced correct outputs.

Has anyone else encountered a similar issue? If so, could this indicate a compatibility issue between v1 mode and Qwen2.5_VL_3B?

cc @ywang96

@ywang96
Member

ywang96 commented Feb 8, 2025

When I tested the fine-tuned Qwen2.5_VL_3B model service using v1 mode (by setting the environment variable VLLM_USE_V1=1) and the default mode in OpenAI-compatible mode, I found inconsistencies in the output results.

I tested two samples: • First sample: In v1 mode, the output was less than half of the expected result, while the default mode produced the complete output. • Second sample: In v1 mode, the output was mostly complete but contained many obvious errors, whereas the default mode was correct and complete.

I conducted the same comparative experiment on Qwen2VL, and both v1 and default modes produced correct outputs.

Has anyone else encountered a similar issue? If so, could this indicate a compatibility issue between v1 mode and Qwen2.5_VL_3B?

@lyhh123 can you open a separate issue for this and share some examples? There are multiple layers so I want to take a look where the issue might be.

I conducted the same comparative experiment on Qwen2VL, and both v1 and default modes produced correct outputs.

This is also an interesting observation, since the V1 re-arch for multimodal models should be model-agnostic, so I'm curious to see where the problem comes from.

@lyhh123

lyhh123 commented Feb 11, 2025

When I tested the fine-tuned Qwen2.5_VL_3B model service using v1 mode (by setting the environment variable VLLM_USE_V1=1) and the default mode in OpenAI-compatible mode, I found inconsistencies in the output results.
I tested two samples: • First sample: In v1 mode, the output was less than half of the expected result, while the default mode produced the complete output. • Second sample: In v1 mode, the output was mostly complete but contained many obvious errors, whereas the default mode was correct and complete.
I conducted the same comparative experiment on Qwen2VL, and both v1 and default modes produced correct outputs.
Has anyone else encountered a similar issue? If so, could this indicate a compatibility issue between v1 mode and Qwen2.5_VL_3B?

@lyhh123 can you open a separate issue for this and share some examples? There are multiple layers so I want to take a look where the issue might be.

I conducted the same comparative experiment on Qwen2VL, and both v1 and default modes produced correct outputs.

This is also an interesting observation, since the V1 re-arch for multimodal models should be model-agnostic, so I'm curious to see where the problem comes from.

Thank you for paying attention to my issue. Two days ago, I encountered this problem during testing. Over the past two days, I have made a series of attempts to adjust the sampling parameters, mainly by modifying top_p or other parameters to maintain output stability as much as possible.

Currently, I have re-tested using the v1/default model and the qwen2.5vl-3B model. Apart from content related to coordinates, the outputs have remained largely consistent. I attempted to adjust the parameters but was unable to reproduce the issue from two days ago.

I still remember that, with fixed parameters at that time, there were unexpected differences across multiple outputs between the v1 and default modes. However, I cannot rule out the possibility of other potential variables affecting the results at that time. I will do my best to identify the root cause of the issue, and if I make any relevant discoveries, I will update you promptly.

@fan-niu

fan-niu commented Feb 13, 2025

@robertgshaw2-redhat Hi, can we now get higher generated-token throughput with V1 than with V0 on DeepSeek-R1?

@WoosukKwon
Collaborator

@imkero Is the bug fixed now (without the change you suggested)? I wasn't able to reproduce the bug with the latest main.

@imkero
Contributor

imkero commented Feb 13, 2025

@imkero Is the bug fixed now (without the change you suggested)? I wasn't able to reproduce the bug with the latest main.

@WoosukKwon

I made a mistake. I found that it's use_cudagraph (not custom_ops) that matters for this problem. I opened an issue, #13212, to describe the problem in detail and for further discussion.

@mru4913

mru4913 commented Feb 15, 2025

Is it possible to update Command R with V1 support?

@Swipe4057

Hello, in our production environment we run Qwen2.5 with a tokenizer adapted for the language of our country. Currently we are scaling across the entire company, and we are trying to analyze the performance of vLLM/SGLang. Unfortunately, V1 does not support Qwen 2.5. Moreover, in our production environment structured JSON output and speculative decoding are actively used, and we really need these features together.
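
For reference, the structured-JSON use case mentioned here, sketched against vLLM's OpenAI-compatible server using the guided_json extra parameter (server URL, model name, and schema are placeholders; whether this combines with speculative decoding on V1 is exactly the gap being described):

# Sketch of a structured JSON output request via guided_json (placeholders only).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="my-qwen2.5-model",
    messages=[{"role": "user", "content": "Extract the person: Alice is 31 years old."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)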

@shermansiu

shermansiu commented Feb 16, 2025

vllm-flash-attn and flash-attn are currently separate packages. This means that if we are to add this to Conda-forge (e.g. conda-forge/staged-recipes#28931), then we'd have to maintain two very similar packages, which isn't ideal.

Will the changes made in vllm-project/flash-attention be eventually sync'd with Dao-AILab/flash-attention?

@fxmarty-amd

fxmarty-amd commented Feb 17, 2025

On ce77eb9, in my case the CUDA graph recorded through torch.compile using VllmBackend with V1 gets stuck when replayed if the number of scheduled tokens in GPUModelRunner.execute_model is below the largest of the cudagraph_capture_sizes, specifically when the first num_scheduled_tokens is NOT a multiple of 8 and so gets padded to fit a captured size. There is no issue if the first num_scheduled_tokens does not get padded.

For example, launching vLLM with VLLM_USE_V1=1 vllm serve meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1 -O3, and running from lm-eval-harness:

lm_eval \
    --model local-completions \
    --model_args model=meta-llama/Llama-2-7b-chat-hf,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=9999999,timeout=3600,tokenized_requests=False \
    --tasks gsm8k \
    --limit 32 \
    --gen_kwargs "do_sample=False,top_p=1,temperature=1"

works well (even though later on some sequences in the batch hit EOS and we'd get e.g. to 31 tokens that will get padded), however, using:

lm_eval \
    --model local-completions \
    --model_args model=meta-llama/Llama-2-7b-chat-hf,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=9999999,timeout=3600,tokenized_requests=False \
    --tasks gsm8k \
    --limit 30 \
    --gen_kwargs "do_sample=False,top_p=1,temperature=1"

which first hits a CUDA graph with some padding (precisely 2 padding tokens), things get stuck as soon as one tries to sync the model outputs from the device (e.g. by printing the hidden_states here:

hidden_states = hidden_states[:num_scheduled_tokens]
).

gpu utilization is shown in rocm-smi.

Hitting the issue both on torch==2.5.1+rocm6.2.4 and torch==2.7.0.dev20250217+rocm6.3.

It might only be a ROCm issue though.

@robertgshaw2-redhat
Collaborator

On ce77eb9, in my case the CUDA graph recorded through torch.compile using VllmBackend with V1 gets stuck when replayed [...] It might only be a ROCm issue though. (full report quoted from the comment above)

@SageMoore FYI

@huang-junhong

I started successfully with 2 * 8 A100 (40G); for me, the key is --gpu-memory-utilization 0.8 --quantization moe_wna16.
So my full start command is: VLLM_LOGGING_LEVEL=DEBUG python -m vllm.entrypoints.openai.api_server --model ???/r1_AWQ --served-model-name ??? --port ?? --pipeline-parallel-size 2 --tensor-parallel-size 8 --max_model_len 4096 --enable-reasoning --reasoning-parser deepseek_r1 --dtype float16 --trust-remote-code --gpu-memory-utilization 0.8 --quantization moe_wna16

The speed is 18.6~18.8 TPS; note my cards are without NVLink.
The key package versions are:
vllm 0.7.2
ray 2.42.0
pytorch 2.5.1

@xzwczy

xzwczy commented Feb 20, 2025

root@ecm-bb90:/var/model# pip show vllm
Name: vllm
Version: 0.7.2

root@ecm-bb90:/var/model# pip show transformers
Name: transformers
Version: 4.49.0

use: CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8212 --tensor-parallel-size 4 --model /var/model/Qwen2.5-VL-72B-Instruct --gpu-memory-utilization 0.9 --max-model-len 8192 --served-model-name "Qwen2.5-VL-72B-Instruct

root@ecm-bb90:/var/model# tail -f vllmoutput2.log
self.multimodal_config = self._init_multimodal_config(
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 424, in _init_multimodal_config
if ModelRegistry.is_multimodal_model(architectures):
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 445, in is_multimodal_model
model_cls, _ = self.inspect_model_cls(architectures)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 405, in inspect_model_cls
return self._raise_for_unsupported(architectures)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 357, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.
ERROR 02-20 14:39:59 registry.py:306] Error in inspecting model architecture 'Qwen2_5_VLForConditionalGeneration'
ERROR 02-20 14:39:59 registry.py:306] Traceback (most recent call last):
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 507, in _run_in_subprocess
ERROR 02-20 14:39:59 registry.py:306] returned.check_returncode()
ERROR 02-20 14:39:59 registry.py:306] File "/usr/lib/python3.10/subprocess.py", line 457, in check_returncode
ERROR 02-20 14:39:59 registry.py:306] raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 02-20 14:39:59 registry.py:306] subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 02-20 14:39:59 registry.py:306]
ERROR 02-20 14:39:59 registry.py:306] The above exception was the direct cause of the following exception:
ERROR 02-20 14:39:59 registry.py:306]
ERROR 02-20 14:39:59 registry.py:306] Traceback (most recent call last):
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 304, in _try_inspect_model_cls
ERROR 02-20 14:39:59 registry.py:306] return model.inspect_model_cls()
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 275, in inspect_model_cls
ERROR 02-20 14:39:59 registry.py:306] return _run_in_subprocess(
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 510, in _run_in_subprocess
ERROR 02-20 14:39:59 registry.py:306] raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 02-20 14:39:59 registry.py:306] RuntimeError: Error raised in subprocess:
ERROR 02-20 14:39:59 registry.py:306] /usr/lib/python3.10/runpy.py:126: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 02-20 14:39:59 registry.py:306] warn(RuntimeWarning(msg))
ERROR 02-20 14:39:59 registry.py:306] Traceback (most recent call last):
ERROR 02-20 14:39:59 registry.py:306] File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
ERROR 02-20 14:39:59 registry.py:306] return _run_code(code, main_globals, None,
ERROR 02-20 14:39:59 registry.py:306] File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
ERROR 02-20 14:39:59 registry.py:306] exec(code, run_globals)
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 531, in
ERROR 02-20 14:39:59 registry.py:306] _run()
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 524, in _run
ERROR 02-20 14:39:59 registry.py:306] result = fn()
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 276, in
ERROR 02-20 14:39:59 registry.py:306] lambda: _ModelInfo.from_model_cls(self.load_model_cls()))
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 279, in load_model_cls
ERROR 02-20 14:39:59 registry.py:306] mod = importlib.import_module(self.module_name)
ERROR 02-20 14:39:59 registry.py:306] File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
ERROR 02-20 14:39:59 registry.py:306] return _bootstrap._gcd_import(name[level:], package, level)
ERROR 02-20 14:39:59 registry.py:306] File "", line 1050, in _gcd_import
ERROR 02-20 14:39:59 registry.py:306] File "", line 1027, in _find_and_load
ERROR 02-20 14:39:59 registry.py:306] File "", line 1006, in _find_and_load_unlocked
ERROR 02-20 14:39:59 registry.py:306] File "", line 688, in _load_unlocked
ERROR 02-20 14:39:59 registry.py:306] File "", line 883, in exec_module
ERROR 02-20 14:39:59 registry.py:306] File "", line 241, in _call_with_frames_removed
ERROR 02-20 14:39:59 registry.py:306] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 36, in
ERROR 02-20 14:39:59 registry.py:306] from transformers.models.qwen2_5_vl import (Qwen2_5_VLImageProcessor,
ERROR 02-20 14:39:59 registry.py:306] ImportError: cannot import name 'Qwen2_5_VLImageProcessor' from 'transformers.models.qwen2_5_vl' (/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/__init__.py)
ERROR 02-20 14:39:59 registry.py:306]
ERROR 02-20 14:39:59 engine.py:389] Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.
ERROR 02-20 14:39:59 engine.py:389] Traceback (most recent call last):
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine
ERROR 02-20 14:39:59 engine.py:389] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 118, in from_engine_args
ERROR 02-20 14:39:59 engine.py:389] engine_config = engine_args.create_engine_config(usage_context)
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1075, in create_engine_config
ERROR 02-20 14:39:59 engine.py:389] model_config = self.create_model_config()
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 998, in create_model_config
ERROR 02-20 14:39:59 engine.py:389] return ModelConfig(
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 364, in __init__
ERROR 02-20 14:39:59 engine.py:389] self.multimodal_config = self._init_multimodal_config(
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 424, in _init_multimodal_config
ERROR 02-20 14:39:59 engine.py:389] if ModelRegistry.is_multimodal_model(architectures):
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 445, in is_multimodal_model
ERROR 02-20 14:39:59 engine.py:389] model_cls, _ = self.inspect_model_cls(architectures)
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 405, in inspect_model_cls
ERROR 02-20 14:39:59 engine.py:389] return self._raise_for_unsupported(architectures)
ERROR 02-20 14:39:59 engine.py:389] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 357, in _raise_for_unsupported
ERROR 02-20 14:39:59 engine.py:389] raise ValueError(
ERROR 02-20 14:39:59 engine.py:389] ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.

How can I fix this?

@nku-zhichengzhang

I am trying to run inference on Qwen2.5-VL-72B for video processing using 4xA800 GPUs. However, I encountered errors when executing the code with VLLM V1, whereas it works correctly with VLLM V0 by setting VLLM_USE_V1=0.
#13629

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
    tensor_parallel_size=4,
    gpu_memory_utilization=0.7
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)
question = ''
messages = [
    {"role": "system", "content": "You are a good video analyst"},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": file,
            },
            {"type": "text", "text": question},
        ],
    }
]
prompt = self.processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    # FPS will be returned in video_kwargs
    #"mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374] WorkerProc hit an exception: %s
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374] Traceback (most recent call last):
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in worker_busy_loop
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]     output = func(*args, **kwargs)
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]     return func(*args, **kwargs)
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]     return func(*args, **kwargs)
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 873, in execute_model
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]     self._update_states(scheduler_output)
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 331, in _update_states
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]     MRotaryEmbedding.get_input_positions_tensor(
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 929, in get_input_positions_tensor
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]     video_second_per_grid_t = second_per_grid_ts[video_index]
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374]                               ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(VllmWorker rank=3 pid=80742) ERROR 02-21 04:40:48 multiproc_executor.py:374] IndexError: list index out of range
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374] WorkerProc hit an exception: %s
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374] Traceback (most recent call last):
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in worker_busy_loop
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]     output = func(*args, **kwargs)
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 873, in execute_model
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]     self._update_states(scheduler_output)
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 331, in _update_states
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]     MRotaryEmbedding.get_input_positions_tensor(
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 929, in get_input_positions_tensor
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]     video_second_per_grid_t = second_per_grid_ts[video_index]
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374]                               ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(VllmWorker rank=0 pid=80696) ERROR 02-21 04:40:48 multiproc_executor.py:374] IndexError: list index out of range
ERROR 02-21 04:40:48 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-21 04:40:48 core.py:291]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 284, in run_engine_core
ERROR 02-21 04:40:48 core.py:291]     engine_core.run_busy_loop()
ERROR 02-21 04:40:48 core.py:291]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 327, in run_busy_loop
ERROR 02-21 04:40:48 core.py:291]     outputs = step_fn()
ERROR 02-21 04:40:48 core.py:291]               ^^^^^^^^^
ERROR 02-21 04:40:48 core.py:291]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 154, in step
ERROR 02-21 04:40:48 core.py:291]     output = self.model_executor.execute_model(scheduler_output)
ERROR 02-21 04:40:48 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 04:40:48 core.py:291]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 75, in execute_model
ERROR 02-21 04:40:48 core.py:291]     output = self.collective_rpc("execute_model",
ERROR 02-21 04:40:48 core.py:291]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-21 04:40:48 core.py:291]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 02-21 04:40:48 core.py:291]     raise e
ERROR 02-21 04:40:48 core.py:291]   File "/home/zhangzhicheng03/anaconda3/envs/vllm1/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 02-21 04:40:48 core.py:291]     raise result
ERROR 02-21 04:40:48 core.py:291] IndexError: list index out of range
ERROR 02-21 04:40:48 core.py:291] 
CRITICAL 02-21 04:40:49 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

@npuichigo

npuichigo commented Feb 22, 2025

Cannot work with Qwen 2 #13284

@lbeisteiner

CUDA error when using [ngram] speculative decoding: #13673

@DefinitlyEvil

Somehow GGUF isn't working great: it crashes the V1 engine with weird CUDA errors and OOM errors, but no such errors show up with V0.

@echozyr2001

On the 2 * 8 H20 machines, I start Docker with the following compose files

Node1

services:
  v3-1-vllm:
    container_name: v3-1-vllm
    image: vllm/vllm-openai:latest
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "1024g"
    ipc: "host"
    network_mode: "host"
    volumes:
      - /data/deepseek-v3:/root/.cache/huggingface
      - /data/torchcache-v3:/root/torchcache
    environment:
      - VLLM_HOST_IP=10.0.251.33
      - GLOO_SOCKET_IFNAME=ens12f0np0
      - NCCL_SOCKET_IFNAME=ibs1
      - NCCL_IB_ALLOW=1
      - NCCL_IB_DISABLE=0
      - NCCL_IB_CUDA_SUPPORT=1
      - NCCL_IB_HCA=ibp1
      - NCCL_IB_RETRY_CNT=13
      - NCCL_IB_GID_INDEX=3
      - NCCL_NET_GDR_LEVEL=2
      - NCCL_IB_TIMEOUT=22
      - NCCL_DEBUG=INFO
      - NCCL_P2P_LEVEL=NVL
      - NCCL_CROSS_NIC=1
      - NCCL_NET_GDR_LEVEL=SYS
    entrypoint:
      - /bin/bash
      - -c
      - |
        (nohup ray start --disable-usage-stats --block --head --port=6379 > /init.log 2>&1 &)
        sleep 10 && python3 -m vllm.entrypoints.openai.api_server \
          --served-model-name "deepseek/deepseek-v3" \
          --model /root/.cache/huggingface \
          --host 0.0.0.0 \
          --port 30000 \
          --enable-prefix-caching \
          --enable-chunked-prefill \
          --pipeline-parallel-size  2 \
          --tensor-parallel-size  8 \
          --gpu-memory-utilization 0.95 \
          --max-model-len 64128 \
          --max-num-batched-tokens 8192 \
          --scheduling-policy priority \
          --trust-remote-code \
          --max-num-seqs 12 \
          --swap-space 16 \
          --block_size 32 \
          --disable_log_requests
    restart: always

Node2

services:
  v3-2-vllm:
    container_name: v3-2-vllm # container name
    image: vllm/vllm-openai:latest
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "1024g"
    ipc: "host"
    network_mode: "host"
    volumes:
      - /data/deepseek-v3:/root/.cache/huggingface
      - /data/torchcache-v3:/root/torchcache
    environment:
      - VLLM_HOST_IP=10.0.251.15
      - GLOO_SOCKET_IFNAME=ens12f0np0
      - NCCL_SOCKET_IFNAME=ibs1
      - NCCL_IB_ALLOW=1
      - NCCL_IB_DISABLE=0
      - NCCL_IB_CUDA_SUPPORT=1
      - NCCL_IB_HCA=ibp1
      - NCCL_IB_RETRY_CNT=13
      - NCCL_IB_GID_INDEX=3
      - NCCL_NET_GDR_LEVEL=2
      - NCCL_IB_TIMEOUT=22
      - NCCL_DEBUG=INFO
      - NCCL_P2P_LEVEL=NVL
      - NCCL_CROSS_NIC=1
      - NCCL_NET_GDR_LEVEL=SYS
    entrypoint:
      - /bin/bash
      - -c
      - |
         ray start --block --disable-usage-stats --address=10.0.251.33:6379
    restart: always

But I have a problem

WARNING 02-25 21:31:47 config.py:3473] torch.compile is turned on, but the model /root/.cache/huggingface/hub/deepseek-ai/DeepSeek-V3 does not support it. Please open an issue on GitHubif you want it to be supported.

How should I specify the model name and model path?
