
[Misc] Add multipstep chunked-prefill support for FlashInfer #10467

Merged
merged 3 commits into vllm-project:main from chunked_multistep on Jan 15, 2025

Conversation

elfiegg
Contributor

@elfiegg elfiegg commented Nov 20, 2024

Support multi-step scheduling for chunked-prefill on FlashInfer, where prefill tokens are turned into decode tokens after the first single step.
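
A minimal conceptual sketch of the idea (illustrative only; the class and field names below are hypothetical and do not mirror vLLM's internal data structures): in a multi-step schedule, a chunked-prefill sequence consumes its prompt chunk in the first step and is then treated as a decode (query length 1) for every remaining step.

```python
from dataclasses import dataclass


@dataclass
class SeqState:
    prompt_len: int   # tokens in the current prompt chunk
    is_prefill: bool  # True until the sequence has produced its first token


def step_query_lens(seqs: list[SeqState], num_steps: int) -> list[list[int]]:
    """Return the query length of every sequence at every scheduler step."""
    lens_per_step = []
    for step in range(num_steps):
        lens = []
        for seq in seqs:
            if seq.is_prefill:
                # First step: the remaining prompt chunk is processed at once.
                lens.append(seq.prompt_len)
            else:
                # Later steps: a single new token is decoded.
                lens.append(1)
        lens_per_step.append(lens)
        # After the first step every prefill has produced its first token,
        # so it is treated as a decode for the remaining steps.
        for seq in seqs:
            seq.is_prefill = False
    return lens_per_step


# Example: one chunked prefill of 512 tokens plus one ongoing decode, 3 steps.
print(step_query_lens([SeqState(512, True), SeqState(1, False)], num_steps=3))
# [[512, 1], [1, 1], [1, 1]]
```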

cc @comaniac @yzh199 @WoosukKwon @youkaichao


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

"specific parameter.")

if turn_prefills_into_decodes:
# When mutli-Step is enabled with chunked-Prefill, prefills and

Suggested change
# When mutli-Step is enabled with chunked-Prefill, prefills and
# When Multi-Step is enabled with Chunked-Prefill, prefills and

@comaniac
Collaborator

Please fix the linting.

@comaniac comaniac added the `ready` label (ONLY add when PR is ready to merge / full CI is needed) Nov 20, 2024
@taegeonum

I've got the following error:

(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner.py", line 1624, in execute_model
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]     self.attn_state.begin_forward(model_input)
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/attention/backends/flashinfer.py", line 262, in begin_forward
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]     state = (self.runner.graph_runners[model_input.virtual_engine]
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] KeyError: 2052
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] 

@elfiegg
Contributor Author

elfiegg commented Nov 21, 2024

I've got the following error:

(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner.py", line 1624, in execute_model
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]     self.attn_state.begin_forward(model_input)
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/attention/backends/flashinfer.py", line 262, in begin_forward
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229]     state = (self.runner.graph_runners[model_input.virtual_engine]
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] KeyError: 2052
(VllmWorkerProcess pid=505659) ERROR 11-21 10:00:59 multiproc_worker_utils.py:229] 

Could you please share the repro command? Thank you! @taegeonum

@taegeonum

@elfiegg VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model <any_model> --quantization fp8 --kv-cache-dtype fp8 --gpu-memory-utilization 0.95 --num-scheduler-steps 10 --enable-chunked-prefill True --max-num-batched-tokens 512

@elfiegg elfiegg force-pushed the chunked_multistep branch 3 times, most recently from ad48534 to 97d6859 on November 27, 2024 at 23:35
@elfiegg
Contributor Author

elfiegg commented Nov 27, 2024

@taegeonum apologies for the delay - the bug was in CUDA graph mode and has now been fixed. Tests are configured.

@taegeonum

@elfiegg Thanks! But I got another error:


ERROR 11-28 15:40:26 engine.py:366] Error in model execution: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 0 vs 1
ERROR 11-28 15:40:26 engine.py:366] Traceback (most recent call last):
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 11-28 15:40:26 engine.py:366]     return func(*args, **kwargs)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner.py", line 1652, in execute_model
ERROR 11-28 15:40:26 engine.py:366]     self.attn_state.begin_forward(model_input)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/attention/backends/flashinfer.py", line 268, in begin_forward
ERROR 11-28 15:40:26 engine.py:366]     model_input.attn_metadata.begin_forward()
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/attention/backends/flashinfer.py", line 385, in begin_forward
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/flashinfer/decode.py", line 530, in plan
ERROR 11-28 15:40:26 engine.py:366]     self._wrapper.plan(
ERROR 11-28 15:40:26 engine.py:366] RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 0 vs 1
ERROR 11-28 15:40:26 engine.py:366]
ERROR 11-28 15:40:26 engine.py:366] The above exception was the direct cause of the following exception:
ERROR 11-28 15:40:26 engine.py:366]
ERROR 11-28 15:40:26 engine.py:366] Traceback (most recent call last):
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-28 15:40:26 engine.py:366]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 11-28 15:40:26 engine.py:366]     return cls(ipc_path=ipc_path,
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 11-28 15:40:26 engine.py:366]     self.engine = LLMEngine(*args, **kwargs)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/engine/llm_engine.py", line 338, in __init__
ERROR 11-28 15:40:26 engine.py:366]     self._initialize_kv_caches()
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/engine/llm_engine.py", line 476, in _initialize_kv_caches
ERROR 11-28 15:40:26 engine.py:366]     self.model_executor.determine_num_available_blocks())
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
ERROR 11-28 15:40:26 engine.py:366]     num_blocks = self._run_workers("determine_num_available_blocks", )
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/executor/multiproc_gpu_executor.py", line 195, in _run_workers
ERROR 11-28 15:40:26 engine.py:366]     driver_worker_output = driver_worker_method(*args, **kwargs)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-28 15:40:26 engine.py:366]     return func(*args, **kwargs)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/worker/worker.py", line 198, in determine_num_available_blocks
ERROR 11-28 15:40:26 engine.py:366]     self.model_runner.profile_run()
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/worker/multi_step_model_runner.py", line 662, in profile_run
ERROR 11-28 15:40:26 engine.py:366]     return self._base_model_runner.profile_run()
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-28 15:40:26 engine.py:366]     return func(*args, **kwargs)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner.py", line 1343, in profile_run
ERROR 11-28 15:40:26 engine.py:366]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-28 15:40:26 engine.py:366]     return func(*args, **kwargs)
ERROR 11-28 15:40:26 engine.py:366]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner_base.py", line 146, in _wrapper
ERROR 11-28 15:40:26 engine.py:366]     raise type(err)(f"Error in model execution: "
ERROR 11-28 15:40:26 engine.py:366] RuntimeError: Error in model execution: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 0 vs 1
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner.py", line 1652, in execute_model
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     self.attn_state.begin_forward(model_input)
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/attention/backends/flashinfer.py", line 268, in begin_forward
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     model_input.attn_metadata.begin_forward()
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/attention/backends/flashinfer.py", line 385, in begin_forward
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     self.decode_wrapper.begin_forward(
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/flashinfer/decode.py", line 530, in plan
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     self._wrapper.plan(
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229] RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 0 vs 1
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/worker.py", line 198, in determine_num_available_blocks
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     self.model_runner.profile_run()
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/multi_step_model_runner.py", line 662, in profile_run
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     return self._base_model_runner.profile_run()
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner.py", line 1343, in profile_run
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/venv-vllm-0.6.4-latest/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]   File "/group-volume/taegeon/vllm-public/vllm/worker/model_runner_base.py", line 146, in _wrapper
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229]     raise type(err)(f"Error in model execution: "
(VllmWorkerProcess pid=311147) ERROR 11-28 15:40:26 multiproc_worker_utils.py:229] RuntimeError: Error in model execution: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 0 vs 1
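
For context on what this check asserts: FlashInfer's paged KV cache uses a CSR-style `paged_kv_indptr` with one offset per sequence plus a trailing end offset, so its length must be `batch_size + 1`. The `0 vs 1` above suggests an empty indptr was passed for what the planner treated as a zero-sequence batch (here, during profiling). A hedged sketch of a well-formed indptr, illustrative only and not the failing code path:

```python
import torch

# Two sequences occupying 3 and 2 KV-cache pages respectively.
pages_per_seq = [3, 2]
batch_size = len(pages_per_seq)

# CSR-style offsets into the flattened page-index array: length batch_size + 1.
paged_kv_indptr = torch.zeros(batch_size + 1, dtype=torch.int32)
paged_kv_indptr[1:] = torch.cumsum(torch.tensor(pages_per_seq), dim=0)

assert paged_kv_indptr.size(0) == batch_size + 1  # the invariant CHECK_EQ enforces
print(paged_kv_indptr)  # tensor([0, 3, 5], dtype=torch.int32)
```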

@elfiegg
Contributor Author

elfiegg commented Nov 28, 2024

Could you please share the reproduce command? Thanks! @taegeonum

@taegeonum

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model model_path -tensor-parallel-size 8 --quantization fp8 --kv-cache-dtype fp8 --max-num-seqs 500 --max-model-len 32768 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.95 --trust-remote-code --enable-chunked-prefill true --num-scheduler-steps 10

@taegeonum

@elfiegg Hello, any progress? It would be great if we could use multistep + chunked prefill on FlashInfer.

@elfiegg
Contributor Author

elfiegg commented Dec 3, 2024

@taegeonum sure - I'm just back from Thanksgiving vacation and will update this tomorrow

@elfiegg
Contributor Author

elfiegg commented Dec 11, 2024

Hello @JaheimLee, there seems to be a bug in multistep + chunked-prefill CUDA graph mode, where batch tokens are scheduled during profiling when they shouldn't be. The issue with the model config above seems unrelated to multistep + chunked prefill on FlashInfer; I observed a similar failure with FlashAttn.

Also, if you turn off CUDA graph mode via

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model model_path --tensor-parallel-size 8 --quantization fp8 --kv-cache-dtype fp8 --max-num-seqs 500 --max-model-len 32768 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.95 --trust-remote-code --enable-chunked-prefill true --num-scheduler-steps 10 --enforce-eager

This will work. I'm trying to narrow down the issue, and it seems it might relate to this PR: https://github.com/vllm-project/vllm/pull/8645/files. cc @varun-sundar-rabindranath for more context if there is any.
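
For anyone reproducing this offline rather than through the API server, a sketch of the equivalent configuration via vLLM's Python `LLM` API (the model name and sizes are placeholders; the flags mirror the command above), with `enforce_eager=True` as the CUDA-graph workaround:

```python
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # set before importing vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="<any_model>",              # placeholder model path
    tensor_parallel_size=8,
    quantization="fp8",
    kv_cache_dtype="fp8",
    max_num_seqs=500,
    max_model_len=32768,
    max_num_batched_tokens=4096,
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
    enable_chunked_prefill=True,
    num_scheduler_steps=10,
    enforce_eager=True,               # disables CUDA graph capture (the workaround above)
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```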

@elfiegg elfiegg force-pushed the chunked_multistep branch 2 times, most recently from d5fe18f to a4b09d6 on December 12, 2024 at 00:15
@elfiegg
Contributor Author

elfiegg commented Dec 12, 2024

Hello @JaheimLee, can you pull the latest changes and confirm whether they fix the issue? Thanks!

@elfiegg elfiegg force-pushed the chunked_multistep branch 5 times, most recently from c10e056 to d937230 on January 9, 2025 at 06:31

@comaniac comaniac left a comment


Otherwise LGTM. Also per offline discussion, please try to reduce the test time if possible. Thanks for the work!

Comment on lines 319 to 320
self.curr_sliding_window_blocks = (
curr_sliding_window_blocks)

Why are there so many style changes in this file? Could you revert the unrelated changes to make the real changes clear?

@youkaichao youkaichao merged commit 0794e74 into vllm-project:main Jan 15, 2025
76 of 77 checks passed
@elfiegg
Contributor Author

elfiegg commented Jan 15, 2025 via email

ice-tong pushed a commit to ice-tong/vllm that referenced this pull request Jan 18, 2025
@Juelianqvq
Contributor

@elfiegg found RuntimeError: tensor: name = block_tables, shape = [256, 512] is_cont = 1, type = int is not as expected: shape = [1, -1], type = Int with flashinfer v0.1.6+cu124

@elfiegg
Contributor Author

elfiegg commented Jan 21, 2025

@elfiegg found RuntimeError: tensor: name = block_tables, shape = [256, 512] is_cont = 1, type = int is not as expected: shape = [1, -1], type = Int with flashinfer v0.1.6+cu124

Can you provide repro commands? Thanks!

@Juelianqvq
Contributor

Juelianqvq commented Jan 22, 2025

Can you provide repro commands? Thanks!
@elfiegg I turned on Prefix Caching as well.
vllm serve Qwen2.5-72B-Instruct-AWQ -tp 8 --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens 2048 --num-scheduler-steps 8

HwwwwwwwH pushed a commit to HwwwwwwwH/vllm that referenced this pull request Jan 22, 2025
abmfy pushed a commit to abmfy/vllm-flashinfer that referenced this pull request Jan 24, 2025
abmfy pushed a commit to abmfy/vllm-flashinfer that referenced this pull request Jan 24, 2025
Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Feb 2, 2025
hongxiayang added a commit to ROCm/vllm that referenced this pull request Feb 3, 2025
…ntion (#399)

* [V1] Avoid sending text prompt to core engine (vllm-project#11963)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [CI/Build] Add markdown linter (vllm-project#11857)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>

* [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100)

Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764)

* [V1][Core][1/n] Logging and Metrics (vllm-project#11962)

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

* [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685)

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973)

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

* [MISC] fix typo in kv transfer send recv test (vllm-project#11983)

* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979)

* [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972)

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947)

Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>

* [Misc]Minor Changes about Worker (vllm-project#11555)

Signed-off-by: Chenguang Li <757486878@qq.com>

* [platform] add ray_device_key (vllm-project#11948)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980)

Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>

* [Kernel] unified_attention for Attention.forward (vllm-project#11967)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998)

Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

* [Doc] Organise installation documentation into categories and tabs (vllm-project#11935)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [platform] add device_control env var (vllm-project#12009)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516)

Signed-off-by: Shanshan Shen <467638484@qq.com>

* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982)

Signed-off-by: elijah <f1renze.142857@gmail.com>

* [Doc] Fix build from source and installation link in README.md (vllm-project#12013)

Signed-off-by: Yikun <yikunkero@gmail.com>

* Using list

* [Bugfix] Fix deepseekv3 gate bias error (vllm-project#12002)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* Revert "[misc] improve memory profiling (vllm-project#11809)"

This reverts commit 889e662.

* Multi-lingual P3L (#356)

* Commiting the *multilingual* P3L test.

* Created a *multi-lingual* P3L test.

* Making ruff happy.

* .

* Added a reference to the language-scripture Confluence table.

* Typo fixing.

* Harmonizing naming.

* Fixing comments in the header.

---------

Co-authored-by: Alexei V. Ivanov <alivanov@banff-cyxtera-s65-4.amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* Trying to make scales work with compileable attention

* [Docs] Add Sky Computing Lab to project intro (vllm-project#12019)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [HPU][Bugfix] set_forward_context and CI test execution (vllm-project#12014)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Doc] Update Quantization Hardware Support Documentation (vllm-project#12025)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [HPU][misc] add comments for explanation (vllm-project#12034)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix various bugs in multi-modal processor (vllm-project#12031)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Kernel] Revert the API change of Attention.forward (vllm-project#12038)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Platform] Add output for Attention Backend (vllm-project#11981)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (vllm-project#12040)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* Explain where the engine args go when using Docker (vllm-project#12041)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* Docs lint

* [Doc]: Update the Json Example of the `Engine Arguments` document (vllm-project#12045)

* [Misc]  Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (vllm-project#11924)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Kernel] Support MulAndSilu (vllm-project#11624)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (vllm-project#12046)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Platform] move current_memory_usage() into platform (vllm-project#11369)

Signed-off-by: Shanshan Shen <467638484@qq.com>

* [V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Misc] Add multipstep chunked-prefill support for FlashInfer (vllm-project#10467)

* [core] Turn off GPU communication overlap for Ray executor (vllm-project#12051)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

* [core] platform agnostic executor via collective_rpc (vllm-project#11256)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-project#12062)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (vllm-project#12003)

* Fix: cases with empty sparsity config (vllm-project#12057)

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

* Type-fix: make execute_model output type optional (vllm-project#12020)

* [Platform] Do not raise error if _Backend is not found (vllm-project#12023)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>

* [Model]: Support internlm3 (vllm-project#12037)

* Misc: allow to use proxy in `HTTPConnection` (vllm-project#12042)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765)

Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [Doc]: Update `OpenAI-Compatible Server` documents (vllm-project#12082)

* [Bugfix] use right truncation for non-generative tasks (vllm-project#12050)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

* [V1][Core] Autotune encoder cache budget (vllm-project#11895)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090)

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

* Allow hip sources to be directly included when compiling for rocm. (vllm-project#12087)

* [Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* [Doc] Add documentation for specifying model architecture (vllm-project#12105)

* Various cosmetic/comment fixes (vllm-project#12089)

Signed-off-by: mgoin <michael@neuralmagic.com>

* [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (vllm-project#12067)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Support torchrun and SPMD-style offline inference (vllm-project#12071)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [core] LLM.collective_rpc interface and RLHF example (vllm-project#12084)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix max image feature size for Llava-one-vision (vllm-project#12104)

Signed-off-by: Roger Wang <ywang@roblox.com>

* Enable user marker for vllm profiling (#357)

* Enable user marker for vllm profiling

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* [misc] Add LoRA kernel micro benchmarks (vllm-project#11579)

* [Model] Add support for deepseek-vl2-tiny model (vllm-project#12068)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Deepseek V3 support (#364)

* Changing the hard coded datatype to see if it's enough for the model to work

* Picking the upstrteam moe kernel version

* make upstream fix for v3 also works for rocm v2

* Conditional fnuz dtype

* Requantizing from fn to fnuz

* Requantizing moe as well

* Actually requantizing moe weights

* Conditional requantization and assert on padding in block quant

* Format

---------

Co-authored-by: charlifu <charlifu@amd.com>

* [Bugfix] Set enforce_eager automatically for mllama (vllm-project#12127)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Bugfix] Fix a path bug in disaggregated prefill example script. (vllm-project#12121)

Signed-off-by: Kuntai Du <kuntai@uchicago.edu>

* [CI]add genai-perf benchmark in nightly benchmark (vllm-project#10704)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [Doc] Add instructions on using Podman when SELinux is active (vllm-project#12136)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Bugfix] Fix issues in CPU build Dockerfile (vllm-project#12135)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [BugFix] add more `is not None` check in VllmConfig.__post_init__ (vllm-project#12138)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Misc] Add deepseek_vl2 chat template (vllm-project#12143)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [ROCm][MoE] moe tuning support for rocm (vllm-project#12049)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [V1] Move more control of kv cache initialization from model_executor to EngineCore (vllm-project#11960)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

* [Misc][LoRA] Improve the readability of LoRA error messages (vllm-project#12102)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [CI/Build][CPU][Bugfix] Fix CPU CI (vllm-project#12150)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [core] allow callable in collective_rpc (vllm-project#12151)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix score api for missing max_model_len validation (vllm-project#12119)

Signed-off-by: Wallas Santos <wallashss@ibm.com>

* [Bugfix] Mistral tokenizer encode accept list of str (vllm-project#12149)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (vllm-project#12134)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [torch.compile] disable logging when cache is disabled (vllm-project#12043)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [misc] fix cross-node TP (vllm-project#12166)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [AMD][CI/Build][Bugfix] use pytorch stale wheel (vllm-project#12172)

Signed-off-by: hongxyan <hongxyan@amd.com>

* [core] further polish memory profiling (vllm-project#12126)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Docs] Fix broken link in SECURITY.md (vllm-project#12175)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Model] Port deepseek-vl2 processor, remove dependency (vllm-project#12169)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [core] clean up executor class hierarchy between v1 and v0 (vllm-project#12171)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Support register quantization method out-of-tree (vllm-project#11969)

* [V1] Collect env var for usage stats (vllm-project#12115)

* [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (vllm-project#12152)

Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>

* [Bugfix] Fix multi-modal processors for transformers 4.48 (vllm-project#12187)

* [torch.compile] store inductor compiled Python file (vllm-project#12182)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* benchmark_serving support --served-model-name param (vllm-project#12109)

Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

* [Misc] Add BNB support to GLM4-V model (vllm-project#12184)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1] Add V1 support of Qwen2-VL (vllm-project#12128)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model] Support for fairseq2 Llama (vllm-project#11442)

Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>

* [Bugfix] Fix num_heads value for simple connector when tp enabled (vllm-project#12074)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [torch.compile] fix sym_tensor_indices (vllm-project#12191)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* Move linting to `pre-commit` (vllm-project#11975)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [DOC] Fix typo in docstring and assert message (vllm-project#12194)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [DOC] Add missing docstring in LLMEngine.add_request() (vllm-project#12195)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Bugfix] Fix incorrect types in LayerwiseProfileResults (vllm-project#12196)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Model] Add Qwen2 PRM model support (vllm-project#12202)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Core] Interface for accessing model from `VllmRunner` (vllm-project#10353)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [misc] add placeholder format.sh (vllm-project#12206)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [CI/Build] Remove dummy CI steps (vllm-project#12208)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI/Build] Make pre-commit faster (vllm-project#12212)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model] Upgrade Aria to transformers 4.48 (vllm-project#12203)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [misc] print a message to suggest how to bypass commit hooks (vllm-project#12217)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [core][bugfix] configure env var during import vllm (vllm-project#12209)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [V1] Remove `_get_cache_block_size` (vllm-project#12214)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Misc] Pass `attention` to impl backend (vllm-project#12218)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Fix `HfExampleModels.find_hf_info` (vllm-project#12223)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI] Pass local python version explicitly to pre-commit mypy.sh (vllm-project#12224)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* Using ROCm6.3.1 base docker and building hipblas-common (#366)

* [Misc] Update CODEOWNERS (vllm-project#12229)

* fix: update platform detection for M-series arm based MacBook processors (vllm-project#12227)

Signed-off-by: isikhi <huseyin.isik000@gmail.com>

* [misc] add cuda runtime version to usage data (vllm-project#12190)

Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

* [bugfix] catch xgrammar unsupported array constraints (vllm-project#12210)

Signed-off-by: Jason Cheng <jasoncky96@gmail.com>

* [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (vllm-project#12222)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

* Add quantization and guided decoding CODEOWNERS (vllm-project#12228)

Signed-off-by: mgoin <michael@neuralmagic.com>

* [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (vllm-project#11777)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (vllm-project#12230)

Signed-off-by: NickLucche <nlucches@redhat.com>

* [ci/build] disable failed and flaky tests (vllm-project#12240)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (vllm-project#12244)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration  (vllm-project#12237)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Misc] Remove redundant TypeVar from base model (vllm-project#12248)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-project#12252)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [torch.compile] transparent compilation with more logging (vllm-project#12246)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm-project#12259)

Signed-off-by: Roger Wang <ywang@roblox.com>

* Remove pytorch comments for outlines + compressed-tensors (vllm-project#12260)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Platform] improve platforms getattr (vllm-project#12264)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [ci/build] update nightly torch for gh200 test (vllm-project#12270)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] fix race condition that leads to wrong order of token returned (vllm-project#10802)

Signed-off-by: Jannis Schönleber <joennlae@gmail.com>

* [Kernel] fix moe_align_block_size error condition (vllm-project#12239)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

* [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types  (vllm-project#10907)

Signed-off-by: rickyx <rickyx@anyscale.com>

* [Bugfix] Multi-sequence broken (vllm-project#11898)

Signed-off-by: Andy Lo <andy@mistral.ai>

* [Misc] Remove experimental dep from tracing.py (vllm-project#12007)

Signed-off-by: Adrian Cole <adrian.cole@elastic.co>

* [Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-project#12235)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Core] Free CPU pinned memory on environment cleanup (vllm-project#10477)

* Update pre-commit.yml (#374)

* Update pre-commit.yml

* Reapplying missing format

* New codespell exclude location

---------

Co-authored-by: Kevin H. Luu <kevin@anyscale.com>

* [bugfix] moe tuning. rm is_navi() (vllm-project#12273)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (vllm-project#12277)

Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: maleksan85 <maleksan@amd.com>

* [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (vllm-project#12281)

Signed-off-by: Hongxia Yang <hongxyan@amd.com>

* [VLM] Simplify post-processing of replacement info (vllm-project#12269)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ci/lint] Add back default arg for pre-commit (vllm-project#12279)

Signed-off-by: kevin <kevin@anyscale.com>

* [CI] add docker volume prune to neuron CI (vllm-project#12291)

Signed-off-by: Liangfu Chen <liangfc@amazon.com>

* [Ci/Build] Fix mypy errors on main (vllm-project#12296)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (vllm-project#12288)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [core] separate builder init and builder prepare for each batch (vllm-project#12253)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Build] update requirements of no-device (vllm-project#12299)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [Core] Support fully transparent sleep mode (vllm-project#11743)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [VLM] Avoid unnecessary tokenization (vllm-project#12310)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model][Bugfix]: correct Aria model output (vllm-project#12309)

Signed-off-by: xffxff <1247714429@qq.com>

* [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (vllm-project#12313)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Doc] Add docs for prompt replacement (vllm-project#12318)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Fix the error in the tip for the --lora-modules parameter (vllm-project#12319)

Signed-off-by: wangerxiao <863579016@qq.com>

* [Misc]  Improve the readability of BNB error messages  (vllm-project#12320)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367)

* switching detokenize flag to be False

* detokenize = False for benchmarks

* restoring default in main vllm code for detokenize

* removing extra spaces

* moving detokenize to flag

* adding support for token ids

---------

Co-authored-by: maleksan85 <maleksan@amd.com>

* [Bugfix] Fix HPU multiprocessing executor (vllm-project#12167)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Core] Support `reset_prefix_cache` (vllm-project#12284)

* [Frontend][V1] Online serving performance improvements (vllm-project#12287)

* [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (vllm-project#12282)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* FP8 FA fixes (#381)

* FP8 FA fixes

Summary:
Add missing clamp and fix reciprocal scale computation.

* linter

* Returning the use of the proper stream in allreduce (#382)

* [Bugfix] Fixing  AMD LoRA CI test. (vllm-project#12329)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>

* [Docs] Update FP8 KV Cache documentation (vllm-project#12238)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Docs] Document vulnerability disclosure process (vllm-project#12326)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [V1] Add `uncache_blocks` (vllm-project#12333)

* [doc] explain common errors around torch.compile (vllm-project#12340)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (vllm-project#12338)

Signed-off-by: zhenwei <zhenweiliu@habana.ai>

* [Bugfix] Fix k_proj's bias for whisper self attention (vllm-project#12342)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Kernel] Flash Attention 3 Support (vllm-project#12093)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Doc] Troubleshooting errors during model inspection (vllm-project#12351)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1] Simplify M-RoPE (vllm-project#12352)

Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: imkero <kerorek@outlook.com>

* [Bugfix] Fix broken internvl2 inference with v1 (vllm-project#12360)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [core] add wake_up doc and some sanity check (vllm-project#12361)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [torch.compile] decouple compile sizes and cudagraph sizes (vllm-project#12243)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [FP8][Kernel] Dynamic kv cache scaling factors computation (vllm-project#11906)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>

* [TPU] Update TPU CI to use torchxla nightly on 20250122 (vllm-project#12334)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Docs] Document Phi-4 support (vllm-project#12362)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order  (vllm-project#11528)

Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (vllm-project#12357)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Docs] Add meetup slides (vllm-project#12345)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* Using pytorch commit past the point when rowwise PR (pytorch/pytorch#144432) was merged (#384)

* [Docs] Update spec decode + structured output in compat matrix (vllm-project#12373)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [V1][Frontend] Coalesce bunched `RequestOutput`s (vllm-project#12298)

Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>

* Set weights_only=True when using torch.load() (vllm-project#12366)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Bugfix] Path join when building local path for S3 clone (vllm-project#12353)

Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>

* Update compressed-tensors version (vllm-project#12367)

* [V1] Increase default batch size for H100/H200 (vllm-project#12369)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [perf] fix perf regression from vllm-project#12253 (vllm-project#12380)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [ci/build] fix wheel size check (vllm-project#12396)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382)

* [ci/build] sync default value for wheel size (vllm-project#12398)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Enable proxy support in benchmark script (vllm-project#12356)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Applying scales rename to fp8 config (#387)

* [Misc] Remove deprecated code (vllm-project#12383)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Dev-docker Documentation Updates (#378)

* Dev-docker Documentation Updates

Minor updates to several sections, with links to other documents where appropriate.

* Fix formatting of GEMM filename

* README cleanup

- Reorder some sections of the README to make them easier to follow
- Improve formatting of bash commands
- Prefer use of huggingface model names instead of hard-coded directories
- Clean up wording

* Expanded sample commands for Latency and Throughput

* Fix markdown links

* Fix pre-commit errors

* Updates from review

Initial updates to incorporate feedback from a review session held with @t-parry

* Update script args to match current recommendations

* Remove recommended max-num-seqs values for now

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413)

* [Bugfix] Fix BLIP-2 processing (vllm-project#12412)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094)

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

* [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445)

* [Frontend] generation_config.json for  maximum tokens(vllm-project#12242)

Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450)

* [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Build/CI] Fix libcuda.so linkage (vllm-project#12424)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [Frontend] Rerank API (Jina- and Cohere-compatible API)  (vllm-project#12376)

Signed-off-by: Kyle Mistele <kyle@mistele.com>

* [DOC] Add link to vLLM blog (vllm-project#12460)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [V1] Avoid list creation in input preparation (vllm-project#12457)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Frontend] Support scores endpoint in run_batch (vllm-project#12430)

Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>

* [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-project#12339)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [V1][Metrics] Add initial Prometheus logger (vllm-project#12416)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* [V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#12469)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [FlashInfer] Upgrade to 0.2.0 (vllm-project#11194)

Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

* Support FP8 FA from Quark format (#388)

* Support FP8 FA from Quark format

* Support FP8 FA from Quark format

* nit: update comment

* Direct call on ROCm

* 20250127 docs update (#392)

* updating code blocks

* typo

* updated manifest

* Including feedback

* whitespace

* Deepseek instructions

* hyperlink fix

* hyperlink fix

* updating what is new

* cpx update

* typo

* whitespace

* whitespace

* Faster Custom Paged Attention kernels (#372)

* integrate new cpa kernel, update tests and benchmark

* added comments to mfma4 kernel

* further comments for mfma16 kernel

* clang-format

* Lint

* add flag for logits rtz conversion and disable by default

* lint

* [Bugfix]: Fix paged attention unit tests of #372 (#389)

* [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`,`csrc/attention/paged_attention_v2.cu` and  `csrc/rocm/attention.cu`.

* improve code documentation.

* lint

---------

Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

* Using a more precise profiling on ROCm to properly account for weights padding (#394)

* Update Dockerfile.rocm

* [Bugfix]: inclucde the env variables required for running FastSyncLLM

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* fix pre-commit lint

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

---------

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: yan ma <yan.ma@intel.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Signed-off-by: Ye Qi <yeq@meta.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Fred Reiss <frreiss@us.ibm.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: hongxyan <hongxyan@amd.com>
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Signed-off-by: Martin Gleize <mgleize@meta.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: isikhi <huseyin.isik000@gmail.com>
Signed-off-by: Jason Cheng <jasoncky96@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
Signed-off-by: maleksan85 <maleksan@amd.com>
Signed-off-by: Hongxia Yang <hongxyan@amd.com>
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: wangerxiao <863579016@qq.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: zhenwei <zhenweiliu@habana.ai>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Kyle Mistele <kyle@mistele.com>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Alexei V. Ivanov <alivanov@banff-cyxtera-s65-4.amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: YiSheng5 <yi.sheng@intel.com>
Co-authored-by: Zhonghua Deng <abatom@163.com>
Co-authored-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuan <yuan.zhou@intel.com>
Co-authored-by: jiangjiadi <34134495+jiangjiadi@users.noreply.github.com>
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com>
Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com>
Co-authored-by: Jie Fu (傅杰) <jiefu@tencent.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: WangErXiao <863579016@qq.com>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Wallas Henrique <wallashss@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com>
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Charles Frye <cfrye59@gmail.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: cennn <61925104+cennn@users.noreply.github.com>
Co-authored-by: Kuntai Du <kuntai@uchicago.edu>
Co-authored-by: minmin <rmm0811@gmail.com>
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Fred Reiss <frreiss@us.ibm.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Keyun Tong <tongkeyun@gmail.com>
Co-authored-by: RunningLeon <maningsheng@sensetime.com>
Co-authored-by: kewang-xlnx <73578509+kewang-xlnx@users.noreply.github.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: tvirolai-amd <teemu.virolainen@amd.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: charlifu <charlifu@amd.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: yancong <32220263+ice-tong@users.noreply.github.com>
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
Co-authored-by: gujing <925973396@qq.com>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: Işık <41375111+isikhi@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cheng Kuan Yong Jason <jasoncky96@gmail.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Jannis Schönleber <joennlae@gmail.com>
Co-authored-by: Ricky Xu <xuchen727@hotmail.com>
Co-authored-by: Andy Lo <andylolu24@gmail.com>
Co-authored-by: Adrian Cole <64215+codefromthecrypt@users.noreply.github.com>
Co-authored-by: Jani Monoses <jani.monoses@gmail.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: maleksan85 <maleksan@amd.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: zhou fan <1247714429@qq.com>
Co-authored-by: ilia-cher <30845429+ilia-cher@users.noreply.github.com>
Co-authored-by: liuzhenwei <zhenweiliu@habana.ai>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: omer-dayan <omer@run.ai>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com>
Co-authored-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Co-authored-by: Kyle Mistele <kyle@mistele.com>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Co-authored-by: sanyalington <shomy.sanyal@amd.com>
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
hongxiayang added a commit to ROCm/vllm that referenced this pull request Feb 5, 2025
* [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100)

Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764)

* [V1][Core][1/n] Logging and Metrics (vllm-project#11962)

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

* [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685)

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973)

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

* [MISC] fix typo in kv transfer send recv test (vllm-project#11983)

* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979)

* [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972)

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947)

Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>

* [Misc]Minor Changes about Worker (vllm-project#11555)

Signed-off-by: Chenguang Li <757486878@qq.com>

* [platform] add ray_device_key (vllm-project#11948)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980)

Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>

* [Kernel] unified_attention for Attention.forward (vllm-project#11967)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998)

Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

* [Doc] Organise installation documentation into categories and tabs (vllm-project#11935)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [platform] add device_control env var (vllm-project#12009)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516)

Signed-off-by: Shanshan Shen <467638484@qq.com>

* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982)

Signed-off-by: elijah <f1renze.142857@gmail.com>

* [Doc] Fix build from source and installation link in README.md (vllm-project#12013)

Signed-off-by: Yikun <yikunkero@gmail.com>

* Using list

* [Bugfix] Fix deepseekv3 gate bias error (vllm-project#12002)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* Revert "[misc] improve memory profiling (vllm-project#11809)"

This reverts commit 889e662.

* Multi-lingual P3L (#356)

* Committing the *multilingual* P3L test.

* Created a *multi-lingual* P3L test.

* Making ruff happy.

* .

* Added a reference to the language-scripture Confluence table.

* Typo fixing.

* Harmonizing naming.

* Fixing comments in the header.

---------

Co-authored-by: Alexei V. Ivanov <alivanov@banff-cyxtera-s65-4.amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* Trying to make scales work with compileable attention

* [Docs] Add Sky Computing Lab to project intro (vllm-project#12019)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [HPU][Bugfix] set_forward_context and CI test execution (vllm-project#12014)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Doc] Update Quantization Hardware Support Documentation (vllm-project#12025)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [HPU][misc] add comments for explanation (vllm-project#12034)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix various bugs in multi-modal processor (vllm-project#12031)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Kernel] Revert the API change of Attention.forward (vllm-project#12038)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Platform] Add output for Attention Backend (vllm-project#11981)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (vllm-project#12040)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* Explain where the engine args go when using Docker (vllm-project#12041)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* Docs lint

* [Doc]: Update the Json Example of the `Engine Arguments` document (vllm-project#12045)

* [Misc]  Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (vllm-project#11924)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Kernel] Support MulAndSilu (vllm-project#11624)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (vllm-project#12046)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Platform] move current_memory_usage() into platform (vllm-project#11369)

Signed-off-by: Shanshan Shen <467638484@qq.com>

* [V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Misc] Add multi-step chunked-prefill support for FlashInfer (vllm-project#10467)

* [core] Turn off GPU communication overlap for Ray executor (vllm-project#12051)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

* [core] platform agnostic executor via collective_rpc (vllm-project#11256)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-project#12062)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (vllm-project#12003)

* Fix: cases with empty sparsity config (vllm-project#12057)

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

* Type-fix: make execute_model output type optional (vllm-project#12020)

* [Platform] Do not raise error if _Backend is not found (vllm-project#12023)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>

* [Model]: Support internlm3 (vllm-project#12037)

* Misc: allow to use proxy in `HTTPConnection` (vllm-project#12042)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765)

Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [Doc]: Update `OpenAI-Compatible Server` documents (vllm-project#12082)

* [Bugfix] use right truncation for non-generative tasks (vllm-project#12050)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

* [V1][Core] Autotune encoder cache budget (vllm-project#11895)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090)

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

* Allow hip sources to be directly included when compiling for rocm. (vllm-project#12087)

* [Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* [Doc] Add documentation for specifying model architecture (vllm-project#12105)

* Various cosmetic/comment fixes (vllm-project#12089)

Signed-off-by: mgoin <michael@neuralmagic.com>

* [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (vllm-project#12067)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Support torchrun and SPMD-style offline inference (vllm-project#12071)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [core] LLM.collective_rpc interface and RLHF example (vllm-project#12084)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix max image feature size for Llava-one-vision (vllm-project#12104)

Signed-off-by: Roger Wang <ywang@roblox.com>

* Enable user marker for vllm profiling (#357)

* Enable user marker for vllm profiling

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* [misc] Add LoRA kernel micro benchmarks (vllm-project#11579)

* [Model] Add support for deepseek-vl2-tiny model (vllm-project#12068)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Deepseek V3 support (#364)

* Changing the hard coded datatype to see if it's enough for the model to work

* Picking the upstream moe kernel version

* make upstream fix for v3 also work for rocm v2

* Conditional fnuz dtype

* Requantizing from fn to fnuz

* Requantizing moe as well

* Actually requantizing moe weights

* Conditional requantization and assert on padding in block quant

* Format

---------

Co-authored-by: charlifu <charlifu@amd.com>
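
The fn→fnuz bullets above refer to remapping OCP `float8_e4m3fn` checkpoints to the `float8_e4m3fnuz` format used on MI300 hardware. A minimal sketch of the general trick, assuming a recent PyTorch; the helper name and exact call sites in the commit may differ:

```python
import torch

def requantize_fn_to_fnuz(weight_fn: torch.Tensor, scale: torch.Tensor):
    """Reinterpret e4m3fn weights as e4m3fnuz without changing weight * scale.

    e4m3fnuz uses an exponent bias of 8 instead of 7, so the same bit pattern
    decodes to half the e4m3fn value; doubling the scale compensates.
    """
    bits = weight_fn.view(torch.int8)
    # Bit pattern 0x80 is -0.0 in e4m3fn but NaN in e4m3fnuz; map it to zero.
    bits = torch.where(bits == -128, torch.zeros_like(bits), bits)
    weight_fnuz = bits.view(torch.float8_e4m3fnuz)
    return weight_fnuz, scale * 2.0
```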

* [Bugfix] Set enforce_eager automatically for mllama (vllm-project#12127)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Bugfix] Fix a path bug in disaggregated prefill example script. (vllm-project#12121)

Signed-off-by: Kuntai Du <kuntai@uchicago.edu>

* [CI]add genai-perf benchmark in nightly benchmark (vllm-project#10704)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [Doc] Add instructions on using Podman when SELinux is active (vllm-project#12136)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Bugfix] Fix issues in CPU build Dockerfile (vllm-project#12135)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [BugFix] add more `is not None` check in VllmConfig.__post_init__ (vllm-project#12138)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Misc] Add deepseek_vl2 chat template (vllm-project#12143)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [ROCm][MoE] moe tuning support for rocm (vllm-project#12049)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [V1] Move more control of kv cache initialization from model_executor to EngineCore (vllm-project#11960)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

* [Misc][LoRA] Improve the readability of LoRA error messages (vllm-project#12102)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [CI/Build][CPU][Bugfix] Fix CPU CI (vllm-project#12150)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [core] allow callable in collective_rpc (vllm-project#12151)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix score api for missing max_model_len validation (vllm-project#12119)

Signed-off-by: Wallas Santos <wallashss@ibm.com>

* [Bugfix] Mistral tokenizer encode accept list of str (vllm-project#12149)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (vllm-project#12134)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [torch.compile] disable logging when cache is disabled (vllm-project#12043)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [misc] fix cross-node TP (vllm-project#12166)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [AMD][CI/Build][Bugfix] use pytorch stale wheel (vllm-project#12172)

Signed-off-by: hongxyan <hongxyan@amd.com>

* [core] further polish memory profiling (vllm-project#12126)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Docs] Fix broken link in SECURITY.md (vllm-project#12175)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Model] Port deepseek-vl2 processor, remove dependency (vllm-project#12169)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [core] clean up executor class hierarchy between v1 and v0 (vllm-project#12171)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Support register quantization method out-of-tree (vllm-project#11969)

* [V1] Collect env var for usage stats (vllm-project#12115)

* [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (vllm-project#12152)

Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>

* [Bugfix] Fix multi-modal processors for transformers 4.48 (vllm-project#12187)

* [torch.compile] store inductor compiled Python file (vllm-project#12182)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* benchmark_serving support --served-model-name param (vllm-project#12109)

Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

* [Misc] Add BNB support to GLM4-V model (vllm-project#12184)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1] Add V1 support of Qwen2-VL (vllm-project#12128)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model] Support for fairseq2 Llama (vllm-project#11442)

Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>

* [Bugfix] Fix num_heads value for simple connector when tp enabled (vllm-project#12074)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [torch.compile] fix sym_tensor_indices (vllm-project#12191)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* Move linting to `pre-commit` (vllm-project#11975)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [DOC] Fix typo in docstring and assert message (vllm-project#12194)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [DOC] Add missing docstring in LLMEngine.add_request() (vllm-project#12195)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Bugfix] Fix incorrect types in LayerwiseProfileResults (vllm-project#12196)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Model] Add Qwen2 PRM model support (vllm-project#12202)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Core] Interface for accessing model from `VllmRunner` (vllm-project#10353)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [misc] add placeholder format.sh (vllm-project#12206)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [CI/Build] Remove dummy CI steps (vllm-project#12208)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI/Build] Make pre-commit faster (vllm-project#12212)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model] Upgrade Aria to transformers 4.48 (vllm-project#12203)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [misc] print a message to suggest how to bypass commit hooks (vllm-project#12217)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [core][bugfix] configure env var during import vllm (vllm-project#12209)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [V1] Remove `_get_cache_block_size` (vllm-project#12214)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Misc] Pass `attention` to impl backend (vllm-project#12218)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Fix `HfExampleModels.find_hf_info` (vllm-project#12223)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI] Pass local python version explicitly to pre-commit mypy.sh (vllm-project#12224)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* Using ROCm6.3.1 base docker and building hipblas-common (#366)

* [Misc] Update CODEOWNERS (vllm-project#12229)

* fix: update platform detection for M-series arm based MacBook processors (vllm-project#12227)

Signed-off-by: isikhi <huseyin.isik000@gmail.com>

* [misc] add cuda runtime version to usage data (vllm-project#12190)

Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

* [bugfix] catch xgrammar unsupported array constraints (vllm-project#12210)

Signed-off-by: Jason Cheng <jasoncky96@gmail.com>

* [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (vllm-project#12222)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

* Add quantization and guided decoding CODEOWNERS (vllm-project#12228)

Signed-off-by: mgoin <michael@neuralmagic.com>

* [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (vllm-project#11777)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (vllm-project#12230)

Signed-off-by: NickLucche <nlucches@redhat.com>

* [ci/build] disable failed and flaky tests (vllm-project#12240)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (vllm-project#12244)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration  (vllm-project#12237)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Misc] Remove redundant TypeVar from base model (vllm-project#12248)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-project#12252)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [torch.compile] transparent compilation with more logging (vllm-project#12246)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm-project#12259)

Signed-off-by: Roger Wang <ywang@roblox.com>

* Remove pytorch comments for outlines + compressed-tensors (vllm-project#12260)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Platform] improve platforms getattr (vllm-project#12264)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [ci/build] update nightly torch for gh200 test (vllm-project#12270)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] fix race condition that leads to wrong order of token returned (vllm-project#10802)

Signed-off-by: Jannis Schönleber <joennlae@gmail.com>

* [Kernel] fix moe_align_block_size error condition (vllm-project#12239)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

* [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types  (vllm-project#10907)

Signed-off-by: rickyx <rickyx@anyscale.com>

* [Bugfix] Multi-sequence broken (vllm-project#11898)

Signed-off-by: Andy Lo <andy@mistral.ai>

* [Misc] Remove experimental dep from tracing.py (vllm-project#12007)

Signed-off-by: Adrian Cole <adrian.cole@elastic.co>

* [Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-project#12235)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Core] Free CPU pinned memory on environment cleanup (vllm-project#10477)

* Update pre-commit.yml (#374)

* Update pre-commit.yml

* Reapplying missing format

* New codespell exclude location

---------

Co-authored-by: Kevin H. Luu <kevin@anyscale.com>

* [bugfix] moe tuning. rm is_navi() (vllm-project#12273)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (vllm-project#12277)

Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: maleksan85 <maleksan@amd.com>

* [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (vllm-project#12281)

Signed-off-by: Hongxia Yang <hongxyan@amd.com>

* [VLM] Simplify post-processing of replacement info (vllm-project#12269)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ci/lint] Add back default arg for pre-commit (vllm-project#12279)

Signed-off-by: kevin <kevin@anyscale.com>

* [CI] add docker volume prune to neuron CI (vllm-project#12291)

Signed-off-by: Liangfu Chen <liangfc@amazon.com>

* [Ci/Build] Fix mypy errors on main (vllm-project#12296)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (vllm-project#12288)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [core] separate builder init and builder prepare for each batch (vllm-project#12253)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Build] update requirements of no-device (vllm-project#12299)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [Core] Support fully transparent sleep mode (vllm-project#11743)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [VLM] Avoid unnecessary tokenization (vllm-project#12310)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model][Bugfix]: correct Aria model output (vllm-project#12309)

Signed-off-by: xffxff <1247714429@qq.com>

* [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (vllm-project#12313)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Doc] Add docs for prompt replacement (vllm-project#12318)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Fix the error in the tip for the --lora-modules parameter (vllm-project#12319)

Signed-off-by: wangerxiao <863579016@qq.com>

* [Misc]  Improve the readability of BNB error messages  (vllm-project#12320)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367)

* switching detokenize flag to be False

* detokenize = False for benchmarks

* restoring default in main vllm code for detokenize

* removing extra spaces

* moving detokenize to flag

* adding support for token ids

---------

Co-authored-by: maleksan85 <maleksan@amd.com>
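
For reference, the detokenize flag mentioned in these bullets is exposed on `SamplingParams` in upstream vLLM; a minimal usage sketch (the model name and token IDs are placeholders, and the benchmark-script wiring in this commit may differ):

```python
from vllm import LLM, SamplingParams

# Skip tokenizer setup entirely and ask for raw token IDs back.
llm = LLM(model="facebook/opt-125m", skip_tokenizer_init=True)
params = SamplingParams(max_tokens=32, detokenize=False)

# With skip_tokenizer_init=True, prompts must be given as token IDs.
outputs = llm.generate({"prompt_token_ids": [1, 15, 27, 49]}, params)
print(outputs[0].outputs[0].token_ids)  # no output text is produced
```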

* [Bugfix] Fix HPU multiprocessing executor (vllm-project#12167)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Core] Support `reset_prefix_cache` (vllm-project#12284)

* [Frontend][V1] Online serving performance improvements (vllm-project#12287)

* [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (vllm-project#12282)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* FP8 FA fixes (#381)

* FP8 FA fixes

Summary:
Add missing clamp and fix reciprocal scale computation.

* linter
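
For context, the clamp/reciprocal-scale fix above is the usual pattern when casting attention inputs to FP8: scale, clamp to the representable range, and keep the reciprocal pairing consistent for dequantization. A generic per-tensor sketch, not the kernel code touched by this commit; the dtype and range are assumptions for ROCm's e4m3fnuz:

```python
import torch

FP8_DTYPE = torch.float8_e4m3fnuz        # assumed ROCm FP8 type
FP8_MAX = torch.finfo(FP8_DTYPE).max     # ~240 for fnuz, 448 for e4m3fn

def quantize_per_tensor_fp8(x: torch.Tensor):
    scale = x.abs().amax() / FP8_MAX     # dequantization scale
    inv_scale = 1.0 / scale              # reciprocal applied at quantization time
    # Clamp before the cast so out-of-range values saturate to +/-FP8_MAX
    # instead of overflowing.
    x_q = (x * inv_scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return x_q, scale                    # dequant: x_q.float() * scale
```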

* Returning the use of the proper stream in allreduce (#382)

* [Bugfix] Fixing  AMD LoRA CI test. (vllm-project#12329)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>

* [Docs] Update FP8 KV Cache documentation (vllm-project#12238)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Docs] Document vulnerability disclosure process (vllm-project#12326)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [V1] Add `uncache_blocks` (vllm-project#12333)

* [doc] explain common errors around torch.compile (vllm-project#12340)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (vllm-project#12338)

Signed-off-by: zhenwei <zhenweiliu@habana.ai>

* [Bugfix] Fix k_proj's bias for whisper self attention (vllm-project#12342)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Kernel] Flash Attention 3 Support (vllm-project#12093)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Doc] Troubleshooting errors during model inspection (vllm-project#12351)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1] Simplify M-RoPE (vllm-project#12352)

Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: imkero <kerorek@outlook.com>

* [Bugfix] Fix broken internvl2 inference with v1 (vllm-project#12360)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [core] add wake_up doc and some sanity check (vllm-project#12361)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [torch.compile] decouple compile sizes and cudagraph sizes (vllm-project#12243)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [FP8][Kernel] Dynamic kv cache scaling factors computation (vllm-project#11906)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>

* [TPU] Update TPU CI to use torchxla nightly on 20250122 (vllm-project#12334)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Docs] Document Phi-4 support (vllm-project#12362)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order  (vllm-project#11528)

Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (vllm-project#12357)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Docs] Add meetup slides (vllm-project#12345)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* Using pytorch commit past the point when rowwise PR (pytorch/pytorch#144432) was merged (#384)

* [Docs] Update spec decode + structured output in compat matrix (vllm-project#12373)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [V1][Frontend] Coalesce bunched `RequestOutput`s (vllm-project#12298)

Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>

* Set weights_only=True when using torch.load() (vllm-project#12366)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Bugfix] Path join when building local path for S3 clone (vllm-project#12353)

Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>

* Update compressed-tensors version (vllm-project#12367)

* [V1] Increase default batch size for H100/H200 (vllm-project#12369)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [perf] fix perf regression from vllm-project#12253 (vllm-project#12380)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [ci/build] fix wheel size check (vllm-project#12396)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382)

* [ci/build] sync default value for wheel size (vllm-project#12398)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Enable proxy support in benchmark script (vllm-project#12356)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Applying scales rename to fp8 config (#387)

* [Misc] Remove deprecated code (vllm-project#12383)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Dev-docker Documentation Updates (#378)

* Dev-docker Documentation Updates

Minor updates to several sections, with links to other documents where appropriate.

* Fix formatting of GEMM filename

* README cleanup

- Reorder some sections of the README to make them easier to follow
- Improve formatting of bash commands
- Prefer use of huggingface model names instead of hard-coded directories
- Clean up wording

* Expanded sample commands for Latency and Throughput

* Fix markdown links

* Fix pre-commit errors

* Updates from review

Initial updates to incorporate feedback from a review session held with @t-parry

* Update script args to match current recommendations

* Remove recommended max-num-seqs values for now

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413)

* [Bugfix] Fix BLIP-2 processing (vllm-project#12412)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094)

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

* [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445)

* [Frontend] generation_config.json for maximum tokens (vllm-project#12242)

Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450)

* [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Build/CI] Fix libcuda.so linkage (vllm-project#12424)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [Frontend] Rerank API (Jina- and Cohere-compatible API)  (vllm-project#12376)

Signed-off-by: Kyle Mistele <kyle@mistele.com>

* [DOC] Add link to vLLM blog (vllm-project#12460)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [V1] Avoid list creation in input preparation (vllm-project#12457)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Frontend] Support scores endpoint in run_batch (vllm-project#12430)

Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>

* [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-project#12339)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [V1][Metrics] Add initial Prometheus logger (vllm-project#12416)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* [V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#12469)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [FlashInfer] Upgrade to 0.2.0 (vllm-project#11194)

Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

* Support FP8 FA from Quark format (#388)

* Support FP8 FA from Quark format

* Support FP8 FA from Quark format

* nit: update comment

* Direct call on ROCm

* 20250127 docs update (#392)

* updating code blocks

* typo

* updated manifest

* Including feedback

* whitespace

* Deepseek instructions

* hyperlink fix

* hyperlink fix

* updating what is new

* cpx update

* typo

* whitespace

* whitespace

* Faster Custom Paged Attention kernels (#372)

* integrate new cpa kernel, update tests and benchmark

* added comments to mfma4 kernel

* further comments for mfma16 kernel

* clang-format

* Lint

* add flag for logits rtz conversion and disable by default

* lint

* [Bugfix]: Fix paged attention unit tests of #372 (#389)

* [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`,`csrc/attention/paged_attention_v2.cu` and  `csrc/rocm/attention.cu`.

* improve code documentation.

* lint

---------

Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

* Using a more precise profiling on ROCm to properly account for weights padding (#394)

* Update Dockerfile.rocm

* [Bugfix]: include the env variables required for running FastSyncLLM

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* fix pre-commit lint

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* [Bugfix] included missing environment variable

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

---------

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: yan ma <yan.ma@intel.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Signed-off-by: Ye Qi <yeq@meta.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Fred Reiss <frreiss@us.ibm.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: hongxyan <hongxyan@amd.com>
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Signed-off-by: Martin Gleize <mgleize@meta.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: isikhi <huseyin.isik000@gmail.com>
Signed-off-by: Jason Cheng <jasoncky96@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
Signed-off-by: maleksan85 <maleksan@amd.com>
Signed-off-by: Hongxia Yang <hongxyan@amd.com>
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: wangerxiao <863579016@qq.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: zhenwei <zhenweiliu@habana.ai>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Kyle Mistele <kyle@mistele.com>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Alexei V. Ivanov <alivanov@banff-cyxtera-s65-4.amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: YiSheng5 <yi.sheng@intel.com>
Co-authored-by: Zhonghua Deng <abatom@163.com>
Co-authored-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuan <yuan.zhou@intel.com>
Co-authored-by: jiangjiadi <34134495+jiangjiadi@users.noreply.github.com>
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com>
Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com>
Co-authored-by: Jie Fu (傅杰) <jiefu@tencent.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: WangErXiao <863579016@qq.com>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Wallas Henrique <wallashss@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com>
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Charles Frye <cfrye59@gmail.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: cennn <61925104+cennn@users.noreply.github.com>
Co-authored-by: Kuntai Du <kuntai@uchicago.edu>
Co-authored-by: minmin <rmm0811@gmail.com>
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Fred Reiss <frreiss@us.ibm.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Keyun Tong <tongkeyun@gmail.com>
Co-authored-by: RunningLeon <maningsheng@sensetime.com>
Co-authored-by: kewang-xlnx <73578509+kewang-xlnx@users.noreply.github.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: tvirolai-amd <teemu.virolainen@amd.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: charlifu <charlifu@amd.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: yancong <32220263+ice-tong@users.noreply.github.com>
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
Co-authored-by: gujing <925973396@qq.com>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: Işık <41375111+isikhi@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cheng Kuan Yong Jason <jasoncky96@gmail.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Jannis Schönleber <joennlae@gmail.com>
Co-authored-by: Ricky Xu <xuchen727@hotmail.com>
Co-authored-by: Andy Lo <andylolu24@gmail.com>
Co-authored-by: Adrian Cole <64215+codefromthecrypt@users.noreply.github.com>
Co-authored-by: Jani Monoses <jani.monoses@gmail.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: maleksan85 <maleksan@amd.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: zhou fan <1247714429@qq.com>
Co-authored-by: ilia-cher <30845429+ilia-cher@users.noreply.github.com>
Co-authored-by: liuzhenwei <zhenweiliu@habana.ai>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: omer-dayan <omer@run.ai>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com>
Co-authored-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Co-authored-by: Kyle Mistele <kyle@mistele.com>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Co-authored-by: sanyalington <shomy.sanyal@amd.com>
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
hongxiayang added a commit to ROCm/vllm that referenced this pull request Feb 19, 2025
* [Doc] Update Quantization Hardware Support Documentation (vllm-project#12025)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [HPU][misc] add comments for explanation (vllm-project#12034)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix various bugs in multi-modal processor (vllm-project#12031)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Kernel] Revert the API change of Attention.forward (vllm-project#12038)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Platform] Add output for Attention Backend (vllm-project#11981)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (vllm-project#12040)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* Explain where the engine args go when using Docker (vllm-project#12041)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* Docs lint

* [Doc]: Update the Json Example of the `Engine Arguments` document (vllm-project#12045)

* [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (vllm-project#11924)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Kernel] Support MulAndSilu (vllm-project#11624)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (vllm-project#12046)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Platform] move current_memory_usage() into platform (vllm-project#11369)

Signed-off-by: Shanshan Shen <467638484@qq.com>

* [V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Misc] Add multipstep chunked-prefill support for FlashInfer (vllm-project#10467)

* [core] Turn off GPU communication overlap for Ray executor (vllm-project#12051)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

* [core] platform agnostic executor via collective_rpc (vllm-project#11256)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-project#12062)

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

* [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (vllm-project#12003)

* Fix: cases with empty sparsity config (vllm-project#12057)

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

* Type-fix: make execute_model output type optional (vllm-project#12020)

* [Platform] Do not raise error if _Backend is not found (vllm-project#12023)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>

* [Model]: Support internlm3 (vllm-project#12037)

* Misc: allow to use proxy in `HTTPConnection` (vllm-project#12042)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765)

Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [Doc]: Update `OpenAI-Compatible Server` documents (vllm-project#12082)

* [Bugfix] use right truncation for non-generative tasks (vllm-project#12050)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

* [V1][Core] Autotune encoder cache budget (vllm-project#11895)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090)

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

* Allow hip sources to be directly included when compiling for rocm. (vllm-project#12087)

* [Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* [Doc] Add documentation for specifying model architecture (vllm-project#12105)

* Various cosmetic/comment fixes (vllm-project#12089)

Signed-off-by: mgoin <michael@neuralmagic.com>

* [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (vllm-project#12067)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Support torchrun and SPMD-style offline inference (vllm-project#12071)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [core] LLM.collective_rpc interface and RLHF example (vllm-project#12084)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix max image feature size for Llava-one-vision (vllm-project#12104)

Signed-off-by: Roger Wang <ywang@roblox.com>

* Enable user marker for vllm profiling (#357)

* Enable user marker for vllm profiling

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* [misc] Add LoRA kernel micro benchmarks (vllm-project#11579)

* [Model] Add support for deepseek-vl2-tiny model (vllm-project#12068)

Signed-off-by: Isotr0py <2037008807@qq.com>

* Deepseek V3 support (#364)

* Changing the hard coded datatype to see if it's enough for the model to work

* Picking the upstream moe kernel version

* make upstream fix for v3 also work for rocm v2

* Conditional fnuz dtype

* Requantizing from fn to fnuz

* Requantizing moe as well

* Actually requantizing moe weights

* Conditional requantization and assert on padding in block quant

* Format

---------

Co-authored-by: charlifu <charlifu@amd.com>

* [Bugfix] Set enforce_eager automatically for mllama (vllm-project#12127)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Bugfix] Fix a path bug in disaggregated prefill example script. (vllm-project#12121)

Signed-off-by: Kuntai Du <kuntai@uchicago.edu>

* [CI]add genai-perf benchmark in nightly benchmark (vllm-project#10704)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [Doc] Add instructions on using Podman when SELinux is active (vllm-project#12136)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Bugfix] Fix issues in CPU build Dockerfile (vllm-project#12135)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [BugFix] add more `is not None` check in VllmConfig.__post_init__ (vllm-project#12138)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Misc] Add deepseek_vl2 chat template (vllm-project#12143)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [ROCm][MoE] moe tuning support for rocm (vllm-project#12049)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [V1] Move more control of kv cache initialization from model_executor to EngineCore (vllm-project#11960)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

* [Misc][LoRA] Improve the readability of LoRA error messages (vllm-project#12102)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [CI/Build][CPU][Bugfix] Fix CPU CI (vllm-project#12150)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [core] allow callable in collective_rpc (vllm-project#12151)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] Fix score api for missing max_model_len validation (vllm-project#12119)

Signed-off-by: Wallas Santos <wallashss@ibm.com>

* [Bugfix] Mistral tokenizer encode accept list of str (vllm-project#12149)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (vllm-project#12134)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [torch.compile] disable logging when cache is disabled (vllm-project#12043)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [misc] fix cross-node TP (vllm-project#12166)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [AMD][CI/Build][Bugfix] use pytorch stale wheel (vllm-project#12172)

Signed-off-by: hongxyan <hongxyan@amd.com>

* [core] further polish memory profiling (vllm-project#12126)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Docs] Fix broken link in SECURITY.md (vllm-project#12175)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Model] Port deepseek-vl2 processor, remove dependency (vllm-project#12169)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [core] clean up executor class hierarchy between v1 and v0 (vllm-project#12171)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Support register quantization method out-of-tree (vllm-project#11969)

* [V1] Collect env var for usage stats (vllm-project#12115)

* [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (vllm-project#12152)

Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>

* [Bugfix] Fix multi-modal processors for transformers 4.48 (vllm-project#12187)

* [torch.compile] store inductor compiled Python file (vllm-project#12182)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* benchmark_serving support --served-model-name param (vllm-project#12109)

Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

* [Misc] Add BNB support to GLM4-V model (vllm-project#12184)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1] Add V1 support of Qwen2-VL (vllm-project#12128)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model] Support for fairseq2 Llama (vllm-project#11442)

Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>

* [Bugfix] Fix num_heads value for simple connector when tp enabled (vllm-project#12074)

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>

* [torch.compile] fix sym_tensor_indices (vllm-project#12191)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* Move linting to `pre-commit` (vllm-project#11975)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [DOC] Fix typo in docstring and assert message (vllm-project#12194)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [DOC] Add missing docstring in LLMEngine.add_request() (vllm-project#12195)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Bugfix] Fix incorrect types in LayerwiseProfileResults (vllm-project#12196)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [Model] Add Qwen2 PRM model support (vllm-project#12202)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Core] Interface for accessing model from `VllmRunner` (vllm-project#10353)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [misc] add placeholder format.sh (vllm-project#12206)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [CI/Build] Remove dummy CI steps (vllm-project#12208)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI/Build] Make pre-commit faster (vllm-project#12212)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model] Upgrade Aria to transformers 4.48 (vllm-project#12203)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [misc] print a message to suggest how to bypass commit hooks (vllm-project#12217)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [core][bugfix] configure env var during import vllm (vllm-project#12209)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [V1] Remove `_get_cache_block_size` (vllm-project#12214)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Misc] Pass `attention` to impl backend (vllm-project#12218)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Fix `HfExampleModels.find_hf_info` (vllm-project#12223)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [CI] Pass local python version explicitly to pre-commit mypy.sh (vllm-project#12224)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* Using ROCm6.3.1 base docker and building hipblas-common (#366)

* [Misc] Update CODEOWNERS (vllm-project#12229)

* fix: update platform detection for M-series arm based MacBook processors (vllm-project#12227)

Signed-off-by: isikhi <huseyin.isik000@gmail.com>

* [misc] add cuda runtime version to usage data (vllm-project#12190)

Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

* [bugfix] catch xgrammar unsupported array constraints (vllm-project#12210)

Signed-off-by: Jason Cheng <jasoncky96@gmail.com>

* [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (vllm-project#12222)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

* Add quantization and guided decoding CODEOWNERS (vllm-project#12228)

Signed-off-by: mgoin <michael@neuralmagic.com>

* [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (vllm-project#11777)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (vllm-project#12230)

Signed-off-by: NickLucche <nlucches@redhat.com>

* [ci/build] disable failed and flaky tests (vllm-project#12240)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (vllm-project#12244)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration  (vllm-project#12237)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Misc] Remove redundant TypeVar from base model (vllm-project#12248)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-project#12252)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [torch.compile] transparent compilation with more logging (vllm-project#12246)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm-project#12259)

Signed-off-by: Roger Wang <ywang@roblox.com>

* Remove pytorch comments for outlines + compressed-tensors (vllm-project#12260)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Platform] improve platforms getattr (vllm-project#12264)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [ci/build] update nightly torch for gh200 test (vllm-project#12270)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Bugfix] fix race condition that leads to wrong order of token returned (vllm-project#10802)

Signed-off-by: Jannis Schönleber <joennlae@gmail.com>

* [Kernel] fix moe_align_block_size error condition (vllm-project#12239)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

* [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types  (vllm-project#10907)

Signed-off-by: rickyx <rickyx@anyscale.com>

* [Bugfix] Multi-sequence broken (vllm-project#11898)

Signed-off-by: Andy Lo <andy@mistral.ai>

* [Misc] Remove experimental dep from tracing.py (vllm-project#12007)

Signed-off-by: Adrian Cole <adrian.cole@elastic.co>

* [Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-project#12235)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Core] Free CPU pinned memory on environment cleanup (vllm-project#10477)

* Update pre-commit.yml (#374)

* Update pre-commit.yml

* Reapplying missing format

* New codespell exclude location

---------

Co-authored-by: Kevin H. Luu <kevin@anyscale.com>

* [bugfix] moe tuning. rm is_navi() (vllm-project#12273)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (vllm-project#12277)

Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: maleksan85 <maleksan@amd.com>

* [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (vllm-project#12281)

Signed-off-by: Hongxia Yang <hongxyan@amd.com>

* [VLM] Simplify post-processing of replacement info (vllm-project#12269)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ci/lint] Add back default arg for pre-commit (vllm-project#12279)

Signed-off-by: kevin <kevin@anyscale.com>

* [CI] add docker volume prune to neuron CI (vllm-project#12291)

Signed-off-by: Liangfu Chen <liangfc@amazon.com>

* [Ci/Build] Fix mypy errors on main (vllm-project#12296)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (vllm-project#12288)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [core] separate builder init and builder prepare for each batch (vllm-project#12253)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Build] update requirements of no-device (vllm-project#12299)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [Core] Support fully transparent sleep mode (vllm-project#11743)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [VLM] Avoid unnecessary tokenization (vllm-project#12310)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Model][Bugfix]: correct Aria model output (vllm-project#12309)

Signed-off-by: xffxff <1247714429@qq.com>

* [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (vllm-project#12313)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Doc] Add docs for prompt replacement (vllm-project#12318)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Fix the error in the tip for the --lora-modules parameter (vllm-project#12319)

Signed-off-by: wangerxiao <863579016@qq.com>

* [Misc] Improve the readability of BNB error messages (vllm-project#12320)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367)

* switching detokenize flag to be False

* detokenize = False for benchmarks

* restoring default in main vllm code for detokenize

* removing extra spaces

* moving detokenize to flag

* adding support for token ids

---------

Co-authored-by: maleksan85 <maleksan@amd.com>

* [Bugfix] Fix HPU multiprocessing executor (vllm-project#12167)

Signed-off-by: Konrad Zawora <kzawora@habana.ai>

* [Core] Support `reset_prefix_cache` (vllm-project#12284)

* [Frontend][V1] Online serving performance improvements (vllm-project#12287)

* [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (vllm-project#12282)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

* FP8 FA fixes (#381)

* FP8 FA fixes

Summary:
Add missing clamp and fix reciprocal scale computation.

* linter

* Returning the use of the proper stream in allreduce (#382)

* [Bugfix] Fixing  AMD LoRA CI test. (vllm-project#12329)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>

* [Docs] Update FP8 KV Cache documentation (vllm-project#12238)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

* [Docs] Document vulnerability disclosure process (vllm-project#12326)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [V1] Add `uncache_blocks` (vllm-project#12333)

* [doc] explain common errors around torch.compile (vllm-project#12340)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (vllm-project#12338)

Signed-off-by: zhenwei <zhenweiliu@habana.ai>

* [Bugfix] Fix k_proj's bias for whisper self attention (vllm-project#12342)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [Kernel] Flash Attention 3 Support (vllm-project#12093)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Doc] Troubleshooting errors during model inspection (vllm-project#12351)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [V1] Simplify M-RoPE (vllm-project#12352)

Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: imkero <kerorek@outlook.com>

* [Bugfix] Fix broken internvl2 inference with v1 (vllm-project#12360)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [core] add wake_up doc and some sanity check (vllm-project#12361)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [torch.compile] decouple compile sizes and cudagraph sizes (vllm-project#12243)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [FP8][Kernel] Dynamic kv cache scaling factors computation (vllm-project#11906)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>

* [TPU] Update TPU CI to use torchxla nightly on 20250122 (vllm-project#12334)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Docs] Document Phi-4 support (vllm-project#12362)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order  (vllm-project#11528)

Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (vllm-project#12357)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Docs] Add meetup slides (vllm-project#12345)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* Using pytorch commit past the point when rowwise PR (pytorch/pytorch#144432) was merged (#384)

* Integrated ater: kvcache pa gemm rmsnorm

* fix pa

* fix

* replace topk softmax

* [Docs] Update spec decode + structured output in compat matrix (vllm-project#12373)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* replace fp moe kernel with aiter kernel

* [V1][Frontend] Coalesce bunched `RequestOutput`s (vllm-project#12298)

Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>

* Set weights_only=True when using torch.load() (vllm-project#12366)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

* [Bugfix] Path join when building local path for S3 clone (vllm-project#12353)

Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>

* change ater to aiter

* Update compressed-tensors version (vllm-project#12367)

* [V1] Increase default batch size for H100/H200 (vllm-project#12369)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [perf] fix perf regression from vllm-project#12253 (vllm-project#12380)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [ci/build] fix wheel size check (vllm-project#12396)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382)

* [ci/build] sync default value for wheel size (vllm-project#12398)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Enable proxy support in benchmark script (vllm-project#12356)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Applying scales rename to fp8 config

* Applying scales rename to fp8 config (#387)

* Update Dockerfile.rocm

* [Misc] Remove deprecated code (vllm-project#12383)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Using aiter moe kernel

* Dev-docker Documentation Updates (#378)

* Dev-docker Documentation Updates

Minor updates to several sections, with links to other documents where appropriate.

* Fix formatting of GEMM filename

* README cleanup

- Reorder some sections of the README to make them easier to follow
- Improve formatting of bash commands
- Prefer use of huggingface model names instead of hard-coded directories
- Clean up wording

* Expanded sample commands for Latency and Throughput

* Fix markdown links

* Fix pre-commit errors

* Updates from review

Initial updates to incorporate feedback from a review session held with @t-parry

* Update script args to match current recommendations

* Remove recommended max-num-seqs values for now

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

* [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413)

* [Bugfix] Fix BLIP-2 processing (vllm-project#12412)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094)

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

* [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439)

Signed-off-by: Roger Wang <ywang@roblox.com>

* fix pa copy

* pa update

* [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445)

* [Frontend] generation_config.json for maximum tokens (vllm-project#12242)

Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* add fp16 pa support for aiter

* [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450)

* [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Build/CI] Fix libcuda.so linkage (vllm-project#12424)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [Frontend] Rerank API (Jina- and Cohere-compatible API)  (vllm-project#12376)

Signed-off-by: Kyle Mistele <kyle@mistele.com>

* [DOC] Add link to vLLM blog (vllm-project#12460)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [V1] Avoid list creation in input preparation (vllm-project#12457)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Frontend] Support scores endpoint in run_batch (vllm-project#12430)

Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>

* [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467)

Signed-off-by: Isotr0py <2037008807@qq.com>

* aiter build instructions

* [Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-project#12339)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Copy to the right path

* [V1][Metrics] Add initial Prometheus logger (vllm-project#12416)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* [V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#12469)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [FlashInfer] Upgrade to 0.2.0 (vllm-project#11194)

Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>

* Support FP8 FA from Quark format (#388)

* Support FP8 FA from Quark format

* Support FP8 FA from Quark format

* nit: update comment

* Direct call on ROCm

* 20250127 docs update (#392)

* updating code blocks

* typo

* updated manifest

* Including feedback

* whitespace

* Deepseek instructions

* hyperlink fix

* hyperlink fix

* updating what is new

* cpx update

* typo

* whitespace

* whitespace

* Add env var toggles to disable AITER MoE or PA (both on by default)

* Update accuracy benchmark for batch size > 1

* Add a few more AITER toggles for norm and linear layers

* Faster Custom Paged Attention kernels (#372)

* integrate new cpa kernel, update tests and benchmark

* added comments to mfma4 kernel

* further comments for mfma16 kernel

* clang-format

* Lint

* add flag for logits rtz conversion and disable by default

* lint

* [Bugfix]: Fix paged attention unit tests of #372 (#389)

* [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`,`csrc/attention/paged_attention_v2.cu` and  `csrc/rocm/attention.cu`.

* improve code documentation.

* lint

---------

Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

---------

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>

* Using a more precise profiling on ROCm to properly account for weights padding (#394)

* Public aiter repo

* Fail if aiter build failed silently

* Aiter can only be built on MI300x

* Typo fix

* Aiter PA off by default

* Changes to support updated aiter FP8 PA

* Support FP8 and INT8 KV cache according to ROCm/aiter#90

* add moe weight shuffle for dynamic quant and unquantized path

Signed-off-by: charlifu <charlifu@amd.com>

* Use FP16-native PA after support in ROCm/aiter#97

* Fix: Use FP8 pertoken quantize if KV cache dtype is FP8

* revert rocm_flash_attn.py line 883

* Don't enable by default to use an RC for main vllm-dev docker

* use ck moe for bf16 and fp16 fused_moe

* Merge remote-tracking branch 'origin/aiter_intergration_final' into merge-aiter-llama-fp8

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* [Bugfix] include moe shuffle env variable

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

---------

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: yan ma <yan.ma@intel.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Signed-off-by: Ye Qi <yeq@meta.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Fred Reiss <frreiss@us.ibm.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: hongxyan <hongxyan@amd.com>
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Signed-off-by: Martin Gleize <mgleize@meta.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: isikhi <huseyin.isik000@gmail.com>
Signed-off-by: Jason Cheng <jasoncky96@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
Signed-off-by: maleksan85 <maleksan@amd.com>
Signed-off-by: Hongxia Yang <hongxyan@amd.com>
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: wangerxiao <863579016@qq.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: zhenwei <zhenweiliu@habana.ai>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Kyle Mistele <kyle@mistele.com>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: YiSheng5 <yi.sheng@intel.com>
Co-authored-by: Zhonghua Deng <abatom@163.com>
Co-authored-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuan <yuan.zhou@intel.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: jiangjiadi <34134495+jiangjiadi@users.noreply.github.com>
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com>
Co-authored-by: sroy745 <142070531+sroy745@users.noreply.github.com>
Co-authored-by: Jie Fu (傅杰) <jiefu@tencent.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: WangErXiao <863579016@qq.com>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Wallas Henrique <wallashss@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com>
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Charles Frye <cfrye59@gmail.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: cennn <61925104+cennn@users.noreply.github.com>
Co-authored-by: Kuntai Du <kuntai@uchicago.edu>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: minmin <rmm0811@gmail.com>
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Fred Reiss <frreiss@us.ibm.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Keyun Tong <tongkeyun@gmail.com>
Co-authored-by: RunningLeon <maningsheng@sensetime.com>
Co-authored-by: kewang-xlnx <73578509+kewang-xlnx@users.noreply.github.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: tvirolai-amd <teemu.virolainen@amd.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: charlifu <charlifu@amd.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: yancong <32220263+ice-tong@users.noreply.github.com>
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
Co-authored-by: gujing <925973396@qq.com>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: Işık <41375111+isikhi@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cheng Kuan Yong Jason <jasoncky96@gmail.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Jannis Schönleber <joennlae@gmail.com>
Co-authored-by: Ricky Xu <xuchen727@hotmail.com>
Co-authored-by: Andy Lo <andylolu24@gmail.com>
Co-authored-by: Adrian Cole <64215+codefromthecrypt@users.noreply.github.com>
Co-authored-by: Jani Monoses <jani.monoses@gmail.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: maleksan85 <maleksan@amd.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: zhou fan <1247714429@qq.com>
Co-authored-by: ilia-cher <30845429+ilia-cher@users.noreply.github.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: liuzhenwei <zhenweiliu@habana.ai>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: omer-dayan <omer@run.ai>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: ValarLip <340077269@qq.com>
Co-authored-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Co-authored-by: Kyle Mistele <kyle@mistele.com>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Bowen Wang <abmfy@icloud.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: sanyalington <shomy.sanyal@amd.com>
Co-authored-by: Joe Shajrawi <17753158+shajrawi@users.noreply.github.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: charlifu <chalifu@amd.com>