[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) #12676
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀

This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 59ab887 to 4d3d413
Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Force-pushed from 4d3d413 to bd75f96
Great find! Makes a lot of sense!
vllm/worker/cache_engine.py
Outdated
if current_platform.is_cuda() and envs.VLLM_CUDA_MEM_ALIGN_KV_CACHE:
    alloc_entry_size = align_to_256bytes(entry_size, self.dtype)
else:
    alloc_entry_size = entry_size
alloc_shape = (*kv_cache_shape[:2], alloc_entry_size)

for _ in range(self.num_attention_layers):
    # null block in CpuGpuBlockAllocator requires at least that
    # block to be zeroed-out.
    # We zero-out everything for simplicity.
-   kv_cache.append(
-       torch.zeros(kv_cache_shape,
-                   dtype=self.dtype,
-                   pin_memory=pin_memory,
-                   device=device))
+   layer_kv_cache = torch.zeros(alloc_shape,
+                                dtype=self.dtype,
+                                pin_memory=pin_memory,
+                                device=device)
+
+   if alloc_entry_size != entry_size:
+       layer_kv_cache = layer_kv_cache[..., :entry_size]
+
+   kv_cache.append(layer_kv_cache.view(kv_cache_shape))
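For readers following along, here is a minimal sketch of what an alignment helper like align_to_256bytes could look like; the actual helper in this PR may differ, and the sketch assumes entry_size counts elements of dtype rather than bytes:

    import torch

    def align_to_256bytes(entry_size: int, dtype: torch.dtype) -> int:
        # Round a per-entry element count up so each entry starts on a
        # 256-byte boundary. entry_size is in elements of `dtype`, not bytes.
        elem_bytes = torch.tensor([], dtype=dtype).element_size()
        entry_bytes = entry_size * elem_bytes
        aligned_bytes = ((entry_bytes + 255) // 256) * 256
        return aligned_bytes // elem_bytes

    # e.g. for MLA with bf16: align_to_256bytes(576, torch.bfloat16) -> 640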
The implementation looks good to me. A couple of comments noting what the padding and views are doing would be nice to make it a little easier to follow (as well as noting that this is kind of a special case for MLA).
vllm/envs.py
Outdated
@@ -539,6 +540,15 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     "VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON":
     lambda: bool(int(os.getenv("VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON", "0"))
     ),
+
+    # When on an Nvidia GPU, aligns single entries (within a page) so they are 256
+    # byte aligned for better performance; this increases the memory usage of
vllm/worker/cache_engine.py
Outdated
@@ -75,15 +80,30 @@ def _allocate_kv_cache(
         num_blocks, self.block_size, self.num_kv_heads, self.head_size)
     pin_memory = is_pin_memory_available() if device == "cpu" else False
     kv_cache: List[torch.Tensor] = []
+
+    entry_shape = kv_cache_shape[2:]
We should assert that the number of dimensions is what we expect, and/or possibly reverse-index to deal with the different shapes. For instance:

Flash attention has 5 dims:
    return (2, num_blocks, block_size, num_kv_heads, head_size)

Pallas attention has 4 dims (vllm/attention/backends/pallas.py, line 40 in a1a2aaa):
    return (num_kv_heads, num_blocks, block_size, head_size)

Triton MLA has 3 dims:
    return (num_blocks, block_size, head_size)
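To make the suggestion concrete, a hypothetical reverse-indexing variant (illustrative only, not code from this PR; it reuses the align_to_256bytes sketch above) could assert a sane rank and pad just the innermost, contiguous dimension regardless of layout:

    import torch

    def aligned_alloc_shape(kv_cache_shape: tuple, dtype: torch.dtype) -> tuple:
        # Reverse-index so the same code handles the 5-dim (FlashAttention),
        # 4-dim (Pallas) and 3-dim (Triton MLA) layouts above: only the last
        # (contiguous) dim, i.e. head_size, gets padded.
        assert len(kv_cache_shape) >= 3, (
            f"unexpected KV cache shape: {kv_cache_shape}")
        *outer, head_size = kv_cache_shape
        return (*outer, align_to_256bytes(head_size, dtype))

    # e.g. aligned_alloc_shape((1024, 16, 576), torch.bfloat16) -> (1024, 16, 640)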
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Sorry! Hold on, there may be accuracy issues. Edit: accuracy issues resolved.
"VLLM_CUDA_MEM_ALIGN_KV_CACHE": | ||
lambda: bool(int(os.getenv("VLLM_CUDA_MEM_ALIGN_KV_CACHE", "1"))), |
Do we need to flag this? I think we can just default to this behavior without switching back.
I agree with the concern about reducing the cache space by ~11%, although maybe we consider this change necessary enough for performance to remove the choice, like you say.

This means for MLA with a head dim of 576 (like DeepSeek V2/V3) and an fp16/bf16 cache, we allocate 640 elements per cache entry instead of 576 (1280 bytes instead of 1152). This increases the size of the cache by ~11% (wasted), but leads to a worthwhile performance gain.
Given that it can increase the size of the KV cache, I wanted it on by default but with a flag to turn it off in case a user really wants to maximize KV-cache size.
I think we can just turn this on by default?
It is on by default already (the default value is "1").
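For anyone who does want to reclaim that cache space, a small sketch of opting out (assuming, as with vLLM's other env flags, that the variable just needs to be set before the engine is created):

    import os

    # "0" disables the 256-byte padding, trading the MLA perf gain for the
    # ~11% of KV-cache space it would otherwise consume.
    os.environ["VLLM_CUDA_MEM_ALIGN_KV_CACHE"] = "0"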
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
…(MLA perf improvement)
def copy_blocks_mla(kv_caches: List[torch.Tensor],
                    block_mapping: torch.Tensor) -> None:
    torch.ops._C_cache_ops.copy_blocks_mla(kv_caches, block_mapping)
Why didn't we need this kernel before?
I think we did..., and I think this may solve some bugs (TBH I'm not sure how copy_blocks is used by the wider system).
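For context, a hedged sketch of how this would presumably be invoked; the [src, dst] pair layout of block_mapping is an assumption carried over from the existing non-MLA copy_blocks convention, and the cache shapes are purely illustrative:

    from typing import List

    import torch

    # Illustrative shapes: one combined MLA cache tensor per layer, laid out
    # as (num_blocks, block_size, head_size) = (8, 16, 576).
    kv_caches: List[torch.Tensor] = [
        torch.zeros(8, 16, 576, dtype=torch.bfloat16, device="cuda")
        for _ in range(2)
    ]
    # Assumed convention: [num_pairs, 2] int64 tensor of [src_block, dst_block].
    block_mapping = torch.tensor([[0, 4], [1, 5]],
                                 dtype=torch.int64, device="cuda")
    copy_blocks_mla(kv_caches, block_mapping)  # copies block 0->4 and 1->5 per layer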
Great work -- LGTM!
…llm-project#12676) Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Felix Marty <felmarty@amd.com>
…llm-project#12676) Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu>
Hi, I'm using an 8xH200 setup and unable to reproduce the results from benchmarks/benchmark_throughput.py using the latest 0.7.2 version. My results:
With VLLM_CUDA_MEM_ALIGN_KV_CACHE=1:
The throughput seems nearly identical in both cases. Could you suggest potential causes for this discrepancy?
@leepoly what model is this? This only affects MLA (i.e. DeepSeek V2/3). Edit: nvm, I assume you are using R1 since the numbers look very comparable; I'll try to re-run the numbers tomorrow to see if there is something weird going on.
Yes, I use the DeepSeek V3 model, and I simply used the script you provided. Even with VLLM_CUDA_MEM_ALIGN_KV_CACHE=0, the reported throughput (1.06 rps) already roughly matches your results with 256B alignment (1.10 rps).
Generally, Nvidia hardware likes 256-byte alignment (the reasons are foggy due to the blackbox nature of Nvidia hardware), and memory allocated via the CUDA Runtime is guaranteed to be 256-byte aligned (see https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#a-sequential-but-misaligned-access-pattern).
This PR aligns KV cache entries to start on 256-byte boundaries. This mainly targets MLA, since for "normal" attention with typical head dims (say 64 or 128) the entries are naturally 256-byte aligned.
This means for MLA with a head dim of 576 (like DeepSeek V2/V3) and an fp16/bf16 cache, we allocate 640 elements per cache entry instead of 576 (1280 bytes instead of 1152). This increases the size of the cache by ~11% (wasted), but leads to a worthwhile performance gain.
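As a quick sanity check on those numbers, a sketch assuming a 2-byte fp16/bf16 element size:

    elem_bytes = 2                                     # fp16 / bf16
    entry_bytes = 576 * elem_bytes                     # 1152 bytes per entry
    padded_bytes = ((entry_bytes + 255) // 256) * 256  # 1280 bytes
    padded_elems = padded_bytes // elem_bytes          # 640 elements
    overhead = padded_bytes / entry_bytes - 1          # ~0.11, i.e. ~11% extra

    # By contrast, a "normal" fp16 head dim of 128 is 128 * 2 = 256 bytes,
    # already a multiple of 256, so no padding is needed.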
Results DeepSeek-R1 on 8xH200
Accuracy: