Motivation
The current implementation of KV-Cache compression in SGLang provides robust support for DoubleSparsity (approximated attention with token selection, #1459) and FP8 quantization (#2786), enabling effective compression for long-context inference. However, two significant opportunities exist to further enhance its capabilities:
1. Support for Quest (Improved Approximated Attention Method) [Short term]
Recent studies, such as the HashAttention paper [1], indicate that Quest [2] provides a more accurate attention approximation than DoubleSparsity. By leveraging a more precise approximation, we could improve accuracy and enable higher sparsity in cases where precision is critical.
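For concreteness, below is a minimal PyTorch sketch of Quest's page-selection criterion as described in the paper: each KV page stores the elementwise min and max of its keys, and a page is ranked by a channel-wise upper bound on the attention logit. The function name and shapes are illustrative only, not an existing SGLang API.

```python
import torch

def quest_select_pages(q, k_min, k_max, page_budget):
    """Rank KV pages by Quest's upper bound on the attention logit and
    return the indices of the pages worth loading for exact attention.

    q:      [head_dim]            query vector for one head at a decode step
    k_min:  [num_pages, head_dim] elementwise min of the keys in each page
    k_max:  [num_pages, head_dim] elementwise max of the keys in each page
    """
    # For any key k in a page, q[i] * k[i] <= max(q[i] * k_min[i], q[i] * k_max[i]);
    # summing over channels bounds the full dot product q @ k from above.
    upper = torch.maximum(q * k_min, q * k_max).sum(dim=-1)  # [num_pages]
    # Attend exactly over only the highest-bound pages.
    return torch.topk(upper, k=min(page_budget, upper.numel())).indices
```

Because the bound never underestimates a page's true maximum logit, genuinely important tokens are unlikely to be skipped, which is what allows higher sparsity at matched accuracy.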
2. Support Combination with KV-Cache Compression Approaches [Long term]
DoubleSparsity retains all tokens in memory while selectively loading them for processing. As the context length grows, this retention puts increasing pressure on memory capacity. While CPU offloading provides a feasible workaround, a promising enhancement is to jointly support token selection with quantization. This combined approach would:
Reduce memory requirements through quantization.
Reduce latency by fusing dequantization into the attention kernels, improving runtime performance (see the sketch below).
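As a rough illustration of the second point, the sketch below gathers only the selected tokens from an int8 KV cache and dequantizes them at load time, so the full-precision cache never materializes. In practice this would be fused into the attention kernel (e.g., in Triton) rather than done with separate tensor ops; the names, shapes, and per-token scaling scheme are assumptions for illustration.

```python
import torch

def gather_dequant_kv(k_cache_q, v_cache_q, k_scales, v_scales, token_idx):
    """Gather the tokens chosen by token selection from an int8 KV cache
    and dequantize them on the fly.

    k_cache_q, v_cache_q: [num_tokens, head_dim] int8 quantized caches
    k_scales,  v_scales:  [num_tokens, 1]        per-token dequant scales
    token_idx:            [num_selected]         indices from token selection
    """
    k = k_cache_q[token_idx].to(torch.float16) * k_scales[token_idx]
    v = v_cache_q[token_idx].to(torch.float16) * v_scales[token_idx]
    return k, v  # ready for exact attention over the selected tokens
```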
Additionally, supporting token eviction methods—permanently dropping non-important tokens—could further address memory constraints. As highlighted in #2510, token eviction methods (such as SnapKV [3] and PyramidKV [4]) would complement token selection by enabling aggressive memory management in scenarios with long contexts or resource limitations, such as streaming applications or low-resource deployments.
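To make the eviction idea concrete, here is a simplified SnapKV-style scoring sketch: queries in an observation window at the end of the prompt vote for prefix tokens, the votes are smoothed with 1D pooling so that clusters of related tokens survive together, and only the top-budget prefix tokens plus the window itself are kept. This simplification aggregates over heads, whereas SnapKV selects per head; the function is hypothetical, not SGLang code.

```python
import torch
import torch.nn.functional as F

def snapkv_keep_indices(attn, window, budget, kernel_size=5):
    """Pick which prefix tokens to keep, SnapKV-style.

    attn: [num_heads, q_len, kv_len] softmax attention weights of the prompt
    """
    num_heads, q_len, kv_len = attn.shape
    # Attention mass that the last `window` queries place on earlier tokens.
    votes = attn[:, -window:, : kv_len - window].sum(dim=(0, 1))
    # Pooling keeps contiguous neighborhoods instead of isolated tokens.
    votes = F.avg_pool1d(votes[None, None], kernel_size, stride=1,
                         padding=kernel_size // 2).reshape(-1)
    keep = torch.topk(votes, k=min(budget, votes.numel())).indices
    # The observation window itself is always retained.
    tail = torch.arange(kv_len - window, kv_len)
    return torch.cat([keep, tail]).sort().values
```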
Expected Improvement
By incorporating these two enhancements, SGLang can achieve:
Improved Accuracy: Leveraging Quest for approximated attention will improve model quality and inference reliability under high sparsity.
Enhanced Memory Savings: Combining token selection with quantization and enabling token eviction will substantially reduce memory requirements, ensuring scalability across deployment scenarios.
Related resources
[1] HashAttention: Semantic Sparsity for Faster Inference
[2] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
[3] SnapKV: LLM Knows What You are Looking for Before Generation
[4] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
cc @merrymercy