
[Feature] Enhancement on Sparse Attention and KV-Cache Compression #2946

Open
shadowpa0327 opened this issue Jan 17, 2025 · 1 comment
shadowpa0327 commented Jan 17, 2025


Motivation

The current implementation of KV-Cache compression in SGLang provides robust support for DoubleSparsity (approximate attention with token selection, #1459) and FP8 quantization (#2786), enabling effective compression for long-context inference. However, there are two significant opportunities to further enhance these capabilities:

1. Support for Quest (Improved Approximate Attention Method) [Short term]

Recent studies, such as the HashAttention paper [1], indicate that Quest [2] provides a more accurate attention approximation than DoubleSparsity. By leveraging more precise attention approximations, we could improve accuracy and enable higher sparsity in cases where precision is critical.
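For concreteness, below is a minimal, self-contained sketch of Quest-style page selection. This is illustrative only, under assumed shapes for a single head; it is not SGLang's kernels or Quest's reference implementation. Each KV page keeps channel-wise min/max key metadata, which upper-bounds the attention score any token in the page can achieve, and only the top-k pages are loaded for exact attention:

```python
import torch

def quest_select_pages(query, key_cache, page_size=16, top_k=4):
    """query: (head_dim,); key_cache: (seq_len, head_dim). Returns page indices to load."""
    seq_len, head_dim = key_cache.shape
    num_pages = seq_len // page_size
    pages = key_cache[: num_pages * page_size].view(num_pages, page_size, head_dim)

    # Channel-wise min/max per page; in Quest this metadata is built once at prefill.
    page_min = pages.min(dim=1).values  # (num_pages, head_dim)
    page_max = pages.max(dim=1).values  # (num_pages, head_dim)

    # Upper bound on q @ k for any k in the page: per channel, take whichever
    # extreme maximizes the product with the query element, then sum over channels.
    upper_bound = torch.maximum(query * page_min, query * page_max).sum(dim=-1)

    # Load only the highest-bound pages; exact attention then runs on just those.
    return upper_bound.topk(min(top_k, num_pages)).indices
```

Because the bound is query-dependent but computed per page, the metadata overhead stays small while capturing importance that the fixed channel statistics used by DoubleSparsity can miss.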

2. Support Combination with KV-Compression Approaches [Long term]

DoubleSparsity retains all tokens in memory while selectively loading them for processing, so memory requirements grow with context length. CPU offloading is a feasible workaround, but a more promising enhancement is to support token selection jointly with quantization (see the sketch after this list). This combined approach would:

  • Reduce memory requirements through quantization.
  • Reduce latency by fusing dequantization into the attention kernels, optimizing runtime performance.
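As a rough sketch of the combined approach (the int8 per-token quantization scheme and the helper names here are assumptions for illustration, not SGLang interfaces), the idea is to gather only the selected tokens and dequantize them on the fly inside the attention step, so a full-precision cache is never materialized:

```python
import torch

def quantize_kv(kv):
    """kv: (seq_len, head_dim) fp16/fp32 -> int8 values plus per-token scales."""
    scale = (kv.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def attend_selected(query, k_int8, k_scale, v_int8, v_scale, selected):
    """Attend over only the selected token indices, dequantizing just those rows."""
    k = k_int8[selected].float() * k_scale[selected]
    v = v_int8[selected].float() * v_scale[selected]
    scores = torch.softmax((query.float() @ k.T) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v
```

In a production path, the gather and dequantization would be fused into the attention kernel itself (e.g., in Triton or CUDA) rather than expressed as separate tensor ops, which is where the latency benefit comes from.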

Additionally, supporting token eviction methods, which permanently drop unimportant tokens, could further address memory constraints. As highlighted in #2510, token eviction methods such as SnapKV [3] and PyramidKV [4] would complement token selection by enabling aggressive memory management in scenarios with long contexts or resource limitations, such as streaming applications or low-resource deployments.
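To make the eviction idea concrete, here is a minimal SnapKV-style sketch (single head, assumed shapes and parameter values; not SnapKV's reference implementation): attention weights from the last observation window of the prompt vote on which earlier KV entries to keep, and everything outside the retained set is dropped permanently, unlike selection methods that keep all tokens resident:

```python
import torch
import torch.nn.functional as F

def snapkv_keep_indices(queries, keys, window=32, budget=256, pool=7):
    """queries/keys: (seq_len, head_dim), seq_len > window. Returns KV indices to retain."""
    seq_len, head_dim = keys.shape
    obs_q = queries[-window:]  # observation window at the end of the prompt
    scores = torch.softmax(obs_q @ keys[:-window].T / head_dim ** 0.5, dim=-1)
    votes = scores.sum(dim=0)  # accumulated importance of each prefix token
    # Max-pooling retains small clusters of neighboring tokens, as SnapKV suggests.
    votes = F.max_pool1d(votes[None, None], pool, stride=1, padding=pool // 2)[0, 0]
    keep_prefix = votes.topk(min(budget, votes.numel())).indices
    window_idx = torch.arange(seq_len - window, seq_len)  # always keep the window
    return torch.cat([keep_prefix, window_idx]).sort().values
```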

Expected Improvement

By incorporating these two enhancements, SGLang can achieve:

  • Improved Accuracy: Leveraging Quest for approximate attention will improve accuracy and inference reliability.
  • Enhanced Memory Savings: Combining token selection with quantization, and enabling token eviction, will significantly reduce memory requirements, ensuring scalability across deployment scenarios.

Related resources

  1. HashAttention: https://arxiv.org/pdf/2412.14468v1
  2. Quest: https://arxiv.org/abs/2406.10774
  3. SnapKV: https://arxiv.org/abs/2404.14469
  4. PyramidKV: https://arxiv.org/abs/2406.02069

cc @merrymercy

shadowpa0327 changed the title from “[Feature] Enhancement on SparseAttention and KV-Cache Compression” to “[Feature] Enhancement on Sparse Attention and KV-Cache Compression” on Jan 17, 2025
@zhaochenyang20 (Collaborator) commented:
Thanks. I will ask @andy-yang-1 for help.
