Awesome-KV-Cache-Management

A Survey on Large Language Model Acceleration based on KV Cache Management [PDF]

Haoyang Li 1, Yiming Li 2, Anxin Tian 2, Tianhao Tang 2, Zhanchao Xu 4, Xuejia Chen 4, Nicole Hu 3, Wei Dong 5, Qing Li 1, Lei Chen 2

1Hong Kong Polytechnic University, 2Hong Kong University of Science and Technology, 3The Chinese University of Hong Kong, 4Huazhong University of Science and Technology, 5Nanyang Technological University.

  • This repository is dedicated to recording KV Cache Management papers for LLM acceleration. The survey will be updated regularly.

  • If you find this survey helpful for your work, please consider citing it.

  @article{li2024surveylargelanguagemodel,
      title={A Survey on Large Language Model Acceleration based on KV Cache Management}, 
      author={Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen},
      journal={arXiv preprint arXiv:2412.19442},
      year={2024}
  }
  • If you would like your paper to be included in this survey and repository, or to suggest any modifications, please feel free to email haoyang-comp.li@polyu.edu.hk or open an issue with your paper's title, category, and a brief summary highlighting its key techniques. Thank you!

Taxonomy and Papers


Token-level Optimization

KV Cache Selection

Static KV Cache Selection (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | Static KV Cache Selection | ICLR | Link | |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | Static KV Cache Selection | NeurIPS | Link | Link |
| 2024 | In-context KV-Cache Eviction for LLMs via Attention-Gate | Static KV Cache Selection | arXiv | Link | |

Dynamic Selection with Permanent Eviction (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | Dynamic Selection with Permanent Eviction | MLSys | Link | |
| 2024 | BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference | Dynamic Selection with Permanent Eviction | arXiv | Link | Link |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | Dynamic Selection with Permanent Eviction | ACL | Link | Link |
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Dynamic Selection with Permanent Eviction | NeurIPS | Link | Link |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | Dynamic Selection with Permanent Eviction | NeurIPS | Link | |

Dynamic Selection without Permanent Eviction (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Dynamic Selection without Permanent Eviction | ICML | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | Squeezed Attention: Accelerating Long Context Length LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Human-like Episodic Memory for Infinite Context LLMs | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Dynamic Selection without Permanent Eviction | arXiv | Link | |

KV Cache Budget Allocation

Layer-wise Budget Allocation (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | Layer-wise Budget Allocation | Findings | Link | Link |
| 2024 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | Layer-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction | Layer-wise Budget Allocation | arXiv | Link | Link |

Head-wise Budget Allocation (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective | Head-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Head-wise Budget Allocation | arXiv | Link | Link |
| 2024 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Head-wise Budget Allocation | arXiv | Link | Link |

KV Cache Merging

Intra-layer Merging (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Compressed Context Memory for Online Language Model Interaction | Intra-layer Merging | ICLR | Link | Link |
| 2024 | LoMA: Lossless Compressed Memory Attention | Intra-layer Merging | arXiv | Link | |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | Intra-layer Merging | arXiv | Link | |
| 2024 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Intra-layer Merging | arXiv | Link | Link |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Intra-layer Merging | EMNLP | Link | Link |
| 2024 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Intra-layer Merging | arXiv | Link | |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | Intra-layer Merging | arXiv | Link | |

Cross-layer Merging (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | Cross-layer Merging | arXiv | Link | Link |
| 2024 | KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cross-Layer Sharing | Cross-layer Merging | arXiv | Link | Link |

KV Cache Quantization

Fixed-precision Quantization (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Fixed-precision Quantization | arXiv | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Fixed-precision Quantization | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Fixed-precision Quantization | ICML | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Fixed-precision Quantization | NeurIPS | Link | Link |

Mixed-precision Quantization (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | Mixed-precision Quantization | arXiv | Link | |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Mixed-precision Quantization | arXiv | Link | |

Outlier Redistribution (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Massive Activations in Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Outlier Redistribution | arXiv | Link | Link |
| 2024 | SpinQuant: LLM Quantization with Learned Rotations | Outlier Redistribution | arXiv | Link | Link |
| 2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | Outlier Redistribution | NeurIPS | Link | Link |
| 2024 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Outlier Redistribution | ICML | Link | Link |
| 2024 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | Outlier Redistribution | EMNLP | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | Outlier Redistribution | arXiv | Link | Link |
| 2024 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration | Outlier Redistribution | MLSys | Link | Link |
| 2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2023 | Training Transformers with 4-bit Integers | Outlier Redistribution | NeurIPS | Link | Link |

KV Cache Low-rank Decomposition

Singular Value Decomposition (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Effectively Compress KV Heads for LLM | Singular Value Decomposition | arXiv | Link | |
| 2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference | Singular Value Decomposition | arXiv | Link | |
| 2024 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Singular Value Decomposition | arXiv | Link | |
| 2024 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | Singular Value Decomposition | arXiv | Link | Link |

Tensor Decomposition (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | Tensor Decomposition | ACL | Link | Link |

Learned Low-rank Approximation (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | Learned Low-rank Approximation | arXiv | Link | Link |
| 2024 | MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection | Learned Low-rank Approximation | arXiv | Link | |

Model-level Optimization

Attention Grouping and Sharing

Intra-Layer Grouping (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2019 | Fast Transformer Decoding: One Write-Head is All You Need | Intra-Layer Grouping | arXiv | Link | |
| 2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Intra-Layer Grouping | EMNLP | Link | Link |
| 2024 | Optimised Grouped-Query Attention Mechanism for Transformers | Intra-Layer Grouping | ICML | Link | |
| 2024 | Weighted Grouped Query Attention in Transformers | Intra-Layer Grouping | arXiv | Link | |
| 2024 | QCQA: Quality and Capacity-aware grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Non-official Link |
| 2024 | Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Link |
| 2023 | GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | Intra-Layer Grouping | NeurIPS | Link | |

Cross-Layer Sharing (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Cross-Layer Sharing | arXiv | Link | Non-official Link |
| 2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | Cross-Layer Sharing | ACL | Link | Link |
| 2024 | Beyond KV Caching: Shared Attention for Efficient LLMs | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | Cross-layer Attention Sharing for Large Language Models | Cross-Layer Sharing | arXiv | Link | |
| 2024 | A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference | Cross-Layer Sharing | arXiv | Link | |
| 2024 | Lossless KV Cache Compression to 2% | Cross-Layer Sharing | arXiv | Link | |
| 2024 | DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion | Cross-Layer Sharing | NeurIPS | Link | |
| 2024 | Value Residual Learning For Alleviating Attention Concentration In Transformers | Cross-Layer Sharing | arXiv | Link | Link |

Architecture Alteration

Enhanced Attention (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Enhanced Attention | arXiv | Link | Link |
| 2022 | Transformer Quality in Linear Time | Enhanced Attention | ICML | Link | |
| 2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Enhanced Attention | arXiv | Link | |

Augmented Architecture (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | Augmented Architecture | arXiv | Link | Link |
| 2024 | Long-Context Language Modeling with Parallel Context Encoding | Augmented Architecture | ACL | Link | Link |
| 2024 | XC-CACHE: Cross-Attending to Cached Context for Efficient LLM Inference | Augmented Architecture | Findings | Link | |
| 2024 | Block Transformer: Global-to-Local Language Modeling for Fast Inference | Augmented Architecture | arXiv | Link | Link |

Non-transformer Architecture

Adaptive Sequence Processing Architecture (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2023 | RWKV: Reinventing RNNs for the Transformer Era | Adaptive Sequence Processing Architecture | Findings | Link | Link |
| 2024 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2023 | Retentive Network: A Successor to Transformer for Large Language Models | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2024 | MCSD: An Efficient Language Model with Diverse Fusion | Adaptive Sequence Processing Architecture | arXiv | Link | |

Hybrid Architecture (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling | Hybrid Architecture | IOS Press | Link | |
| 2024 | GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression | Hybrid Architecture | arXiv | Link | Link |
| 2024 | RecurFormer: Not All Transformer Heads Need Self-Attention | Hybrid Architecture | arXiv | Link | |

System-level Optimization

Memory Management

Architectural Design (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Architectural Design | arXiv | Link | Link |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Architectural Design | arXiv | Link | |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Architectural Design | SOSP | Link | Link |

Prefix-aware Design (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Prefix-aware Design | ACL | Link | Link |
| 2024 | MemServe: Flexible MemPool for Building Disaggregated LLM Serving with Caching | Prefix-aware Design | arXiv | Link | |

Scheduling

Prefix-aware Scheduling (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Prefix-aware Scheduling | arXiv | Link | |
| 2024 | SGLang: Efficient Execution of Structured Language Model Programs | Prefix-aware Scheduling | NeurIPS | Link | Link |

Preemptive and Fairness-oriented Scheduling (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Fast Distributed Inference Serving for Large Language Models | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |

Layer-specific and Hierarchical Scheduling (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Layer-specific and Hierarchical Scheduling | arXiv | Link | Link |
| 2024 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Layer-specific and Hierarchical Scheduling | USENIX ATC | Link | |
| 2024 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Layer-specific and Hierarchical Scheduling | ISCA | Link | |
| 2024 | Fast Inference for Augmented Large Language Models | Layer-specific and Hierarchical Scheduling | arXiv | Link | |

Hardware-aware Design

Single/Multi-GPU Design (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Single/Multi-GPU Design | arXiv | Link | Link |
| 2024 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Single/Multi-GPU Design | OSDI | Link | Link |
| 2024 | Multi-Bin Batching for Increasing LLM Inference Throughput | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters | Single/Multi-GPU Design | arXiv | Link | Link |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Single/Multi-GPU Design | SOSP | Link | Link |
| 2022 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Single/Multi-GPU Design | OSDI | Link | |

I/O-based Design (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | I/O-based Design | arXiv | Link | Link |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | I/O-based Design | arXiv | Link | |
| 2024 | Fast State Restoration in LLM Serving with HCache | I/O-based Design | arXiv | Link | |
| 2024 | Compute Or Load KV Cache? Why Not Both? | I/O-based Design | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | I/O-based Design | arXiv | Link | |
| 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | I/O-based Design | NeurIPS | Link | Link |

Heterogeneous Design (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Heterogeneous Design | arXiv | Link | |
| 2024 | FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous Design | arXiv | Link | |
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Heterogeneous Design | arXiv | Link | |
| 2024 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Heterogeneous Design | arXiv | Link | |
| 2024 | Fast Distributed Inference Serving for Large Language Models | Heterogeneous Design | arXiv | Link | |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | Heterogeneous Design | arXiv | Link | |
| 2023 | Stateful Large Language Model Serving with Pensieve | Heterogeneous Design | arXiv | Link | |

SSD-based Design (To Top👆🏻)

| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | SSD-based Design | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | SSD-based Design | ICML | Link | Link |

Datasets and Benchmarks

Please refer to our paper for detailed information on this section.

