A Survey on Large Language Model Acceleration based on KV Cache Management [PDF]
Haoyang Li 1, Yiming Li 2, Anxin Tian 2, Tianhao Tang 2, Zhanchao Xu 4, Xuejia Chen 4, Nicole Hu 3, Wei Dong 5, Qing Li 1, Lei Chen 2
1Hong Kong Polytechnic University, 2Hong Kong University of Science and Technology, 3The Chinese University of Hong Kong, 4Huazhong University of Science and Technology, 5Nanyang Technological University.
- This repository collects papers on KV cache management for LLM acceleration. The survey will be updated regularly.
- If you find this survey helpful for your work, please consider citing it:
@article{li2024surveylargelanguagemodel,
  title={A Survey on Large Language Model Acceleration based on KV Cache Management},
  author={Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen},
  journal={arXiv preprint arXiv:2412.19442},
  year={2024}
}
- If you would like your paper or any corrections to be included in this survey and repository, please feel free to send an email to haoyang-comp.li@polyu.edu.hk or open an issue with your paper's title, category, and a brief summary highlighting its key techniques. Thank you!
- Awesome-KV-Cache-Management
  - Token-level Optimization
  - Model-level Optimization
  - System-level Optimization
  - Datasets and Benchmarks
Static KV Cache Selection (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | Static KV Cache Selection | ICLR | Link | |
2024 | SnapKV: LLM Knows What You are Looking for Before Generation | Static KV Cache Selection | NeurIPS | Link | Link |
2024 | In-context KV-Cache Eviction for LLMs via Attention-Gate | Static KV Cache Selection | arXiv | Link |
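The methods above pick which KV entries to keep in a single pass, typically right after prefill, using the attention that a small observation window of late prompt tokens pays to the earlier prefix. As a rough illustration of this family (not the exact algorithm of any paper listed here; the pooling rule, budget split, and names are assumptions), a minimal NumPy sketch:

```python
import numpy as np

def static_select(attn, budget, window):
    """One-shot KV selection after prefill: the last `window` prompt tokens
    'vote' (via their attention weights) for which earlier prefix entries
    to keep, and the choice is never revisited during decoding.
    `attn` is a (seq_len, seq_len) prefill attention map for one head."""
    seq_len = attn.shape[0]
    votes = attn[-window:, : seq_len - window].sum(axis=0)      # observation-window votes
    keep_prefix = np.sort(np.argsort(votes)[-(budget - window):])
    keep_window = np.arange(seq_len - window, seq_len)          # always keep the window itself
    return np.concatenate([keep_prefix, keep_window])

rng = np.random.default_rng(0)
attn = rng.random((512, 512))
print(static_select(attn, budget=128, window=32).shape)  # (128,)
```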
Dynamic Selection with Permanent Eviction (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | Dynamic Selection with Permanent Eviction | MLSys | Link | |
2024 | BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference | Dynamic Selection with Permanent Eviction | arXiv | Link | Link |
2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | Dynamic Selection with Permanent Eviction | ACL | Link | Link |
2023 | H2O: heavy-hitter oracle for efficient generative inference of large language models | Dynamic Selection with Permanent Eviction | NeurIPS | Link | Link |
2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | Dynamic Selection with Permanent Eviction | NeurIPS | Link |
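These approaches evict KV entries during decoding and never bring them back, usually protecting the most recent tokens plus the "heavy hitters" that have accumulated the most attention. A minimal NumPy sketch of that generic recipe (the budget split, scoring statistic, and function names are illustrative assumptions, not any specific paper's algorithm):

```python
import numpy as np

def evict_heavy_hitters(keys, values, attn_scores, budget, recent):
    """Keep the `recent` most recent tokens plus the highest-scoring older
    tokens so that at most `budget` KV entries survive; everything else is
    dropped permanently.

    keys, values : (seq_len, head_dim) cached K/V for one head
    attn_scores  : (seq_len,) accumulated attention each cached token has
                   received so far (the 'heavy hitter' statistic)
    """
    assert budget > recent
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, attn_scores
    recent_idx = np.arange(seq_len - recent, seq_len)            # always protected
    older_idx = np.arange(seq_len - recent)
    top_older = older_idx[np.argsort(attn_scores[older_idx])[-(budget - recent):]]
    keep = np.sort(np.concatenate([top_older, recent_idx]))
    return keys[keep], values[keep], attn_scores[keep]

rng = np.random.default_rng(0)
K, V = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
K2, V2, s2 = evict_heavy_hitters(K, V, rng.random(128), budget=32, recent=8)
print(K2.shape)  # (32, 64)
```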
Dynamic Selection without Permanent Eviction (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Dynamic Selection without Permanent Eviction | ICML | Link | Link |
2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | |
2024 | Squeezed Attention: Accelerating Long Context Length LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
2024 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
2024 | Human-like Episodic Memory for Infinite Context LLMs | Dynamic Selection without Permanent Eviction | arXiv | Link | |
2024 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Dynamic Selection without Permanent Eviction | arXiv | Link |
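Here nothing is permanently dropped: the full cache is retained (often off the critical path), and at each decoding step only the entries most relevant to the current query are pulled into attention. A sketch of query-aware page selection in the spirit of Quest-style methods, where per-page min/max key summaries give a cheap upper bound on attention; all names and shapes are illustrative:

```python
import numpy as np

def select_pages(query, key_pages, top_k):
    """Score each cached page of keys by an upper bound on q.k (using
    per-page elementwise min/max summaries) and load the full KV of only
    the top-k pages for attention; unselected pages stay cached.

    query:     (head_dim,)
    key_pages: list of (page_len, head_dim) arrays
    """
    bounds = []
    for page in key_pages:
        k_min, k_max = page.min(axis=0), page.max(axis=0)
        bounds.append(np.maximum(query * k_min, query * k_max).sum())
    return np.argsort(bounds)[-top_k:][::-1]          # indices of pages to load

rng = np.random.default_rng(0)
pages = [rng.standard_normal((16, 64)) for _ in range(10)]
print(select_pages(rng.standard_normal(64), pages, top_k=3))
```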
Layer-wise Budget Allocation (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | Layer-wise Budget Allocation | arXiv | Link | Link |
2024 | PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | Layer-wise Budget Allocation | Findings | Link | Link |
2024 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | Layer-wise Budget Allocation | ICLR sub. | Link | |
2024 | PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation | Layer-wise Budget Allocation | arXiv | Link | Link |
2024 | SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction | Layer-wise Budget Allocation | arXiv | Link | Link |
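Layer-wise allocation methods observe that attention tends to be broad in shallow layers and more concentrated in deep layers, so they give lower layers a larger share of the total KV budget. A toy allocation schedule under that assumption (the linear, pyramid-shaped rule below is illustrative, not taken from any particular paper):

```python
def pyramidal_budgets(total_budget: int, num_layers: int, ratio: float = 4.0):
    """Split `total_budget` KV slots across layers so that the first layer
    gets roughly `ratio` times the budget of the last layer, decreasing
    linearly in between (an illustrative schedule)."""
    last = 2 * total_budget / (num_layers * (1 + ratio))
    first = ratio * last
    step = (first - last) / max(num_layers - 1, 1)
    return [round(first - i * step) for i in range(num_layers)]

print(pyramidal_budgets(total_budget=4096, num_layers=8))
# e.g. [819, 731, 644, 556, 468, 380, 293, 205] -- sums to ~4096
```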
Head-wise Budget Allocation (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | Head-wise Budget Allocation | arXiv | Link | |
2024 | Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective | Head-wise Budget Allocation | ICLR sub. | Link | |
2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Head-wise Budget Allocation | arXiv | Link | |
2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Head-wise Budget Allocation | arXiv | Link | |
2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Head-wise Budget Allocation | arXiv | Link | Link |
2024 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Head-wise Budget Allocation | arXiv | Link | Link |
Intra-layer Merging (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Compressed Context Memory for Online Language Model Interaction | Intra-layer Merging | ICLR | Link | Link |
2024 | LoMA: Lossless Compressed Memory Attention | Intra-layer Merging | arXiv | Link | |
2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Intra-layer Merging | ICML | Link | Link |
2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | Intra-layer Merging | ICML | Link | Link |
2024 | D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | Intra-layer Merging | arXiv | Link | |
2024 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Intra-layer Merging | arXiv | Link | Link |
2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Intra-layer Merging | EMNLP | Link | Link |
2024 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Intra-layer Merging | arXiv | Link | |
2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | Intra-layer Merging | arXiv | Link |
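Rather than discarding low-importance entries outright, merging methods fold them into the entries that are kept, e.g., by averaging each evicted key/value into its most similar retained neighbor. A minimal sketch of that idea (cosine similarity and the running-mean update are assumptions made for illustration, not a rule from the papers above):

```python
import numpy as np

def merge_evicted(keys, values, keep_idx):
    """Fold each non-selected entry into its most similar kept key
    (cosine similarity measured against the original kept keys) using a
    running-mean update.

    keys, values : (seq_len, head_dim); keep_idx : indices to retain
    """
    evict_idx = np.setdiff1d(np.arange(keys.shape[0]), keep_idx)
    K_keep, V_keep = keys[keep_idx].copy(), values[keep_idx].copy()
    counts = np.ones(len(keep_idx))
    Kn = keys[keep_idx] / np.linalg.norm(keys[keep_idx], axis=1, keepdims=True)
    for i in evict_idx:
        j = int(np.argmax(Kn @ (keys[i] / np.linalg.norm(keys[i]))))
        counts[j] += 1
        K_keep[j] += (keys[i] - K_keep[j]) / counts[j]   # incremental mean
        V_keep[j] += (values[i] - V_keep[j]) / counts[j]
    return K_keep, V_keep

rng = np.random.default_rng(0)
K, V = rng.standard_normal((64, 32)), rng.standard_normal((64, 32))
K2, V2 = merge_evicted(K, V, keep_idx=np.arange(0, 64, 4))
print(K2.shape)  # (16, 32)
```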
Cross-layer Merging (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | Cross-layer Merging | arXiv | Link | Link |
2024 | KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cross-Layer Sharing | Cross-layer Merging | arXiv | Link | Link |
Fixed-precision Quantization (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Fixed-precision Quantization | arXiv | Link | Link |
2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Fixed-precision Quantization | arXiv | Link | |
2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Fixed-precision Quantization | ICML | Link | Link |
2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Fixed-precision Quantization | NeurIPS | Link | Link |
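Fixed-precision methods store every cached K/V element at the same low bit-width, with a scale (and zero-point) kept per token or per channel. Below is a textbook per-token asymmetric uniform quantizer as a generic illustration (not the exact scheme of any listed paper):

```python
import numpy as np

def quantize_per_token(x, num_bits=8):
    """Uniform asymmetric quantization with one scale/zero-point per token
    (row). x: (seq_len, head_dim) cached K or V for one head."""
    qmax = 2 ** num_bits - 1
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    scale = np.where(x_max > x_min, (x_max - x_min) / qmax, 1e-8)
    zero = np.round(-x_min / scale)
    q = np.clip(np.round(x / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
K = rng.standard_normal((16, 128)).astype(np.float32)
qK, s, z = quantize_per_token(K)
print(np.abs(dequantize(qK, s, z) - K).max())   # small reconstruction error
```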
Mixed-precision Quantization (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Mixed-precision Quantization | arXiv | Link | Link |
2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | Mixed-precision Quantization | arXiv | Link | Link |
2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | Mixed-precision Quantization | arXiv | Link | Link |
2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Mixed-precision Quantization | arXiv | Link | Link |
2024 | WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | Mixed-precision Quantization | arXiv | Link | |
2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | Mixed-precision Quantization | arXiv | Link | Link |
2024 | No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | Mixed-precision Quantization | arXiv | Link | |
2024 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | Mixed-precision Quantization | arXiv | Link | |
2024 | ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Mixed-precision Quantization | arXiv | Link | Link |
2024 | PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs | Mixed-precision Quantization | arXiv | Link | Link |
2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Mixed-precision Quantization | arXiv | Link |
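Mixed-precision methods spend bits unevenly, e.g., keeping outlier channels, pivot tokens, or a recent window in full precision while quantizing everything else aggressively. The sketch below keeps the highest-scoring tokens in full precision and pushes the rest to a low bit-width; the importance score, split ratio, and cache layout are illustrative assumptions:

```python
import numpy as np

def mixed_precision_cache(keys, importance, low_bits=2, keep_ratio=0.1):
    """Keep the most 'important' tokens in full precision and quantize the
    rest to `low_bits` (per-token, asymmetric). Illustrative only."""
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    keep = np.argsort(importance)[-n_keep:]            # full-precision tokens
    quant = np.setdiff1d(np.arange(seq_len), keep)     # low-bit tokens
    qmax = 2 ** low_bits - 1
    lo = keys[quant].min(axis=-1, keepdims=True)
    hi = keys[quant].max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1e-8)
    q = np.clip(np.round((keys[quant] - lo) / scale), 0, qmax).astype(np.uint8)
    # dequantize the low-bit part later as q * scale + lo
    return {"fp_idx": keep, "fp": keys[keep],
            "q_idx": quant, "q": q, "scale": scale, "zero": lo}

rng = np.random.default_rng(0)
K = rng.standard_normal((64, 128)).astype(np.float32)
cache = mixed_precision_cache(K, importance=rng.random(64), low_bits=2)
print(len(cache["fp_idx"]), cache["q"].dtype)   # 6 uint8
```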
Outlier Redistribution (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Massive Activations in Large Language Models | Outlier Redistribution | arXiv | Link | Link |
2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Outlier Redistribution | arXiv | Link | Link |
2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Outlier Redistribution | arXiv | Link | Link |
2024 | SpinQuant: LLM Quantization with Learned Rotations | Outlier Redistribution | arXiv | Link | Link |
2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | Outlier Redistribution | NeurIPS | Link | Link |
2024 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Outlier Redistribution | ICML | Link | Link |
2024 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | Outlier Redistribution | EMNLP | Link | Link |
2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
2024 | FlatQuant: Flatness Matters for LLM Quantization | Outlier Redistribution | arXiv | Link | Link |
2024 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration | Outlier Redistribution | MLSys | Link | Link |
2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
2023 | Training Transformers with 4-bit Integers | Outlier Redistribution | NeurIPS | Link | Link |
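A common trick in this family is to migrate activation outliers into the weights with a mathematically equivalent per-channel rescaling, so both sides become easier to quantize. A minimal sketch of that scale migration, following the SmoothQuant-style rule s_j = max|X_j|^α / max|W_j|^(1-α) (the α value and toy matrices are illustrative):

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """Per-channel smoothing factors that migrate activation outliers into
    the weights; the transform is exact because X @ W == (X / s) @ (s[:, None] * W)."""
    act_max = np.abs(X).max(axis=0)
    w_max = np.abs(W).max(axis=1)
    s = act_max ** alpha / np.maximum(w_max, 1e-8) ** (1 - alpha)
    return np.maximum(s, 1e-8)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)); X[:, 3] *= 50        # one outlier channel
W = rng.standard_normal((8, 16))
s = smooth_scales(X, W)
print(np.allclose(X @ W, (X / s) @ (s[:, None] * W)))  # True: equivalent output
```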
Singular Value Decomposition (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Effectively Compress KV Heads for LLM | Singular Value Decomposition | arXiv | Link | |
2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Singular Value Decomposition | arXiv | Link | Link |
2024 | Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference | Singular Value Decomposition | arXiv | Link | |
2024 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Singular Value Decomposition | arXiv | Link | |
2024 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Singular Value Decomposition | arXiv | Link | Link |
2024 | Palu: Compressing KV-Cache with Low-Rank Projection | Singular Value Decomposition | arXiv | Link | Link |
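SVD-based methods store the KV cache (or the projections that produce it) in a low-rank factored form. The sketch below applies a truncated SVD directly to a cached key matrix to show the storage trade-off; real systems usually factor the projection weights or work per head and layer, and the rank and shapes here are arbitrary:

```python
import numpy as np

def low_rank_kv(K, rank):
    """Compress a cached key matrix K (seq_len, head_dim) with a truncated
    SVD: store A (seq_len, rank) and B (rank, head_dim) instead of K."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (seq_len, rank)
    B = Vt[:rank]                     # (rank, head_dim)
    return A, B

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 128))
A, B = low_rank_kv(K, rank=32)
print(np.linalg.norm(K - A @ B) / np.linalg.norm(K))  # relative reconstruction error
```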
Tensor Decomposition (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | Tensor Decomposition | ACL | Link | Link |
Learned Low-rank Approximation (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | Learned Low-rank Approximation | arXiv | Link | Link |
2024 | MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection | Learned Low-rank Approximation | arXiv | Link |
Intra-Layer Grouping (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2019 | Fast Transformer Decoding: One Write-Head is All You Need | Intra-Layer Grouping | arXiv | Link | |
2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Intra-Layer Grouping | EMNLP | Link | Link |
2024 | Optimised Grouped-Query Attention Mechanism for Transformers | Intra-Layer Grouping | ICML | Link | |
2024 | Weighted Grouped Query Attention in Transformers | Intra-Layer Grouping | arXiv | Link | |
2024 | QCQA: Quality and Capacity-aware grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Non-official Link |
2024 | Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Link |
2023 | GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | Intra-Layer Grouping | NeurIPS | Link |
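Intra-layer grouping shrinks the cache at the architecture level: several query heads share one key/value head (grouped-query attention), with multi-query attention as the extreme case of a single shared KV head. A NumPy sketch of the sharing pattern (no causal mask, purely to show which cache each query head reads):

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """Q: (num_q_heads, seq, d); K, V: (num_kv_heads, seq, d) with
    num_kv_heads dividing num_q_heads. Each group of num_q_heads // num_kv_heads
    query heads reads the same KV head, shrinking the KV cache by that
    factor (GQA; MQA when num_kv_heads == 1)."""
    num_q_heads, seq, d = Q.shape
    group = num_q_heads // K.shape[0]
    out = np.empty_like(Q)
    for h in range(num_q_heads):
        kv = h // group                               # shared KV head for this query head
        scores = Q[h] @ K[kv].T / np.sqrt(d)          # no causal mask, for brevity
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ V[kv]
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16, 64))
K, V = rng.standard_normal((2, 16, 64)), rng.standard_normal((2, 16, 64))
print(grouped_query_attention(Q, K, V).shape)  # (8, 16, 64) with only 2 KV heads cached
```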
Cross-Layer Sharing (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Cross-Layer Sharing | arXiv | Link | Non-official Link |
2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | Cross-Layer Sharing | ACL | Link | Link |
2024 | Beyond KV Caching: Shared Attention for Efficient LLMs | Cross-Layer Sharing | arXiv | Link | Link |
2024 | MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | Cross-Layer Sharing | arXiv | Link | Link |
2024 | Cross-layer Attention Sharing for Large Language Models | Cross-Layer Sharing | arXiv | Link | |
2024 | A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference | Cross-Layer Sharing | arXiv | Link | |
2024 | Lossless KV Cache Compression to 2% | Cross-Layer Sharing | arXiv | Link | |
2024 | DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion | Cross-Layer Sharing | NeurIPS | Link | |
2024 | Value Residual Learning For Alleviating Attention Concentration In Transformers | Cross-Layer Sharing | arXiv | Link | Link |
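Cross-layer sharing computes fresh K/V projections only at some layers and lets the layers in between attend to a neighbor's cache. A tiny sketch of one possible sharing map (the every-other-layer pattern is an assumption for illustration, not a scheme from the papers above):

```python
def build_kv_sharing_map(num_layers: int, share_every: int = 2):
    """Map each layer to the layer whose KV cache it reuses: fresh K/V are
    computed only every `share_every` layers, and the layers in between
    attend to the most recently computed cache."""
    return {layer: (layer // share_every) * share_every for layer in range(num_layers)}

print(build_kv_sharing_map(8))  # {0: 0, 1: 0, 2: 2, 3: 2, 4: 4, 5: 4, 6: 6, 7: 6}
```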
Enhanced Attention (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Enhanced Attention | arXiv | Link | Link |
2022 | Transformer Quality in Linear Time | Enhanced Attention | ICML | Link | |
2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Enhanced Attention | arXiv | Link |
Augmented Architecture (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | Augmented Architecture | arXiv | Link | Link |
2024 | Long-Context Language Modeling with Parallel Context Encoding | Augmented Architecture | ACL | Link | Link |
2024 | XC-CACHE: Cross-Attending to Cached Context for Efficient LLM Inference | Augmented Architecture | Findings | Link | |
2024 | Block Transformer: Global-to-Local Language Modeling for Fast Inference | Augmented Architecture | arXiv | Link | Link |
Adaptive Sequence Processing Architecture (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2023 | RWKV: Reinventing RNNs for the Transformer Era | Adaptive Sequence Processing Architecture | Findings | Link | Link |
2024 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
2023 | Retentive Network: A Successor to Transformer for Large Language Models | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
2024 | MCSD: An Efficient Language Model with Diverse Fusion | Adaptive Sequence Processing Architecture | arXiv | Link |
Hybrid Architecture (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling | Hybrid Architecture | IOS Press | Link | |
2024 | GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression | Hybrid Architecture | arXiv | Link | Link |
2024 | RecurFormer: Not All Transformer Heads Need Self-Attention | Hybrid Architecture | arXiv | Link |
Architectural Design (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Architectural Design | arXiv | Link | Link |
2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Architectural Design | arXiv | Link | |
2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Architectural Design | SOSP | Link | Link |
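PagedAttention-style systems manage the KV cache like virtual memory: tokens live in fixed-size physical blocks, and each request keeps a block table from logical positions to blocks, which removes fragmentation and enables prefix sharing. A toy Python sketch of the block-table bookkeeping (deliberately simplified; these are not vLLM's actual data structures or APIs):

```python
import numpy as np

class PagedKVCache:
    """Toy block-table KV cache: KV entries live in fixed-size physical
    blocks, and each sequence keeps a block table mapping logical
    positions to blocks."""
    def __init__(self, num_blocks, block_size, head_dim):
        self.block_size = block_size
        self.k = np.zeros((num_blocks, block_size, head_dim), np.float32)
        self.v = np.zeros_like(self.k)
        self.free = list(range(num_blocks))   # pool of unused physical blocks
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> number of cached tokens

    def append(self, seq_id, k_vec, v_vec):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full: grab a new one
            table.append(self.free.pop())
        blk, off = table[n // self.block_size], n % self.block_size
        self.k[blk, off], self.v[blk, off] = k_vec, v_vec
        self.lengths[seq_id] = n + 1

rng = np.random.default_rng(0)
cache = PagedKVCache(num_blocks=8, block_size=4, head_dim=64)
for _ in range(6):                            # cache 6 decoded tokens for one request
    cache.append("req-0", rng.standard_normal(64), rng.standard_normal(64))
print(cache.tables["req-0"], cache.lengths["req-0"])  # two physical blocks, 6 tokens
```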
Prefix-aware Design (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Prefix-aware Design | ACL | Link | Link |
2024 | MemServe: Flexible MemPool for Building Disaggregated LLM Serving with Caching | Prefix-aware Design | arXiv | Link
Prefix-aware Scheduling (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Prefix-aware Scheduling | arXiv | Link | |
2024 | SGLang: Efficient Execution of Structured Language Model Programs | Prefix-aware Scheduling | NeurIPS | Link | Link |
Preemptive and Fairness-oriented Scheduling (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Fast Distributed Inference Serving for Large Language Models | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Preemptive and Fairness-oriented Scheduling | arXiv | Link
Layer-specific and Hierarchical Scheduling (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Layer-specific and Hierarchical Scheduling | arXiv | Link | Link |
2024 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Layer-specific and Hierarchical Scheduling | USENIX ATC | Link | |
2024 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Layer-specific and Hierarchical Scheduling | ISCA | Link | |
2024 | Fast Inference for Augmented Large Language Models | Layer-specific and Hierarchical Scheduling | arXiv | Link |
Single/Multi-GPU Design (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Single/Multi-GPU Design | arXiv | Link | Link |
2024 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Single/Multi-GPU Design | arXiv | Link | |
2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Single/Multi-GPU Design | OSDI | Link | Link |
2024 | Multi-Bin Batching for Increasing LLM Inference Throughput | Single/Multi-GPU Design | arXiv | Link | |
2024 | Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters | Single/Multi-GPU Design | arXiv | Link | Link |
2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Single/Multi-GPU Design | SOSP | Link | Link |
2022 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Single/Multi-GPU Design | OSDI | Link |
I/O-based Design (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | I/O-based Design | arXiv | Link | Link |
2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | I/O-based Design | arXiv | Link | |
2024 | Fast State Restoration in LLM Serving with HCache | I/O-based Design | arXiv | Link | |
2024 | Compute Or Load KV Cache? Why Not Both? | I/O-based Design | arXiv | Link | |
2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | I/O-based Design | arXiv | Link | |
2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | I/O-based Design | NeurIPS | Link | Link |
Heterogeneous Design (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Heterogeneous Design | arXiv | Link | |
2024 | FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous Design | arXiv | Link | |
2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Heterogeneous Design | arXiv | Link | |
2024 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Heterogeneous Design | arXiv | Link | |
2024 | Fast Distributed Inference Serving for Large Language Models | Heterogeneous Design | arXiv | Link | |
2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | Heterogeneous Design | arXiv | Link | |
2023 | Stateful Large Language Model Serving with Pensieve | Heterogeneous Design | arXiv | Link |
SSD-based Design (To Top)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | SSD-based Design | arXiv | Link | |
2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | SSD-based Design | ICML | Link | Link
Please refer to our paper for detailed information on this section.