📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism etc. 🎉🎉
Light-field imaging application for plenoptic cameras
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Light field geometry estimator for plenoptic cameras
xKV: Cross-Layer SVD for KV-Cache Compression
An efficient and scalable attention module designed to reduce memory usage and improve inference speed in large language models. Implements Multi-Head Latent Attention (MLA) as a drop-in replacement for traditional multi-head attention (MHA); see the MLA sketch after this list.
Latentformer is a transformer model with latent attention designed for efficient training. It features learnable positional embeddings, rotary position encoding, and MLA to improve training and inference speed while maintaining model quality.
A minimal Transformer implementation for quickly assembling various modern Transformer architectures.
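For readers unfamiliar with MLA, the following is a minimal, illustrative sketch of the core idea behind several repositories above: keys and values are compressed into a shared low-rank latent that can be cached instead of full per-head K/V tensors, and expanded back at attention time. It assumes PyTorch; all class and parameter names (e.g. `MultiHeadLatentAttention`, `d_latent`) are hypothetical and not taken from any listed repository, and production MLA implementations additionally handle decoupled rotary embeddings and latent-only KV caching.

```python
# Minimal MLA sketch (illustrative only, not any listed repo's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_latent: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Keys/values are compressed into a shared low-rank latent ...
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # ... and expanded back to per-head keys/values at attention time.
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x)
        kv_latent = self.kv_down(x)  # (b, t, d_latent): what a KV cache would store
        k = self.k_up(kv_latent)
        v = self.v_up(kv_latent)
        # Reshape to (b, n_heads, t, d_head) and run standard causal attention.
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)
```

Because only `kv_latent` needs to be cached during generation, the cache footprint scales with `d_latent` rather than with `n_heads * d_head`, which is the memory saving these projects target.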