
Support FP8 E4M3 KV Cache #2786

Merged 16 commits into sgl-project:main on Jan 13, 2025

Conversation

@bjmsong (Collaborator) commented Jan 8, 2025

This PR adds support for an FP8 E4M3 KV cache, adapted from vLLM.

## Usage

```shell
python examples/runtime/engine/offline_batch_inference.py \
  --model=${Llama-2-7b-chat-hf} \
  --kv-cache-dtype=fp8_e4m3 \
  --quantization-param-path=${sglang/test/srt/kv_cache_scales_llama2_7b_chat.json}
```

===============================
Prompt: Hello, my name is
Generated text:  Mary. I'm a mom of two and a wife of 15 years. I'm here to share my story about how I was able to heal from a toxic relationship and start living a healthier, happier life.

It all started when I was in my early 20s. I met this guy, let's call him John, and we hit it off immediately. He was charming, handsome, and seemed to be the perfect partner. However, over time, I started to notice that he was controlling and manipulative. He would make me feel guilty for sp
===============================
Prompt: The president of the United States is
Generated text:  the leader of the executive branch of the federal government and is one of the most powerful political figures in the world. The president is elected by the people through the Electoral College and serves a four-year term. The president's duties include:

1. Serving as the Commander-in-Chief of the armed forces
2. Nominating and, with the advice and consent of the Senate, appointing federal judges, including Supreme Court justices
3. Signing or vetoing bills passed by Congress
4. Conducting foreign policy and negotiating treaties on behalf of the
===============================
Prompt: The capital of France is
Generated text:  Paris. Located in the Île-de-France region, Paris is known for its stunning architecture, world-class museums, and vibrant cultural scene. From the iconic Eiffel Tower to the Notre-Dame Cathedral, there are countless landmarks and attractions to explore in the city. The city is also famous for its fashion, cuisine, and art, making it a top destination for tourists and travelers.

Here are some of the top attractions and experiences to enjoy in Paris:

1. Eiffel Tower: This iconic tower is one
===============================
Prompt: The future of AI is
Generated text:  both exciting and unsettling. Here are some potential implications of AI on society, both positive and negative:

Positive implications:

1. Improved productivity: AI can automate many routine and repetitive tasks, freeing up time for more creative and strategic work.
2. Enhanced decision-making: AI can analyze vast amounts of data and provide insights that can inform better decision-making in various industries, such as healthcare, finance, and education.
3. Improved customer experience: AI-powered chatbots
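Conceptually, the `--quantization-param-path` JSON supplies a precomputed (static) per-tensor scaling factor that maps KV values into E4M3's representable range of roughly ±448. Below is a minimal pure-Python sketch of that static quantization round trip; the function names are illustrative only (the real path runs in fused CUDA attention kernels, not Python), and `fp8_e4m3_round` is a software simulation of OCP FP8 E4M3 rounding.

```python
import math

def fp8_e4m3_round(x: float) -> float:
    """Round x to the nearest OCP FP8 E4M3 value, saturating at +/-448.
    Pure-Python simulation for illustration; hardware does this natively."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= 448.0:                # 448 is the largest finite E4M3 value
        return sign * 448.0
    e = max(math.floor(math.log2(a)), -6)   # clamp into the subnormal range
    step = 2.0 ** (e - 3)                   # grid spacing for 3 mantissa bits
    return sign * round(a / step) * step

def quantize_kv(values, scale):
    """Static quantization: divide by the calibrated scale, round to E4M3."""
    return [fp8_e4m3_round(v / scale) for v in values]

def dequantize_kv(quants, scale):
    """Recover approximate original values by multiplying the scale back."""
    return [q * scale for q in quants]

# Toy KV slice; the calibrated scale maps the observed max onto 448.
kv = [0.013, -0.42, 1.7, -3.1]
scale = max(abs(v) for v in kv) / 448.0
approx = dequantize_kv(quantize_kv(kv, scale), scale)
```

Because the scale is fixed ahead of time from calibration data, values outside the calibrated range simply saturate at ±448 before dequantization.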
## Checklist
  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@bjmsong force-pushed the e4m3_kvcache branch 2 times, most recently from 93e3980 to 399c1bd on January 11, 2025
@merrymercy merrymercy merged commit 0bb0f76 into sgl-project:main Jan 13, 2025
15 checks passed
@zhyncs (Member) commented Jan 14, 2025

@bjmsong @merrymercy I think the FP8 E4M3 static KV cache is not user-friendly; that's why @ispobock hadn't adopted it before. We should support an FP8 E4M3 online KV cache instead. cc @sleepcoo
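The distinction being raised here: a *static* scale is calibrated offline and shipped in a JSON file, whereas an *online* scale is derived at runtime from the values actually observed. A rough sketch of the online idea (the class and its API are hypothetical, not SGLang's or vLLM's actual implementation):

```python
class OnlineKVScale:
    """Sketch of online (dynamic) KV-cache scaling: instead of loading a
    calibrated scale from a JSON file, track the running absolute max of
    the K/V activations seen so far and derive the E4M3 scale from it."""

    E4M3_MAX = 448.0  # largest finite OCP FP8 E4M3 value

    def __init__(self):
        self.running_max = 0.0

    def update(self, values):
        """Fold a new batch of K or V activations into the running max."""
        batch_max = max(abs(v) for v in values)
        self.running_max = max(self.running_max, batch_max)
        return self.scale

    @property
    def scale(self):
        # Map the observed range onto [-448, 448]; fall back to 1.0
        # before any data has been seen.
        return self.running_max / self.E4M3_MAX if self.running_max else 1.0
```

The user-friendliness gain is that no per-model calibration file is needed; the tradeoff is that the scale can shift as new maxima arrive, which real kernels handle with care (e.g. by only ever growing the scale, as above).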

@zhyncs (Member) commented Jan 14, 2025

ref InternLM/lmdeploy#1377

@ispobock (Collaborator) commented Jan 14, 2025

@bjmsong Did you see performance improvement for this change? Could you share some benchmark results?

@bjmsong (Collaborator, Author) commented Jan 15, 2025

For Qwen2.5-1.5B-Instruct, the improvement is visible (each experiment runs twice):

| KV Cache | mmlu        | mgsm_en     |
|----------|-------------|-------------|
| FP16     | 0.531/0.562 | 0.336/0.348 |
| FP8-E5M2 | 0.016/0.031 | 0.016/0.012 |
| FP8-E4M3 | 0.562/0.438 | 0.028/0.020 |

For Meta-Llama-3-8B-Instruct, the improvement is less visible, because FP8-E5M2 already gives decent accuracy.

| KV Cache | mmlu        | mgsm_en     |
|----------|-------------|-------------|
| FP16     | 0.656/0.656 | 0.808/0.812 |
| FP8-E5M2 | 0.656/0.688 | 0.756/0.768 |
| FP8-E4M3 | 0.703/0.688 | 0.788/0.800 |
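One way to see why E4M3 can recover accuracy where E5M2 degrades: E5M2 spends its bits on range (2 mantissa bits, max ±57344) while E4M3 spends them on precision (3 mantissa bits, max ±448), and KV activations, once scaled, mostly need precision rather than range. A rough pure-Python simulation of the two formats (illustrative only, not the actual kernels):

```python
import math

def fp8_round(x, mant_bits, min_exp, max_val):
    """Round x to the nearest FP8 value with the given mantissa width,
    saturating at +/-max_val. Software simulation for illustration."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= max_val:
        return sign * max_val
    e = max(math.floor(math.log2(a)), min_exp)
    step = 2.0 ** (e - mant_bits)   # grid spacing in this binade
    return sign * round(a / step) * step

def e4m3(x):  # 3 mantissa bits, narrow range
    return fp8_round(x, 3, -6, 448.0)

def e5m2(x):  # 2 mantissa bits, wide range
    return fp8_round(x, 2, -14, 57344.0)

# E4M3's extra mantissa bit halves the rounding grid at every magnitude.
vals = [0.29, 1.1, 1.9, 3.3]
err_e4m3 = sum(abs(v - e4m3(v)) for v in vals)
err_e5m2 = sum(abs(v - e5m2(v)) for v in vals)
```

On these sample values the total E4M3 rounding error is several times smaller than E5M2's, at the cost of a far smaller representable range.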

Reproduce:

python test/srt/test_fp8_kvcache.py
