
Support FP8 E4M3 KV Cache #2786

Merged 16 commits into sgl-project:main on Jan 13, 2025

Conversation

@bjmsong (Collaborator) commented Jan 8, 2025

This PR adds support for an FP8 E4M3 KV cache, adapted from vLLM.

## Usage

```shell
python examples/runtime/engine/offline_batch_inference.py \
  --model=${Llama-2-7b-chat-hf} \
  --kv-cache-dtype=fp8_e4m3 \
  --quantization-param-path=${sglang/test/srt/kv_cache_scales_llama2_7b_chat.json}
```

===============================
Prompt: Hello, my name is
Generated text:  Mary. I'm a mom of two and a wife of 15 years. I'm here to share my story about how I was able to heal from a toxic relationship and start living a healthier, happier life.

It all started when I was in my early 20s. I met this guy, let's call him John, and we hit it off immediately. He was charming, handsome, and seemed to be the perfect partner. However, over time, I started to notice that he was controlling and manipulative. He would make me feel guilty for sp
===============================
Prompt: The president of the United States is
Generated text:  the leader of the executive branch of the federal government and is one of the most powerful political figures in the world. The president is elected by the people through the Electoral College and serves a four-year term. The president's duties include:

1. Serving as the Commander-in-Chief of the armed forces
2. Nominating and, with the advice and consent of the Senate, appointing federal judges, including Supreme Court justices
3. Signing or vetoing bills passed by Congress
4. Conducting foreign policy and negotiating treaties on behalf of the
===============================
Prompt: The capital of France is
Generated text:  Paris. Located in the Île-de-France region, Paris is known for its stunning architecture, world-class museums, and vibrant cultural scene. From the iconic Eiffel Tower to the Notre-Dame Cathedral, there are countless landmarks and attractions to explore in the city. The city is also famous for its fashion, cuisine, and art, making it a top destination for tourists and travelers.

Here are some of the top attractions and experiences to enjoy in Paris:

1. Eiffel Tower: This iconic tower is one
===============================
Prompt: The future of AI is
Generated text:  both exciting and unsettling. Here are some potential implications of AI on society, both positive and negative:

Positive implications:

1. Improved productivity: AI can automate many routine and repetitive tasks, freeing up time for more creative and strategic work.
2. Enhanced decision-making: AI can analyze vast amounts of data and provide insights that can inform better decision-making in various industries, such as healthcare, finance, and education.
3. Improved customer experience: AI-powered chatbots
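Conceptually, the `--quantization-param-path` JSON supplies a precomputed (static) per-tensor scaling factor that maps KV values into E4M3's representable range of roughly ±448. Below is a minimal pure-Python sketch of that static quantization round trip; the function names are illustrative only (the real path runs in fused CUDA attention kernels, not Python), and `fp8_e4m3_round` is a software simulation of OCP FP8 E4M3 rounding.

```python
import math

def fp8_e4m3_round(x: float) -> float:
    """Round x to the nearest OCP FP8 E4M3 value, saturating at +/-448.
    Pure-Python simulation for illustration; hardware does this natively."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= 448.0:                # 448 is the largest finite E4M3 value
        return sign * 448.0
    e = max(math.floor(math.log2(a)), -6)   # clamp into the subnormal range
    step = 2.0 ** (e - 3)                   # grid spacing for 3 mantissa bits
    return sign * round(a / step) * step

def quantize_kv(values, scale):
    """Static quantization: divide by the calibrated scale, round to E4M3."""
    return [fp8_e4m3_round(v / scale) for v in values]

def dequantize_kv(quants, scale):
    """Recover approximate original values by multiplying the scale back."""
    return [q * scale for q in quants]

# Toy KV slice; the calibrated scale maps the observed max onto 448.
kv = [0.013, -0.42, 1.7, -3.1]
scale = max(abs(v) for v in kv) / 448.0
approx = dequantize_kv(quantize_kv(kv, scale), scale)
```

Because the scale is fixed ahead of time from calibration data, values outside the calibrated range simply saturate at ±448 before dequantization.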
## Checklist
  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@bjmsong force-pushed the e4m3_kvcache branch 2 times, most recently from 93e3980 to 399c1bd on January 11, 2025
@merrymercy merrymercy merged commit 0bb0f76 into sgl-project:main Jan 13, 2025
15 checks passed
@zhyncs (Member) commented Jan 14, 2025

@bjmsong @merrymercy I think the FP8 E4M3 static KV cache is not user-friendly; that's why @ispobock hadn't adopted it before. We should support an FP8 E4M3 online KV cache instead. cc @sleepcoo
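The distinction being raised here: a *static* scale is calibrated offline and shipped in a JSON file, whereas an *online* scale is derived at runtime from the values actually observed. A rough sketch of the online idea (the class and its API are hypothetical, not SGLang's or vLLM's actual implementation):

```python
class OnlineKVScale:
    """Sketch of online (dynamic) KV-cache scaling: instead of loading a
    calibrated scale from a JSON file, track the running absolute max of
    the K/V activations seen so far and derive the E4M3 scale from it."""

    E4M3_MAX = 448.0  # largest finite OCP FP8 E4M3 value

    def __init__(self):
        self.running_max = 0.0

    def update(self, values):
        """Fold a new batch of K or V activations into the running max."""
        batch_max = max(abs(v) for v in values)
        self.running_max = max(self.running_max, batch_max)
        return self.scale

    @property
    def scale(self):
        # Map the observed range onto [-448, 448]; fall back to 1.0
        # before any data has been seen.
        return self.running_max / self.E4M3_MAX if self.running_max else 1.0
```

The user-friendliness gain is that no per-model calibration file is needed; the tradeoff is that the scale can shift as new maxima arrive, which real kernels handle with care (e.g. by only ever growing the scale, as above).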

@zhyncs (Member) commented Jan 14, 2025

ref InternLM/lmdeploy#1377

@ispobock (Collaborator) commented Jan 14, 2025

@bjmsong Did you see performance improvement for this change? Could you share some benchmark results?

@bjmsong (Collaborator, Author) commented Jan 15, 2025

For Qwen2.5-1.5B-Instruct, the improvement is visible (each experiment runs twice):

| KV Cache | mmlu        | mgsm_en     |
|----------|-------------|-------------|
| FP16     | 0.531/0.562 | 0.336/0.348 |
| FP8-E5M2 | 0.016/0.031 | 0.016/0.012 |
| FP8-E4M3 | 0.562/0.438 | 0.028/0.020 |

For Meta-Llama-3-8B-Instruct, the improvement is less visible, because FP8-E5M2 already gives decent accuracy.

| KV Cache | mmlu        | mgsm_en     |
|----------|-------------|-------------|
| FP16     | 0.656/0.656 | 0.808/0.812 |
| FP8-E5M2 | 0.656/0.688 | 0.756/0.768 |
| FP8-E4M3 | 0.703/0.688 | 0.788/0.800 |
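One way to see why E4M3 can recover accuracy where E5M2 degrades: E5M2 spends its bits on range (2 mantissa bits, max ±57344) while E4M3 spends them on precision (3 mantissa bits, max ±448), and KV activations, once scaled, mostly need precision rather than range. A rough pure-Python simulation of the two formats (illustrative only, not the actual kernels):

```python
import math

def fp8_round(x, mant_bits, min_exp, max_val):
    """Round x to the nearest FP8 value with the given mantissa width,
    saturating at +/-max_val. Software simulation for illustration."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = abs(x)
    if a >= max_val:
        return sign * max_val
    e = max(math.floor(math.log2(a)), min_exp)
    step = 2.0 ** (e - mant_bits)   # grid spacing in this binade
    return sign * round(a / step) * step

def e4m3(x):  # 3 mantissa bits, narrow range
    return fp8_round(x, 3, -6, 448.0)

def e5m2(x):  # 2 mantissa bits, wide range
    return fp8_round(x, 2, -14, 57344.0)

# E4M3's extra mantissa bit halves the rounding grid at every magnitude.
vals = [0.29, 1.1, 1.9, 3.3]
err_e4m3 = sum(abs(v - e4m3(v)) for v in vals)
err_e5m2 = sum(abs(v - e5m2(v)) for v in vals)
```

On these sample values the total E4M3 rounding error is several times smaller than E5M2's, at the cost of a far smaller representable range.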

Reproduce:

python test/srt/test_fp8_kvcache.py
