Optimize speculative decoding PVC memory usage #10329

cyita · 2024-03-05T08:56:17Z

Description

Optimize speculative decoding PVC memory usage.

3. Summary of the change

Call empty_cache when necessary.
Fix an issue where the draft and verify models repeatedly extended the KV cache, which wasted memory.
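
The PR text does not include the code, so the following is only a rough sketch of one plausible reading of the KV cache fix: check the remaining headroom before growing the buffer, so that the draft and verify passes share one allocation instead of each extending it. The names `KVCache`, `KV_ALLOC_BLOCK`, `reserve`, and `run_speculative_steps` are hypothetical and are not taken from the repository. The toy class only counts token slots and does not allocate tensors.

```python
from dataclasses import dataclass

# Hypothetical headroom (in token slots) added whenever the buffer must grow.
KV_ALLOC_BLOCK = 256


@dataclass
class KVCache:
    """Toy stand-in for a key/value buffer; it tracks slot counts only."""
    capacity: int = 0      # slots currently allocated
    length: int = 0        # slots actually holding tokens
    extend_calls: int = 0  # number of times the buffer was reallocated

    def reserve(self, new_tokens: int) -> None:
        """Grow the buffer only when the remaining headroom is too small."""
        needed = self.length + new_tokens
        if needed > self.capacity:
            self.capacity = needed + KV_ALLOC_BLOCK
            self.extend_calls += 1

    def append(self, new_tokens: int) -> None:
        """Store new tokens, extending the buffer first if required."""
        self.reserve(new_tokens)
        self.length += new_tokens

    def rollback(self, rejected: int) -> None:
        """Drop rejected draft tokens but keep the allocation for reuse."""
        self.length -= rejected


def run_speculative_steps(cache: KVCache, steps: int, draft_len: int, accepted: int) -> KVCache:
    """Simulate draft/verify rounds; `accepted` tokens survive each round."""
    for _ in range(steps):
        cache.append(draft_len)               # draft pass proposes tokens
        cache.reserve(1)                      # verify pass reuses existing headroom
        cache.rollback(draft_len - accepted)  # discard rejected draft tokens
    return cache


cache = run_speculative_steps(KVCache(), steps=10, draft_len=4, accepted=3)
print(cache.length, cache.capacity, cache.extend_calls)
```

In this sketch the buffer is reallocated once across all ten rounds, because every later `reserve` call finds enough headroom. A real implementation would presumably pair this with a call such as `empty_cache` after large buffers are released, but where and how often to call it is not stated in the PR.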

4. How to test?

Tested locally with llama, mistral, gptj, baichuan, qwen, and chatglm.

rnwang04

LGTM

* optimize memory
* update
* update
* update
* support other models
* update
* fix style

cyita added 6 commits March 4, 2024 20:57

optimize memory

99cbe1d

update

daab8fd

update

79292e6

update

8940052

support other models

d043be2

update

60f97a8

cyita requested a review from rnwang04 March 5, 2024 09:17

rnwang04 approved these changes Mar 5, 2024

fix style

ed9f60b

cyita merged commit 786254a into intel:main Mar 6, 2024
19 checks passed

liu-shaojun pushed a commit that referenced this pull request Mar 25, 2024

Optimize speculative decoding PVC memory usage (#10329)

9ea499c

* optimize memory
* update
* update
* update
* support other models
* update
* fix style
