llama : fix defrag logic #11707
- if (cparams.causal_attn && cparams.defrag_thold >= 0.0f) {
-     const float fragmentation = kv_self.n >= 128 ? 1.0f - float(kv_self.used)/float(kv_self.n) : 0.0f;
+ if (cparams.causal_attn && cparams.defrag_thold > 0.0f) {
+     // - do not defrag small contexts (i.e. < 2048 tokens)
@ggerganov I am sometimes running benchmarks that require only 256 or 512 tokens per slot, with total context size like 512 or 1024 (for big models that don't fully fit into my VRAM). Will it work properly in cases like that?
Defragmentation for such small contexts is not really worth it, so my expectation is that with this change you should get better performance overall.
* llama : fix defrag logic ggml-ci
* cont : better logic ggml-ci
* cont : clamp fragmentation to 0.0 ggml-ci
While working on #11213 I realized that we are currently doing many unnecessary graph defrags because of incorrect cache fragmentation logic. The cache padding triggers the fragmentation threshold for small contexts even if there is no fragmentation at all.
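To make the problem concrete, here is a minimal sketch that compares the `kv_self.n >= 128` formula quoted in the diff above with one possible revised estimate. The struct `kv_view`, the helper names, and the padding value of 256 used in the note below are illustrative assumptions, not the actual llama.cpp code; the revised formula is only a guess that is consistent with the "do not defrag small contexts" comment and the "clamp fragmentation to 0.0" commit.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the two cache counters used by the formula.
struct kv_view {
    uint32_t n;    // number of cells the graph looks at (rounded up by padding)
    uint32_t used; // cells that actually hold tokens
};

// Estimate as quoted in the diff: padding inflates n, so a cache with no
// holes at all can still report a large "fragmentation" value.
static float fragmentation_old(const kv_view & kv) {
    return kv.n >= 128 ? 1.0f - float(kv.used)/float(kv.n) : 0.0f;
}

// Assumed revision: ignore small contexts, discount the padding,
// and clamp the result so it never goes below 0.0.
static float fragmentation_new(const kv_view & kv, uint32_t padding, uint32_t min_n = 2048) {
    if (kv.n < min_n) {
        return 0.0f;
    }
    return std::max(0.0f, 1.0f - float(kv.used + padding)/float(kv.n));
}

// A defrag is requested only when the threshold is enabled (> 0) and exceeded.
static bool should_defrag(float fragmentation, float thold) {
    return thold > 0.0f && fragmentation > thold;
}
```

With 130 contiguous tokens padded up to `n = 256`, the old estimate reports about 0.49 and exceeds a threshold of 0.1, even though the cache has no holes. Under the assumed revision the same state reports 0.0, while a 4096-cell view with only 2000 used cells still triggers a defrag.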
`master` has the following patch applied: