How do I combine vLLM with a Flash Attention-based LLaMA? #2784
Unanswered
alex1996-ljl asked this question in Q&A
Replies: 1 comment 1 reply
- vLLM uses Flash Attention (plus many other inference optimizations). There is nothing you have to do to enable these; it should work out of the box.
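  For reference, a minimal sketch of what "out of the box" looks like with the standard vllm offline-inference API; the model name below is only a placeholder, substitute your own LLaMA checkpoint:

  ```python
  from vllm import LLM, SamplingParams

  # Placeholder checkpoint; any Hugging Face-format LLaMA model is loaded the same way.
  llm = LLM(model="meta-llama/Llama-2-7b-hf")

  # Sampling settings for generation; adjust to taste.
  sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

  prompts = ["Explain what PagedAttention does in one sentence."]
  outputs = llm.generate(prompts, sampling_params)

  for output in outputs:
      print(output.outputs[0].text)
  ```

  As the reply above notes, there is no separate flag to pass for the attention kernels; the optimized attention path is selected internally.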
1 reply
- How can I combine vLLM with a LLaMA model based on Flash Attention? My current application runs a LLaMA model built on Flash Attention, but I want to improve efficiency with vLLM. Is there a recommended way to combine the two effectively?
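  For context, a rough sketch of the kind of swap being asked about, assuming the current application loads a Hugging Face-format LLaMA checkpoint and calls the model's generate method directly; the path and parameters below are placeholders:

  ```python
  from vllm import LLM, SamplingParams

  # Placeholder path: point vLLM at the same LLaMA weights the existing
  # Flash Attention-based application already uses (Hugging Face format).
  llm = LLM(model="/path/to/llama-checkpoint", dtype="float16")

  # The application's existing generate() call is replaced by llm.generate(),
  # which batches requests and applies vLLM's inference optimizations internally.
  params = SamplingParams(temperature=0.7, max_tokens=256)
  outputs = llm.generate(["<your prompt here>"], params)
  print(outputs[0].outputs[0].text)
  ```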