How do I combine vLLM with a Flash Attention-based LLaMA? #2784
Unanswered
alex1996-ljl asked this question in Q&A
Replies: 1 comment 1 reply
- vLLM uses Flash Attention (plus many other inference optimizations). There is nothing you have to do to enable these; it should work out of the box.
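  For reference, a minimal sketch of what "out of the box" looks like with the standard vllm offline-inference API; the model name below is only a placeholder, substitute your own LLaMA checkpoint:

  ```python
  from vllm import LLM, SamplingParams

  # Placeholder checkpoint; any Hugging Face-format LLaMA model is loaded the same way.
  llm = LLM(model="meta-llama/Llama-2-7b-hf")

  # Sampling settings for generation; adjust to taste.
  sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

  prompts = ["Explain what PagedAttention does in one sentence."]
  outputs = llm.generate(prompts, sampling_params)

  for output in outputs:
      print(output.outputs[0].text)
  ```

  As the reply above notes, there is no separate flag to pass for the attention kernels; the optimized attention path is selected internally.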
1 reply
- How can I combine vLLM with a LLaMA model based on Flash Attention? My current application runs a LLaMA model built on Flash Attention, but I want to improve efficiency with vLLM. Is there a recommended way to combine the two effectively?
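  For context, a rough sketch of the kind of swap being asked about, assuming the current application loads a Hugging Face-format LLaMA checkpoint and calls the model's generate method directly; the path and parameters below are placeholders:

  ```python
  from vllm import LLM, SamplingParams

  # Placeholder path: point vLLM at the same LLaMA weights the existing
  # Flash Attention-based application already uses (Hugging Face format).
  llm = LLM(model="/path/to/llama-checkpoint", dtype="float16")

  # The application's existing generate() call is replaced by llm.generate(),
  # which batches requests and applies vLLM's inference optimizations internally.
  params = SamplingParams(temperature=0.7, max_tokens=256)
  outputs = llm.generate(["<your prompt here>"], params)
  print(outputs[0].outputs[0].text)
  ```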