Add an option to launch cacheflow without ray #51
Thanks for the PR. Left some comments. BTW, could you show the latency benchmarks before and after this PR?
Latency with Ray:
Latency without Ray:
Thanks!
…karound Dockerfile.ubi: remove vllm-nccl workaround
Summary: In streaming mode, the vllm server returns a response as soon as a token is available. These responses are not incremental parts, however: each one is already an aggregate of all the previous responses. It is therefore sufficient to record only the last response.
Test: Manual testing
Co-authored-by: Varun <varun@neuralmagic.com>
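The behaviour described in the summary can be illustrated with a minimal sketch. It assumes the streamed responses have already been decoded into plain strings, and the helper name `final_streamed_text` is hypothetical; it is not taken from the commit or from the vllm API. Because each streamed response is cumulative, the recorder simply overwrites its state on every update instead of concatenating:

```python
from typing import Iterable, Optional


def final_streamed_text(responses: Iterable[str]) -> Optional[str]:
    """Return the last response of a cumulative stream, or None if it is empty.

    Assumption: every response already contains the full text generated so
    far, so earlier responses carry no information the last one lacks.
    """
    last: Optional[str] = None
    for response in responses:
        # Overwrite rather than append: appending would duplicate text,
        # since each response repeats everything that came before it.
        last = response
    return last


# Hypothetical cumulative stream, made up for illustration only.
stream = ["Hello", "Hello,", "Hello, world"]
print(final_streamed_text(stream))
```

If the server instead sent incremental deltas, this approach would lose text and the parts would have to be joined; the sketch only applies to the cumulative case the summary describes.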
…tch-1 Revert "Tune fused_moe_kernel for TP 1,2,4,8 and bf16 and fp16, updated moe kern…"
Fix #23