[Core] Use zeromq to put request output tokens back to the api server #28

Merged
merged 40 commits into from
Sep 10, 2024

@s5u13b s5u13b commented Sep 4, 2024

Previously, we used ray.queue as the request output queue to store the streaming request output tokens returned by the llm engine. However, we found that the cost of Ray RPC becomes very high when remote calls are frequent.

Because of Llumnix's dynamic serving, the request output queue of an api server can be written to by multiple llm engines, so the frequency of streaming output RPCs can be very high. We therefore use zeromq as the default RPC mechanism for putting streaming request output tokens back into the request output queue.

Besides, rpc over tcp (tcp://...) shows no more noticeable performance interference with the engine step than rpc over ipc (ipc://...).
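The many-writers, one-queue pattern described above can be sketched with pyzmq. This is only an illustration and not the code from this PR: the PUSH/PULL socket pair, the `engine_worker` and `run_demo` names, the message layout, and the loopback endpoint are all assumptions made for the example.

```python
import threading

import zmq


def engine_worker(ctx, endpoint, engine_id, tokens):
    """Hypothetical engine side: push each streamed token to the api server."""
    push = ctx.socket(zmq.PUSH)
    push.connect(endpoint)
    for token in tokens:
        # send_pyobj pickles the payload; this message layout is an assumption.
        push.send_pyobj({"engine": engine_id, "token": token})
    # libzmq's default linger setting lets already-queued messages drain.
    push.close()


def run_demo(num_engines=3, tokens_per_engine=4):
    """Several engines write into one queue-like PULL socket."""
    ctx = zmq.Context()
    # Hypothetical api server side: one PULL socket plays the output queue.
    pull = ctx.socket(zmq.PULL)
    pull.setsockopt(zmq.RCVTIMEO, 5000)  # raise zmq.Again instead of hanging
    port = pull.bind_to_random_port("tcp://127.0.0.1")
    endpoint = f"tcp://127.0.0.1:{port}"

    workers = [
        threading.Thread(
            target=engine_worker,
            args=(ctx, endpoint, i,
                  [f"tok{i}-{j}" for j in range(tokens_per_engine)]),
        )
        for i in range(num_engines)
    ]
    for worker in workers:
        worker.start()

    # Messages from different engines interleave, but each engine's own
    # tokens arrive in the order they were sent.
    total = num_engines * tokens_per_engine
    received = [pull.recv_pyobj() for _ in range(total)]

    for worker in workers:
        worker.join()
    pull.close()
    ctx.term()
    return received


if __name__ == "__main__":
    for message in run_demo():
        print(message)
```

In this sketch, switching between the two transports mentioned above only changes the endpoint string passed to `bind` and `connect` (a `tcp://` address versus an `ipc://` path).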

@s5u13b changed the title from "[Core] Use zeromq to put request outputs back to the api server" to "[Core] Use zeromq to put request output tokens back to the api server" on Sep 5, 2024
@s5u13b merged commit c6ac5db into main on Sep 10, 2024
4 checks passed
@s5u13b deleted the zeromq branch on October 17, 2024 02:43
4 participants