server: Add "tokens per second" information in the backend #10548

Merged (12 commits, Dec 2, 2024)

Conversation

@lhpqaq (Contributor) commented Nov 27, 2024

Implement #10502

[screenshots attached]

@ngxson (Collaborator) commented Nov 27, 2024

The idea is good, but I'm not confident about the UI/UX part:

  1. Not all users want this, so it must be hidden by default (for a clean UI) and users can activate it via the Settings menu.

  2. The text takes up quite a lot of space; I would prefer to make it more subtle. Take the jan.ai app as an example:
    [screenshot of jan.ai attached]

  3. For the code, we can calculate these numbers in real time, on the frontend. This provides a better UX and allows showing the t/s speed in real time (a rough sketch follows this list).
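
A minimal sketch of that frontend-side approach, assuming an OpenAI-compatible SSE stream like the one shown later in this thread; the endpoint, request shape, and one-token-per-chunk assumption are illustrative, not the actual web UI code:

```typescript
// Sketch: estimate tokens/second on the client while streaming.
// Assumes the server streams OpenAI-style "data: {...}" SSE chunks,
// roughly one token per chunk; endpoint and payload are illustrative only.
async function streamTokensPerSecond(prompt: string): Promise<void> {
  const res = await fetch("/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let tokenCount = 0;
  let startedAt: number | null = null;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // NOTE: a robust client would buffer partial lines across reads;
    // omitted here to keep the sketch short.
    for (const line of decoder.decode(value, { stream: true }).split("\n")) {
      if (!line.startsWith("data: ") || line.includes("[DONE]")) continue;
      const chunk = JSON.parse(line.slice("data: ".length));
      const delta = chunk.choices?.[0]?.delta?.content;
      if (delta === undefined) continue;

      tokenCount++;
      startedAt ??= performance.now();
      const elapsed = (performance.now() - startedAt) / 1000;
      if (elapsed > 0) {
        console.log(`~${(tokenCount / elapsed).toFixed(1)} t/s (client estimate)`);
      }
    }
  }
}
```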

@lhpqaq (Contributor, author) commented Nov 28, 2024

@ngxson Thank you for your suggestion. I’m also not very confident about the UI/UX part.
I have reverted the frontend changes and added a real-time speed field in the backend instead. I hope it proves useful:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" assist"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.860551225775627}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.295321955588292}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" today"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.80692518481443}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.234142607517185}

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":1561,"total_tokens":1571,"gen_second":28.003281984648602,"prompt_second":479.472377660826}}

data: [DONE]
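
For reference, a minimal sketch of how a client could consume this draft field; the TypeScript types below are inferred from the dump above and are not part of the server code. The field names (gen_second, prompt_second) are from this intermediate revision; the review below asks for them to be folded into the existing "timings" object, so the final names may differ.

```typescript
// Sketch of a consumer reading the per-chunk speed field from this draft stream.
interface DraftChunk {
  choices: { delta: { content?: string }; finish_reason: string | null }[];
  gen_second?: number;                  // generation speed so far, tokens/s
  usage?: {
    completion_tokens: number;
    prompt_tokens: number;
    gen_second: number;                 // overall generation speed
    prompt_second: number;              // prompt-processing speed
  };
}

function describeSpeed(chunk: DraftChunk): string {
  if (chunk.usage) {
    // Final chunk: overall prompt-processing and generation speed.
    return `prompt: ${chunk.usage.prompt_second.toFixed(1)} t/s, ` +
           `generation: ${chunk.usage.gen_second.toFixed(1)} t/s`;
  }
  return chunk.gen_second !== undefined
    ? `generation so far: ${chunk.gen_second.toFixed(1)} t/s`
    : "no timing data in this chunk";
}
```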

@lhpqaq changed the title: server: Add "tokens per second" information in the Web UI → server: Add "tokens per second" information in the backend (Nov 28, 2024)
@ngxson (Collaborator) commented Nov 28, 2024

I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you check whether it's a duplication?

@lhpqaq (Contributor, author) commented Nov 28, 2024

> I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you check whether it's a duplication?

Yes, but get_formated_timings() only calculates the final result and lacks real-time speed during the prediction process.
[screenshot attached]

@ngxson (Collaborator) commented Nov 28, 2024

It doesn't get the correct value because slot.t_token_generation is not set during generation. You can simply set it.

What I'm thinking is:

  • This feature should not be enabled by default, because it can impact overall performance and network bandwidth. You should only return per-token timings if the user asks for them (i.e. if the user sets "timing_per_token": true in the request); a request sketch follows this list.
  • It's better and more intuitive to reuse the "timings" object provided by get_formated_timings(), so developers don't have to rewrite their code or memorize the difference between n_gen_second and timings.
  • Also remember to update the documentation.
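
As a rough illustration of the opt-in flow being suggested here (the flag name "timing_per_token" is quoted from the review; the endpoint and the field names inside "timings" are assumptions based on this thread, not a confirmed API):

```typescript
// Sketch: the client asks for per-token timings explicitly and reads them
// back from a reused "timings" object.
async function completionWithTimings(prompt: string): Promise<void> {
  const res = await fetch("/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      timing_per_token: true,   // opt in; default stays off to save bandwidth
    }),
  });

  const data = await res.json();
  // Hypothetical shape mirroring get_formated_timings() output.
  const timings = data.timings ?? {};
  console.log("prompt t/s:", timings.prompt_per_second);
  console.log("generation t/s:", timings.predicted_per_second);
}
```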

@ngxson (Collaborator) left a comment:

This code can be simplified further.

To pass the CI, you need to merge with the latest upstream master branch.

Resolved review threads (now outdated): examples/server/server.cpp, examples/server/utils.hpp, examples/server/utils.hpp
@github-actions bot added the python (python script changes) label on Dec 2, 2024
@ngxson merged commit 64ed209 into ggerganov:master on Dec 2, 2024
52 checks passed
@lhpqaq (Contributor, author) commented Dec 2, 2024

@ngxson Thanks~

@lhpqaq deleted the token branch on December 4, 2024
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Labels: examples, python (python script changes), server