server: Add "tokens per second" information in the backend #10548

Merged (12 commits, Dec 2, 2024)

Conversation

@lhpqaq (Contributor) commented Nov 27, 2024

Implement #10502

[screenshots attached]

@ngxson (Collaborator) commented Nov 27, 2024

The idea is good, but I'm not confident about the UI/UX part:

  1. Not all users want this, so it must be hidden by default (for a clean UI) and users can activate it via the Settings menu.

  2. The text takes up quite a lot of space; I would prefer to make it more subtle. Take the jan.ai app as an example:
    [screenshot of jan.ai attached]

  3. For the code, we can calculate these numbers in real time, on the frontend. This provides a better UX and allows showing the t/s speed in real time (a rough sketch follows this list).
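
A minimal sketch of that frontend-side approach, assuming an OpenAI-compatible SSE stream like the one shown later in this thread; the endpoint, request shape, and one-token-per-chunk assumption are illustrative, not the actual web UI code:

```typescript
// Sketch: estimate tokens/second on the client while streaming.
// Assumes the server streams OpenAI-style "data: {...}" SSE chunks,
// roughly one token per chunk; endpoint and payload are illustrative only.
async function streamTokensPerSecond(prompt: string): Promise<void> {
  const res = await fetch("/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let tokenCount = 0;
  let startedAt: number | null = null;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // NOTE: a robust client would buffer partial lines across reads;
    // omitted here to keep the sketch short.
    for (const line of decoder.decode(value, { stream: true }).split("\n")) {
      if (!line.startsWith("data: ") || line.includes("[DONE]")) continue;
      const chunk = JSON.parse(line.slice("data: ".length));
      const delta = chunk.choices?.[0]?.delta?.content;
      if (delta === undefined) continue;

      tokenCount++;
      startedAt ??= performance.now();
      const elapsed = (performance.now() - startedAt) / 1000;
      if (elapsed > 0) {
        console.log(`~${(tokenCount / elapsed).toFixed(1)} t/s (client estimate)`);
      }
    }
  }
}
```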

@lhpqaq (Contributor, author) commented Nov 28, 2024

@ngxson Thank you for your suggestion. I’m also not very confident about the UI/UX part.
I have reverted the frontend changes and added a real-time speed field in the backend instead. I hope it proves useful:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" assist"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.860551225775627}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":29.295321955588292}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" today"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.80692518481443}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","gen_second":28.234142607517185}

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1732762062,"id":"chatcmpl-evuFEOOBA3wfqsmOq1Yvxz9S6X37BdLu","model":"gpt-3.5-turbo-0613","object":"chat.completion.chunk","usage":{"completion_tokens":10,"prompt_tokens":1561,"total_tokens":1571,"gen_second":28.003281984648602,"prompt_second":479.472377660826}}

data: [DONE]
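
For reference, a minimal sketch of how a client could consume this draft field; the TypeScript types below are inferred from the dump above and are not part of the server code. The field names (gen_second, prompt_second) are from this intermediate revision; the review below asks for them to be folded into the existing "timings" object, so the final names may differ.

```typescript
// Sketch of a consumer reading the per-chunk speed field from this draft stream.
interface DraftChunk {
  choices: { delta: { content?: string }; finish_reason: string | null }[];
  gen_second?: number;                  // generation speed so far, tokens/s
  usage?: {
    completion_tokens: number;
    prompt_tokens: number;
    gen_second: number;                 // overall generation speed
    prompt_second: number;              // prompt-processing speed
  };
}

function describeSpeed(chunk: DraftChunk): string {
  if (chunk.usage) {
    // Final chunk: overall prompt-processing and generation speed.
    return `prompt: ${chunk.usage.prompt_second.toFixed(1)} t/s, ` +
           `generation: ${chunk.usage.gen_second.toFixed(1)} t/s`;
  }
  return chunk.gen_second !== undefined
    ? `generation so far: ${chunk.gen_second.toFixed(1)} t/s`
    : "no timing data in this chunk";
}
```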

@lhpqaq changed the title: server: Add "tokens per second" information in the Web UI → server: Add "tokens per second" information in the backend (Nov 28, 2024)
@ngxson (Collaborator) commented Nov 28, 2024

I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you check whether it's a duplication?

@lhpqaq (Contributor, author) commented Nov 28, 2024

> I haven't had time to look deeper into this, but it seems like what you're doing is already handled by get_formated_timings(). Can you check whether it's a duplication?

Yes, but get_formated_timings() only calculates the final result and lacks real-time speed during the prediction process.
[screenshot attached]

@ngxson (Collaborator) commented Nov 28, 2024

It doesn't get the correct value because slot.t_token_generation is not set during generation. You can simply set it.

What I'm thinking is:

  • This feature should not be enabled by default, because it can impact overall performance and network bandwidth. You should only return per-token timings if the user asks for them (i.e. if the user sets "timing_per_token": true in the request); a request sketch follows this list.
  • It's better and more intuitive to reuse the "timings" object provided by get_formated_timings(), so developers don't have to rewrite their code or memorize the difference between n_gen_second and timings.
  • Also remember to update the documentation.
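
As a rough illustration of the opt-in flow being suggested here (the flag name "timing_per_token" is quoted from the review; the endpoint and the field names inside "timings" are assumptions based on this thread, not a confirmed API):

```typescript
// Sketch: the client asks for per-token timings explicitly and reads them
// back from a reused "timings" object.
async function completionWithTimings(prompt: string): Promise<void> {
  const res = await fetch("/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      timing_per_token: true,   // opt in; default stays off to save bandwidth
    }),
  });

  const data = await res.json();
  // Hypothetical shape mirroring get_formated_timings() output.
  const timings = data.timings ?? {};
  console.log("prompt t/s:", timings.prompt_per_second);
  console.log("generation t/s:", timings.predicted_per_second);
}
```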

@ngxson (Collaborator) left a comment:

This code can be simplified further.

To pass the CI, you need to merge with the latest upstream master branch.

Resolved review threads (now outdated): examples/server/server.cpp, examples/server/utils.hpp, examples/server/utils.hpp
@github-actions bot added the python (python script changes) label on Dec 2, 2024
@ngxson merged commit 64ed209 into ggerganov:master on Dec 2, 2024
52 checks passed
@lhpqaq (Contributor, author) commented Dec 2, 2024

@ngxson Thanks~

@lhpqaq deleted the token branch on December 4, 2024
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
Labels: examples, python (python script changes), server