Resolve the discrepancy of latency report between LLMs and non-LLMs #8576
Comments
The Android benchmark app doesn't save inference latency for LLMs (https://github.com/pytorch/executorch/blob/main/extension/benchmark/android/benchmark/app/src/main/java/org/pytorch/minibench/LlmBenchmarkActivity.java#L95-L112), while it does for non-LLMs. @kirklandsign can we add raw inference latency for LLMs? It would be useful for detecting whether slowness comes from the core runtime or from the tokenizer itself.
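A minimal, self-contained sketch of what recording raw per-forward latency next to the end-to-end generate time could look like. This is not the actual `LlmBenchmarkActivity` code; the `Metric`/`GenerateStats` types, the metric name strings, and the stand-in runner are all illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public final class LlmLatencySketch {

  /** One reported metric: a name plus a value, mirroring the dashboard columns. */
  record Metric(String name, double value) {}

  /** What one generate() call yields: token count and time spent inside forward(). */
  record GenerateStats(int generatedTokens, long forwardTimeMs) {}

  static List<Metric> benchmarkOnce(Supplier<GenerateStats> generateCall) {
    long start = System.nanoTime();
    GenerateStats stats = generateCall.get(); // tokenizer + repeated forward() under the hood
    long generateMs = (System.nanoTime() - start) / 1_000_000;

    List<Metric> metrics = new ArrayList<>();
    // End-to-end latency of the whole generation (what the dash calls generate_time).
    metrics.add(new Metric("generate_time(ms)", generateMs));
    // Raw runtime latency per forward() call (what avg_inference_latency would capture).
    metrics.add(new Metric("avg_inference_latency(ms)",
        (double) stats.forwardTimeMs() / stats.generatedTokens()));
    // Tokens per second over the whole generation, including tokenizer overhead.
    metrics.add(new Metric("token_per_sec", stats.generatedTokens() * 1000.0 / generateMs));
    return metrics;
  }

  public static void main(String[] args) {
    // Stand-in for the real LLM runner: pretend 64 tokens were generated and
    // 200 ms of the wall time was spent inside forward().
    List<Metric> out = benchmarkOnce(() -> {
      try {
        Thread.sleep(250);
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      }
      return new GenerateStats(64, 200);
    });
    out.forEach(m -> System.out.println(m.name() + " = " + m.value()));
  }
}
```

Reporting both values is what lets the dashboard distinguish "the runtime got slower" from "everything around the runtime got slower".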
For the iOS app, upon checking executorch/extension/benchmark/apple/Benchmark/Tests/GenericTests.mm (lines 48 to 94 in 00c1443): the forward test is what feeds avg_inference_latency on the dash, and the generate test is what feeds generate_time on the dash. What each test is measuring is pretty clear, @huydhn. I guess the remaining bit is why avg_inference_latency isn't reported for LLMs.
For the iOS case, I think it's a bug. I could push a fix for this, or should we wait until @shoumikhin is back to confirm? I'm trying to remember why we implemented it this way.
The forward tests run forward() and measure latency on any model. The generate tests measure tokens per second specifically, leveraging the llama runner to predict the next token several times consecutively; the runner eventually calls forward() under the hood each time. @huydhn @guangy10 let me know if you need any further details.
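As a back-of-the-envelope illustration of how the two test flavours relate, assuming (per the description above) that a generate run calls forward() roughly once per emitted token. All numbers here are made up:

```java
// Toy numbers only: shows why reporting both avg_inference_latency and a
// generate-level metric helps separate core-runtime cost from tokenizer cost.
public final class MetricRelation {
  public static void main(String[] args) {
    double avgForwardLatencyMs = 45.0;   // what a forward test would measure per call
    int generatedTokens = 128;           // tokens produced by a generate test
    double nonRuntimeOverheadMs = 380.0; // tokenization, sampling, etc. outside forward()

    // generate_time is roughly N forward calls plus everything outside the runtime.
    double generateTimeMs = generatedTokens * avgForwardLatencyMs + nonRuntimeOverheadMs;
    double tokensPerSec = generatedTokens * 1000.0 / generateTimeMs;
    // Upper bound on tokens/sec if the runtime were the only cost.
    double runtimeOnlyTps = 1000.0 / avgForwardLatencyMs;

    System.out.printf("generate_time ~ %.0f ms, tokens/sec ~ %.1f (runtime-only bound ~ %.1f)%n",
        generateTimeMs, tokensPerSec, runtimeOnlyTps);
  }
}
```

The gap between the measured tokens/sec and the runtime-only bound is exactly the signal that the raw inference latency would expose for LLMs.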
@huydhn OK, I think we should report avg_inference_time for any model. @shoumikhin I think we don't need to report both if …
Due to pytorch/executorch#8576 (comment): as we cannot go back and update historical data, we could hide `generate_time(ms)` for a week or two until there is new data. It could also be hidden permanently if we decide to keep only the TPS metric.

### Preview

https://torchci-git-fork-huydhn-hide-generate-time-0f0c4f-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch
🐛 Describe the bug
As shown on the dashboard, avg_inference_latency (ms) is skipped for LLMs and only generate_time (ms) is reported instead.

Upon checking the iOS run, for example, an LLM job will run three tests on-device to report different metrics:
- test_load_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4
- test_forward_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4
- test_generate_llama_3_2_1b_llama3_fb16_pte_tokenizer_model_iOS_17_2_1_iPhone15_4
A non-LLM job, by contrast, will only run the first two tests (test_load_* and test_forward_*).
See detailed jobs here:
Things to get clarification on in this task:

- test_forward_* is reported for both LLM and non-LLM jobs, so why isn't it reported to the dash?
- Confirm if Android is measuring and reporting the exact same metrics (Report avg_inference_latency from Android benchmark app #8578).

Versions
trunk
cc @huydhn @kirklandsign @shoumikhin @mergennachin @byjlw