
Resolve the discrepancy of latency report between LLMs and non-LLMs #8576

Open
guangy10 opened this issue Feb 19, 2025 · 5 comments
Labels
  enhancement - Not as big of a feature, but technically not a bug. Should be easy to fix
  module: benchmark - Issues related to the benchmark infrastructure
  module: user experience - Issues related to reducing friction for users
  triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

@guangy10
Contributor

guangy10 commented Feb 19, 2025

🐛 Describe the bug

[Screenshot of the benchmark dashboard]

As shown on the dashboard, avg_inference_latency (ms) is skipped for LLMs, which only report generate_time (ms) instead.

Checking the iOS run as an example, an LLM job runs three tests on-device to report different metrics:

  1. test_load_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4
  2. test_forward_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4
  3. test_generate_llama_3_2_1b_llama3_fb16_pte_tokenizer_model_iOS_17_2_1_iPhone15_4

A non-LLM job, by contrast, runs only the first two tests (test_load_* and test_forward_*).

See detailed jobs here:

Three things to clarify in this task:

  1. Since test_forward_* runs for both LLM and non-LLM jobs, why isn't it reported to the dashboard for LLMs?
  2. Let's annotate each metric in the DB so users will know exactly what is measured by each.
  3. Confirm whether Android is measuring and reporting exactly the same metrics: Report avg_inference_latency from Android benchmark app #8578

Versions

trunk

cc @huydhn @kirklandsign @shoumikhin @mergennachin @byjlw

@guangy10 guangy10 added enhancement Not as big of a feature, but technically not a bug. Should be easy to fix module: benchmark Issues related to the benchmark infrastructure labels Feb 19, 2025
@guangy10 guangy10 added this to the 0.6.0 milestone Feb 19, 2025
@guangy10 guangy10 moved this to Ready in ExecuTorch Benchmark Feb 19, 2025
@guangy10 guangy10 added the module: user experience Issues related to reducing friction for users label Feb 19, 2025
@github-project-automation github-project-automation bot moved this to To triage in ExecuTorch DevX Feb 19, 2025
@guangy10
Contributor Author

The Android benchmark app doesn't save inference latency for LLMs (https://github.com/pytorch/executorch/blob/main/extension/benchmark/android/benchmark/app/src/main/java/org/pytorch/minibench/LlmBenchmarkActivity.java#L95-L112), while it does for non-LLMs.

@kirklandsign Can we add raw inference latency for LLMs? It would be useful for detecting whether slowness comes from the core runtime or from the tokenizer itself.
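
As a reference point for the ask, here is a minimal sketch of recording raw inference latency separately from end-to-end generation time on the Android side. It assumes a hypothetical ModelRunner interface; the class and method names are placeholders for illustration, not the real minibench API:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: ModelRunner and its methods are placeholders,
// not the actual classes in LlmBenchmarkActivity. The idea is to report the
// raw forward() average and the end-to-end generate() time as separate
// metrics for LLM jobs, mirroring what the iOS tests already do.
final class LlmLatencySketch {
  interface ModelRunner {
    void forward();                               // single raw inference call
    void generate(String prompt, int maxTokens);  // forward + tokenization loop
  }

  // Average wall-clock latency of forward() in milliseconds over N iterations.
  static double averageForwardLatencyMs(ModelRunner runner, int iterations) {
    List<Long> samples = new ArrayList<>();
    for (int i = 0; i < iterations; i++) {
      long start = System.nanoTime();
      runner.forward();
      samples.add(System.nanoTime() - start);
    }
    long total = 0;
    for (long sample : samples) {
      total += sample;
    }
    return (total / (double) samples.size()) / 1_000_000.0;
  }

  // End-to-end generation time in milliseconds (tokenization included).
  static double generateTimeMs(ModelRunner runner) {
    long start = System.nanoTime();
    runner.generate("Once upon a time", 50);
    return (System.nanoTime() - start) / 1_000_000.0;
  }
}

Reporting both would make it possible to tell whether a regression comes from the core runtime (forward) or from the tokenization/sampling loop around it.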

@guangy10
Contributor Author

For the iOS app, upon checking, test_forward_ measures the raw latency around forward() for both LLM and non-LLM models:

return @{
  @"load" : ^(XCTestCase *testCase){
    [testCase
        measureWithMetrics:@[ [XCTClockMetric new], [XCTMemoryMetric new] ]
                     block:^{
                       XCTAssertEqual(
                           Module(modelPath.UTF8String).load_forward(),
                           Error::Ok);
                     }];
  },
  @"forward" : ^(XCTestCase *testCase) {
    auto __block module = std::make_unique<Module>(modelPath.UTF8String);
    const auto method_meta = module->method_meta("forward");
    ASSERT_OK_OR_RETURN(method_meta);
    const auto num_inputs = method_meta->num_inputs();
    XCTAssertGreaterThan(num_inputs, 0);
    std::vector<TensorPtr> tensors;
    tensors.reserve(num_inputs);
    for (auto index = 0; index < num_inputs; ++index) {
      const auto input_tag = method_meta->input_tag(index);
      ASSERT_OK_OR_RETURN(input_tag);
      switch (*input_tag) {
        case Tag::Tensor: {
          const auto tensor_meta = method_meta->input_tensor_meta(index);
          ASSERT_OK_OR_RETURN(tensor_meta);
          const auto sizes = tensor_meta->sizes();
          tensors.emplace_back(
              rand({sizes.begin(), sizes.end()}, tensor_meta->scalar_type()));
          XCTAssertEqual(module->set_input(tensors.back(), index), Error::Ok);
        } break;
        default:
          XCTFail("Unsupported tag %i at input %d", *input_tag, index);
      }
    }
    XCTMeasureOptions *options = [[XCTMeasureOptions alloc] init];
    options.iterationCount = 20;
    [testCase measureWithMetrics:@[ [XCTClockMetric new], [XCTMemoryMetric new] ]
                         options:options
                           block:^{
                             XCTAssertEqual(module->forward().error(), Error::Ok);
                           }];
This metric is mapped to avg_inference_latency on the dashboard.

test_generate_ is LLM-specific, measuring the latency around generate() (forward + tokenization):

@"generate" : ^(XCTestCase *testCase){
auto __block runner = std::make_unique<example::Runner>(
modelPath.UTF8String, tokenizerPath.UTF8String);
const auto status = runner->load();
if (status != Error::Ok) {
XCTFail("Load failed with error %i", status);
return;
}
TokensPerSecondMetric *tokensPerSecondMetric = [TokensPerSecondMetric new];
[testCase measureWithMetrics:@[ tokensPerSecondMetric, [XCTMemoryMetric new] ]
block:^{
tokensPerSecondMetric.tokenCount = 0;
const auto status = runner->generate(
"Once upon a time",
50,
[=](const std::string &token) {
tokensPerSecondMetric.tokenCount++;
},
nullptr,
false);
XCTAssertEqual(status, Error::Ok);
}];
This metric is mapped to generate_time on the dashboard.

What each test measures is pretty clear. @huydhn, I guess the remaining question is why avg_inference_latency isn't reported to the dashboard for LLMs?

@huydhn
Contributor

huydhn commented Feb 19, 2025

For the iOS case, I think it's a bug:

I could push a fix for this, or should we wait until @shoumikhin is back to confirm? I'm trying to remember why we implemented it this way.

@shoumikhin
Contributor

forward tests run forward and measure latency on any model.

generate tests measure tokens per second specifically, leveraging the llama runner to predict the next token several times consecutively, and the runner eventually calls forward under the hood each time.

@huydhn @guangy10 let me know if you need any further details.

@guangy10
Contributor Author

guangy10 commented Feb 20, 2025

forward tests run forward and measure latency on any model.

generate tests measure tokens per second specifically, leveraging the llama runner to predict the next token several times consecutively, and the runner eventually calls forward under the hood each time.

@huydhn @guangy10 let me know if you need any further details.

@huydhn OK, I think we should report avg_inference_latency for any model.

@shoumikhin I don't think we need to report both if generate_time (ms) and tokens per second are essentially measuring the same thing, i.e. TPS = 1000 / generate_time, correct? If we look at the 1st model in the screenshot, shouldn't its TPS be 1000/18 = 55.6 instead of 45? Checking the code, it seems like TPS is measured slightly differently:

double elapsedTime =
    (endTime.absoluteTimeNanoSeconds - startTime.absoluteTimeNanoSeconds) /
    (double)NSEC_PER_SEC;
return @[ [[XCTPerformanceMeasurement alloc]
    initWithIdentifier:NSStringFromClass([self class])
           displayName:@"Tokens Per Second"
           doubleValue:(self.tokenCount / elapsedTime)
            unitSymbol:@"t/s"] ];
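
To make the arithmetic in the question above concrete, here is a small sketch of the comparison. The 18 ms and 45 t/s values come from the screenshot discussion, and treating generate_time as a per-token average is the assumption being questioned, not a confirmed definition of the dashboard metric:

// Hypothetical sanity check mirroring the question above. If generate_time (ms)
// were the average wall-clock time per generated token, the implied throughput
// would be 1000 / generate_time. The metric code instead reports
// tokenCount / elapsedTime over the whole measured block, so the two values
// only coincide if they are normalized over exactly the same work.
final class TpsSanityCheck {
  public static void main(String[] args) {
    double generateTimeMs = 18.0;                // first model in the screenshot
    double impliedTps = 1000.0 / generateTimeMs; // ~55.6 t/s
    double reportedTps = 45.0;                   // value shown on the dashboard
    System.out.printf("implied: %.1f t/s, reported: %.1f t/s%n",
        impliedTps, reportedTps);
  }
}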

huydhn added a commit to huydhn/test-infra that referenced this issue Feb 21, 2025
@swolchok swolchok added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 21, 2025
huydhn added a commit to pytorch/test-infra that referenced this issue Feb 22, 2025
Due to pytorch/executorch#8576 (comment):

As we cannot go back and update historical data, we could hide `generate_time(ms)` for a week or two until there is new data. Maybe it could also be hidden permanently if we decide to keep only the TPS metric.

### Preview


https://torchci-git-fork-huydhn-hide-generate-time-0f0c4f-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch