
Resolve the discrepancy of latency report between LLMs and non-LLMs #8576

Open
guangy10 opened this issue Feb 19, 2025 · 5 comments
Labels
  enhancement - Not as big of a feature, but technically not a bug. Should be easy to fix
  module: benchmark - Issues related to the benchmark infrastructure
  module: user experience - Issues related to reducing friction for users
  triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

@guangy10
Contributor

guangy10 commented Feb 19, 2025

🐛 Describe the bug

[Screenshot of the benchmark dashboard]

As shown on the dashboard, avg_inference_latency (ms) is skipped for LLMs, which only report generate_time (ms) instead.

Checking the iOS run as an example, an LLM job runs three tests on-device to report different metrics:

  1. test_load_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4
  2. test_forward_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4
  3. test_generate_llama_3_2_1b_llama3_fb16_pte_tokenizer_model_iOS_17_2_1_iPhone15_4

A non-LLM job, by contrast, runs only the first two tests (test_load_* and test_forward_*).

See detailed jobs here:

Three things to clarify in this task:

  1. Since test_forward_* runs for both LLM and non-LLM jobs, why isn't it reported to the dashboard for LLMs?
  2. Let's annotate each metric in the DB so users will know exactly what is measured by each.
  3. Confirm whether Android is measuring and reporting exactly the same metrics: Report avg_inference_latency from Android benchmark app #8578

Versions

trunk

cc @huydhn @kirklandsign @shoumikhin @mergennachin @byjlw

@guangy10 guangy10 added enhancement Not as big of a feature, but technically not a bug. Should be easy to fix module: benchmark Issues related to the benchmark infrastructure labels Feb 19, 2025
@guangy10 guangy10 added this to the 0.6.0 milestone Feb 19, 2025
@guangy10 guangy10 moved this to Ready in ExecuTorch Benchmark Feb 19, 2025
@guangy10 guangy10 added the module: user experience Issues related to reducing friction for users label Feb 19, 2025
@github-project-automation github-project-automation bot moved this to To triage in ExecuTorch DevX Feb 19, 2025
@guangy10
Contributor Author

The Android benchmark app doesn't save inference latency for LLMs (https://github.com/pytorch/executorch/blob/main/extension/benchmark/android/benchmark/app/src/main/java/org/pytorch/minibench/LlmBenchmarkActivity.java#L95-L112), while it does for non-LLMs.

@kirklandsign Can we add raw inference latency for LLMs? It would be useful for detecting whether slowness comes from the core runtime or from the tokenizer itself.
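
As a reference point for the ask, here is a minimal sketch of recording raw inference latency separately from end-to-end generation time on the Android side. It assumes a hypothetical ModelRunner interface; the class and method names are placeholders for illustration, not the real minibench API:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: ModelRunner and its methods are placeholders,
// not the actual classes in LlmBenchmarkActivity. The idea is to report the
// raw forward() average and the end-to-end generate() time as separate
// metrics for LLM jobs, mirroring what the iOS tests already do.
final class LlmLatencySketch {
  interface ModelRunner {
    void forward();                               // single raw inference call
    void generate(String prompt, int maxTokens);  // forward + tokenization loop
  }

  // Average wall-clock latency of forward() in milliseconds over N iterations.
  static double averageForwardLatencyMs(ModelRunner runner, int iterations) {
    List<Long> samples = new ArrayList<>();
    for (int i = 0; i < iterations; i++) {
      long start = System.nanoTime();
      runner.forward();
      samples.add(System.nanoTime() - start);
    }
    long total = 0;
    for (long sample : samples) {
      total += sample;
    }
    return (total / (double) samples.size()) / 1_000_000.0;
  }

  // End-to-end generation time in milliseconds (tokenization included).
  static double generateTimeMs(ModelRunner runner) {
    long start = System.nanoTime();
    runner.generate("Once upon a time", 50);
    return (System.nanoTime() - start) / 1_000_000.0;
  }
}

Reporting both would make it possible to tell whether a regression comes from the core runtime (forward) or from the tokenization/sampling loop around it.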

@guangy10
Contributor Author

For the iOS app, upon checking, test_forward_ measures the raw latency around forward() for both LLM and non-LLM models:

return @{
  @"load" : ^(XCTestCase *testCase){
    [testCase
        measureWithMetrics:@[ [XCTClockMetric new], [XCTMemoryMetric new] ]
                     block:^{
                       XCTAssertEqual(
                           Module(modelPath.UTF8String).load_forward(),
                           Error::Ok);
                     }];
  },
  @"forward" : ^(XCTestCase *testCase) {
    auto __block module = std::make_unique<Module>(modelPath.UTF8String);
    const auto method_meta = module->method_meta("forward");
    ASSERT_OK_OR_RETURN(method_meta);
    const auto num_inputs = method_meta->num_inputs();
    XCTAssertGreaterThan(num_inputs, 0);
    std::vector<TensorPtr> tensors;
    tensors.reserve(num_inputs);
    for (auto index = 0; index < num_inputs; ++index) {
      const auto input_tag = method_meta->input_tag(index);
      ASSERT_OK_OR_RETURN(input_tag);
      switch (*input_tag) {
        case Tag::Tensor: {
          const auto tensor_meta = method_meta->input_tensor_meta(index);
          ASSERT_OK_OR_RETURN(tensor_meta);
          const auto sizes = tensor_meta->sizes();
          tensors.emplace_back(
              rand({sizes.begin(), sizes.end()}, tensor_meta->scalar_type()));
          XCTAssertEqual(module->set_input(tensors.back(), index), Error::Ok);
        } break;
        default:
          XCTFail("Unsupported tag %i at input %d", *input_tag, index);
      }
    }
    XCTMeasureOptions *options = [[XCTMeasureOptions alloc] init];
    options.iterationCount = 20;
    [testCase measureWithMetrics:@[ [XCTClockMetric new], [XCTMemoryMetric new] ]
                         options:options
                           block:^{
                             XCTAssertEqual(module->forward().error(), Error::Ok);
                           }];
This metric is mapped to avg_inference_latency on the dashboard.

test_generate_ is LLM-specific, measuring the latency around generate() (forward + tokenization):

@"generate" : ^(XCTestCase *testCase){
auto __block runner = std::make_unique<example::Runner>(
modelPath.UTF8String, tokenizerPath.UTF8String);
const auto status = runner->load();
if (status != Error::Ok) {
XCTFail("Load failed with error %i", status);
return;
}
TokensPerSecondMetric *tokensPerSecondMetric = [TokensPerSecondMetric new];
[testCase measureWithMetrics:@[ tokensPerSecondMetric, [XCTMemoryMetric new] ]
block:^{
tokensPerSecondMetric.tokenCount = 0;
const auto status = runner->generate(
"Once upon a time",
50,
[=](const std::string &token) {
tokensPerSecondMetric.tokenCount++;
},
nullptr,
false);
XCTAssertEqual(status, Error::Ok);
}];
This metric is mapped to generate_time on the dashboard.

What each test measures is pretty clear. @huydhn, I guess the remaining question is why avg_inference_latency isn't reported to the dashboard for LLMs?

@huydhn
Contributor

huydhn commented Feb 19, 2025

For the iOS case, I think it's a bug:

I could push a fix for this, or should we wait until @shoumikhin is back to confirm? I'm trying to remember why we implemented it this way.

@shoumikhin
Contributor

forward tests run forward and measure latency on any model.

generate tests measure tokens per second specifically, leveraging the llama runner to predict the next token several times consecutively, and the runner eventually calls forward under the hood each time.

@huydhn @guangy10 let me know if you need any further details.

@guangy10
Contributor Author

guangy10 commented Feb 20, 2025

forward tests run forward and measure latency on any model.

generate tests measure tokens per second specifically, leveraging the llama runner to predict the next token several times consecutively, and the runner eventually calls forward under the hood each time.

@huydhn @guangy10 let me know if you need any further details.

@huydhn OK, I think we should report avg_inference_latency for any model.

@shoumikhin I don't think we need to report both if generate_time (ms) and tokens per second are essentially measuring the same thing, i.e. TPS = 1000 / generate_time, correct? If we look at the 1st model in the screenshot, shouldn't its TPS be 1000/18 = 55.6 instead of 45? Checking the code, it seems like TPS is measured slightly differently:

double elapsedTime =
    (endTime.absoluteTimeNanoSeconds - startTime.absoluteTimeNanoSeconds) /
    (double)NSEC_PER_SEC;
return @[ [[XCTPerformanceMeasurement alloc]
    initWithIdentifier:NSStringFromClass([self class])
           displayName:@"Tokens Per Second"
           doubleValue:(self.tokenCount / elapsedTime)
            unitSymbol:@"t/s"] ];
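
To make the arithmetic in the question above concrete, here is a small sketch of the comparison. The 18 ms and 45 t/s values come from the screenshot discussion, and treating generate_time as a per-token average is the assumption being questioned, not a confirmed definition of the dashboard metric:

// Hypothetical sanity check mirroring the question above. If generate_time (ms)
// were the average wall-clock time per generated token, the implied throughput
// would be 1000 / generate_time. The metric code instead reports
// tokenCount / elapsedTime over the whole measured block, so the two values
// only coincide if they are normalized over exactly the same work.
final class TpsSanityCheck {
  public static void main(String[] args) {
    double generateTimeMs = 18.0;                // first model in the screenshot
    double impliedTps = 1000.0 / generateTimeMs; // ~55.6 t/s
    double reportedTps = 45.0;                   // value shown on the dashboard
    System.out.printf("implied: %.1f t/s, reported: %.1f t/s%n",
        impliedTps, reportedTps);
  }
}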

huydhn added a commit to huydhn/test-infra that referenced this issue Feb 21, 2025
@swolchok swolchok added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 21, 2025
huydhn added a commit to pytorch/test-infra that referenced this issue Feb 22, 2025
Due to pytorch/executorch#8576 (comment):

As we cannot go back and update historical data, we could hide `generate_time(ms)` for a week or two until there is new data. Maybe it could also be hidden permanently if we decide to keep only the TPS metric.

### Preview


https://torchci-git-fork-huydhn-hide-generate-time-0f0c4f-fbopensource.vercel.app/benchmark/llms?repoName=pytorch%2Fexecutorch