
Fixes delayed sampling for sequential requests #845

Merged

Conversation


@attafosu attafosu commented Feb 20, 2025

  • Previously, LLM.generate() could not be called multiple times with delayed sampling enabled; the same problem occurred with repeated step() calls.
  • The issue occurs after the last (batch) request finishes and a new request starts: cached_step_inputs and cached_step_outputs still contain elements saved from the previously served (batch) request, which should not be the case.
  • The cleanest solution would be to skip appending to cached_step_inputs/cached_step_outputs when the most recently generated output is the final token for the current batch request, but I could not find a clean way to check for this in the model runner.
  • Instead, we check (in _patch_prev_output) whether the scheduler context's output_queue is empty, which means there are no pending outputs to patch (see the sketch after this list).
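A minimal sketch of the guard described above, assuming the model runner keeps cached_step_inputs/cached_step_outputs lists and that _patch_prev_output can reach a scheduler context exposing an output_queue; the class names and signatures here are illustrative assumptions, not the exact code in the repository.

```python
# Hypothetical, simplified sketch of the guard described in this PR.
# Names (SchedulerContext, cached_step_inputs/outputs, _patch_prev_output)
# mirror the description above; exact signatures are assumptions.
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List


@dataclass
class SchedulerContext:
    # Outputs waiting to be patched with the delayed-sampled token.
    output_queue: Deque[Any] = field(default_factory=deque)


class ModelRunnerSketch:
    def __init__(self) -> None:
        # Inputs/outputs cached across steps for delayed sampling.
        self.cached_step_inputs: List[Any] = []
        self.cached_step_outputs: List[Any] = []

    def _patch_prev_output(self, ctx: SchedulerContext) -> None:
        # If there is nothing left to patch, the previous (batch) request
        # has fully finished; drop stale cache entries so the next call to
        # LLM.generate() / step() starts from a clean state.
        if not ctx.output_queue:
            self.cached_step_inputs.clear()
            self.cached_step_outputs.clear()
            return

        prev_output = ctx.output_queue.popleft()
        cached = self.cached_step_outputs.pop(0)
        # ... patch `prev_output` with the token sampled in `cached` ...
```

Clearing the caches on an empty output_queue avoids having to detect "last token of the batch" inside the model runner, which is the alternative ruled out in the list above.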

Tests here: https://github.com/habana-internal/mlperf_inference/pull/158

@attafosu attafosu requested a review from mswiniarsk February 20, 2025 16:31
@tianmu-li tianmu-li merged commit 6eeefdd into HabanaAI:mlperf_features Feb 20, 2025
3 of 22 checks passed