Fixes delayed sampling for sequential requests (#845)
- Previously, `LLM.generate()` could not be called multiple times with delayed sampling enabled; the same applied to repeated `step()` calls.
- The issue occurs when, after the last (batch) request has finished and a new request is starting, `cached_step_inputs` and `cached_step_outputs` still contain elements saved from the previously served (batch) request. This should not be the case.
- The cleanest solution would be to skip appending to [`cached_step_inputs/outputs`](https://github.com/HabanaAI/vllm-fork/blob/50b28af6491ed6eb75794d4968fe1c679e65ea92/vllm/worker/hpu_model_runner.py#L2610-L2611) when the recently generated [`output`](https://github.com/HabanaAI/vllm-fork/blob/50b28af6491ed6eb75794d4968fe1c679e65ea92/vllm/worker/hpu_model_runner.py#L2608) is the final token generated for the current batch request, but there is no clean way to check for this in the model runner.
- Instead, we check (in [`_patch_prev_output`](https://github.com/HabanaAI/vllm-fork/blob/50b28af6491ed6eb75794d4968fe1c679e65ea92/vllm/worker/hpu_model_runner.py#L2776)) whether the scheduler context has an empty `output_queue`, which means there are no pending outputs to patch (see the sketch below).

Tests here: habana-internal/mlperf_inference#158
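The guard described above can be sketched roughly as follows. This is a minimal illustration with simplified stand-ins for the runner state and scheduler context; apart from `_patch_prev_output`, `cached_step_inputs`, `cached_step_outputs`, and `output_queue`, the names and signatures are hypothetical and not the exact vllm-fork API.

```python
# Minimal sketch, not the actual hpu_model_runner implementation.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SchedulerContext:
    # Outputs from previous steps that still need their delayed-sampled
    # token ids patched in. Empty means nothing is pending.
    output_queue: deque = field(default_factory=deque)


@dataclass
class ModelRunnerState:
    cached_step_inputs: list = field(default_factory=list)
    cached_step_outputs: list = field(default_factory=list)

    def _patch_prev_output(self, ctx: SchedulerContext) -> None:
        # If the scheduler context has no queued outputs, the previous
        # (batch) request has fully finished; any cached step
        # inputs/outputs are stale leftovers and must not be patched
        # into the new request, so drop them and return early.
        if not ctx.output_queue:
            self.cached_step_inputs.clear()
            self.cached_step_outputs.clear()
            return
        # ...otherwise patch the previously delayed sampled tokens into
        # the oldest queued output as usual.
```

With a check like this, a new `LLM.generate()` (or `step()`) call issued after the previous batch has completed starts from clean cached state instead of patching stale outputs from the earlier request.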