
Fixes delayed sampling for sequential requests #845

Merged

Conversation


@attafosu attafosu commented Feb 20, 2025

  • Previously, LLM.generate() could not be called multiple times with delayed sampling enabled; the same problem occurred with repeated step() calls.
  • The issue occurs after the last (batch) request finishes and a new request starts: cached_step_inputs and cached_step_outputs still contain elements saved from the previously served (batch) request, which should not be the case.
  • The cleanest solution would be to skip appending to cached_step_inputs/cached_step_outputs when the most recently generated output is the final token for the current batch request, but I could not find a clean way to check for this in the model runner.
  • Instead, we check (in _patch_prev_output) whether the scheduler context's output_queue is empty, which means there are no pending outputs to patch (see the sketch after this list).
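A minimal sketch of the guard described above, assuming the model runner keeps cached_step_inputs/cached_step_outputs lists and that _patch_prev_output can reach a scheduler context exposing an output_queue; the class names and signatures here are illustrative assumptions, not the exact code in the repository.

```python
# Hypothetical, simplified sketch of the guard described in this PR.
# Names (SchedulerContext, cached_step_inputs/outputs, _patch_prev_output)
# mirror the description above; exact signatures are assumptions.
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List


@dataclass
class SchedulerContext:
    # Outputs waiting to be patched with the delayed-sampled token.
    output_queue: Deque[Any] = field(default_factory=deque)


class ModelRunnerSketch:
    def __init__(self) -> None:
        # Inputs/outputs cached across steps for delayed sampling.
        self.cached_step_inputs: List[Any] = []
        self.cached_step_outputs: List[Any] = []

    def _patch_prev_output(self, ctx: SchedulerContext) -> None:
        # If there is nothing left to patch, the previous (batch) request
        # has fully finished; drop stale cache entries so the next call to
        # LLM.generate() / step() starts from a clean state.
        if not ctx.output_queue:
            self.cached_step_inputs.clear()
            self.cached_step_outputs.clear()
            return

        prev_output = ctx.output_queue.popleft()
        cached = self.cached_step_outputs.pop(0)
        # ... patch `prev_output` with the token sampled in `cached` ...
```

Clearing the caches on an empty output_queue avoids having to detect "last token of the batch" inside the model runner, which is the alternative ruled out in the list above.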

Tests here: https://github.com/habana-internal/mlperf_inference/pull/158

@attafosu attafosu requested a review from mswiniarsk February 20, 2025 16:31
@tianmu-li tianmu-li merged commit 6eeefdd into HabanaAI:mlperf_features Feb 20, 2025
3 of 22 checks passed