Draft: Another attempt at v1 HPU integration #831

Draft
wants to merge 17 commits into base: habana_main

@kzawora-intel kzawora-intel commented Feb 14, 2025

Follow-up to the previous v1 PRs, but with the codebase up to date as of Feb 14. A lot has changed, and the existing older v1 code needs major adjustments too.
Not much is working so far compared to previous versions. This is very much WIP.

  • Implemented v1 HPU attn backend, worker, model_runner and executor
  • VLLM_USE_V1=1 properly selects V1 HPU components
  • V1 HPU executor loads model properly
  • V1 HPU executor allocates KV cache properly
  • V1 HPU model runner is constructed properly and initializes bucketing
  • V1 HPU attention backend gets selected automatically
  • profile_run works on dummy data
  • V1 HPU model_runner prepares input tensors based on SchedulerOutputs (rather than SequenceGroupMetadata)
  • V1 HPU model_runner differentiates prefill and decode sequences
  • V1 HPU model_runner execute_model runs for prefill
  • V1 HPU model_runner execute_model runs for decode
  • V1 HPU model_runner handles mixed-batch scenarios
  • V1 HPU model_runner prefill returns correct results
  • V1 HPU model_runner decode returns correct results (w/ flat PA)
  • V1 HPU model_runner decode returns correct results (w/ contiguous PA)
  • V1 HPU model_runner prefill runs at BS>1
  • V1 standard greedy and random sampling work on HPU
  • Capturing and replaying HPU Graphs work
  • Llama3.1-8B runs on GSM-8k with SOTA accuracy
  • V1 HPU model_runner warmup works properly
  • V1 HPU automatic prefix caching works properly
  • Tensor parallelism works
  • torch.compile works
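
To illustrate the "differentiates prefill and decode sequences" item above, here is a minimal sketch of how a model runner might partition scheduled requests. It is not the code from this PR: `RequestState`, its fields, and `split_prefill_decode` are hypothetical names, and the decode criterion (prompt fully computed, exactly one token scheduled) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestState:
    """Hypothetical per-request bookkeeping; not an actual vLLM class."""
    req_id: str
    num_prompt_tokens: int
    num_computed_tokens: int   # tokens whose KV entries are already cached
    num_scheduled_tokens: int  # tokens the scheduler granted for this step


def split_prefill_decode(requests):
    """Partition requests into (prefills, decodes).

    Assumption: a request counts as decode once its whole prompt has been
    computed and it is scheduled for exactly one new token; everything
    else is treated as prefill.
    """
    prefills, decodes = [], []
    for req in requests:
        prompt_done = req.num_computed_tokens >= req.num_prompt_tokens
        if prompt_done and req.num_scheduled_tokens == 1:
            decodes.append(req)
        else:
            prefills.append(req)
    return prefills, decodes


batch = [
    RequestState("a", num_prompt_tokens=8, num_computed_tokens=0,
                 num_scheduled_tokens=8),
    RequestState("b", num_prompt_tokens=4, num_computed_tokens=6,
                 num_scheduled_tokens=1),
]
prefills, decodes = split_prefill_decode(batch)
```

In a mixed batch, the two groups could then be padded to their own bucket shapes and run separately, which is one plausible reason to split them before building input tensors.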
