Debug the optimal upper-bound performance for swapping (0-cost swapping) #46

Rerun the experiment comparing 0-cost swapping and recomputation. Recomputation should not be faster in any case; if recomputation is consistently faster, we should investigate why.
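To make the expected outcome concrete, here is a back-of-envelope cost model — a minimal sketch under stated assumptions, not a measurement of vLLM itself. Swapping moves a sequence's KV-cache blocks across PCIe, while recomputation re-runs the prefill on the GPU. The helper names (`swap_time_s`, `recompute_time_s`) and all constants (layer count, hidden size, PCIe bandwidth, GPU throughput) are hypothetical, roughly sized for a 7B-class fp16 model:

```python
# Back-of-envelope comparison of swap-in cost vs. recomputation cost for a
# preempted sequence. All constants are illustrative assumptions, not vLLM
# measurements.

KV_BYTES_PER_TOKEN = 2 * 32 * 4096 * 2  # (K and V) * 32 layers * 4096 hidden * 2 bytes (fp16)
PCIE_BANDWIDTH = 25e9                   # bytes/s, assumed effective host-to-device bandwidth
MODEL_PARAMS = 7e9                      # assumed 7B-parameter model
GPU_THROUGHPUT = 150e12                 # FLOP/s, assumed achievable prefill throughput


def swap_time_s(num_tokens: int) -> float:
    """Time to copy the sequence's KV cache from CPU back to the GPU."""
    return num_tokens * KV_BYTES_PER_TOKEN / PCIE_BANDWIDTH


def recompute_time_s(num_tokens: int) -> float:
    """Time to re-run prefill; ~2 * params FLOPs per token in a matmul-bound pass."""
    return num_tokens * 2 * MODEL_PARAMS / GPU_THROUGHPUT


for n in (128, 512, 2048, 8192):
    print(f"{n:5d} tokens: swap-in {swap_time_s(n) * 1e3:8.2f} ms, "
          f"recompute {recompute_time_s(n) * 1e3:8.2f} ms")
```

Under these assumptions, swap-in is cheaper than recomputation at every sequence length, and with fully overlapped (0-cost) swapping the effective swap cost drops toward zero — so recomputation winning consistently would indeed point to an implementation problem rather than a fundamental tradeoff.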
Comments
@zhuohan123 is this work still planned or can the issue be closed?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!