
Debug the optimal upper-bound performance for swapping (0-cost swapping). #46

Closed
zhuohan123 opened this issue Apr 22, 2023 · 4 comments
Labels: performance (Performance-related issues), stale

@zhuohan123 (Member)

Rerun the experiment comparing 0-cost swapping and recomputation. Recomputation should not be faster in any case; if recomputation is consistently faster, we should dig into why.
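
A minimal A/B measurement sketch for this experiment (the `preemption_mode` knob, the model choice, and the memory settings below are assumptions for illustration; the vLLM version this issue targets may expose the swap/recompute choice differently):

```python
# Hypothetical benchmark: time end-to-end generation under swap-based vs
# recomputation-based preemption. With truly 0-cost swapping, the swap run
# should never be slower than the recompute run.
import time

from vllm import LLM, SamplingParams

PROMPTS = ["Hello, my name is"] * 256        # enough concurrent requests to force preemption
PARAMS = SamplingParams(temperature=0.0, max_tokens=128)


def run(preemption_mode: str) -> float:
    # `preemption_mode` ("swap" or "recompute") is an assumed engine argument.
    llm = LLM(
        model="facebook/opt-13b",
        swap_space=16,                       # GiB of CPU swap space per GPU
        gpu_memory_utilization=0.9,
        preemption_mode=preemption_mode,
    )
    start = time.perf_counter()
    llm.generate(PROMPTS, PARAMS)
    return time.perf_counter() - start


if __name__ == "__main__":
    # In practice, run each mode in a separate process so the second engine
    # starts from a clean GPU.
    t_swap = run("swap")
    t_recompute = run("recompute")
    print(f"swap: {t_swap:.1f}s  recompute: {t_recompute:.1f}s")
```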

@hmellor (Collaborator) commented Mar 6, 2024

@zhuohan123 is this work still planned, or can the issue be closed?

@hmellor (Collaborator) commented Apr 18, 2024

@WoosukKwon?

@DarkLight1337 added the performance (Performance-related issues) label on May 31, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this issue Jun 10, 2024
Sync with upstream@v0.4.3-53-g89c92078
fxmarty pushed a commit to fxmarty/vllm-public that referenced this issue Jun 12, 2024
…lm-project#46)

* Update fp8_gemm_tuner.py: exchange the import order of torch and hipbsolidxgemm

ImportError: libc10.so: cannot open shared object file: No such file or directory

https://stackoverflow.com/a/65710714

* run isort on fp8_gemm_tuner.py

* add # isort: split

* fix yapf

---------

Co-authored-by: charlifu <charlifu@amd.com>
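
For context, the fix above amounts to importing `torch` before the ROCm GEMM extension so that `libc10.so` is already loaded when the extension's native library is resolved. A rough sketch of the resulting import block (only `torch`, `hipbsolidxgemm`, and the `# isort: split` marker come from the commit; everything else is assumed):

```python
# fp8_gemm_tuner.py (sketch): torch must be imported first so that its bundled
# shared libraries (libc10.so among them) are loaded before the native
# extension that links against them.
import torch  # noqa: F401

# isort: split
import hipbsolidxgemm  # noqa: F401  # ROCm GEMM extension linked against libc10.so
```
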
yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
…t#46)

Summary: Add a benchmarking workflow and action that runs the benchmarks on a manual trigger.

Test:
Try it locally.
Successful GHA Benchmark Run -
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/8019392326

---------

Co-authored-by: varun <varun@varuns-MacBook-Pro.local>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
JHLEE17 pushed a commit to JHLEE17/vllm that referenced this issue Aug 1, 2024
* Fix setup.py for HPU

* Fix  vllm._C import ops -> vllm.hpu import ops

* more of the same thing

* re-add hpex rmsnorm and rope; but rope is crashing

* remove unnecessary comments

* add vllm/hpu files

* add hpu autodetection

* Add HabanaAttention stub

* revert accidental changes

* revert non-habana backend attention changes

* add habana attention/worker/executor, sampling fails now

* Restore unnecessarily changed files

* enable HabanaMemoryProfiler

* Make sampler pass

* restore habana fused rope

* prefill is now working!!!

* fix prefill padding; decode is now working!!!!!

* revert accidental changes

* remove unused stuff in habana_paged_attn.py

* remove diagnostic stuff from llm_engine.py

* use HabanaExecutorAsync in async_llm_engine.py

* add habana copyright headers to habana_*.py files

* fix prefill attention conformance

* minor naming fixes

* remove naive attention from habana_attn (it never worked anyway)

* re-enable profile run

* Add fake HPUGraph support

* add more metrics

* indentation fix

* ~~recipe cache metrics don't work lalalala~~

* i'm done with metrics for now

* fix corner case in which hl-smi is not available but synapse is

* FIXME: temporary setup.py workaround

* WIP: add tensor parallelism stubs

* habana worker cleanup

* tensor parallelism is now working

* remove unused files

* remove unused func

* add hpugraphrunner

* improve hpu layernorm

* Port pipelined PA

* Port context length bucketing

* remove cudagraphrunner from hpu runner

* restore HPUGraphRunner back from FakeHPUGraphRunner

* handle rotary embeddings properly on gaudi3

* oopsie! captured_block_counts was incorrect!

* captured_block_counts.append doesn't do anything

* Restore habana_main KV cache memory layout

* fix memory profiler

* overhaul hpugraph capture

* Enable attention tests

* Add generic changes

* Enable activation tests

* Enable cache tests: reshape & cache

* Enable layernorm tests

* Decouple reshape_and_cache prompt and decode tests and change slot mapping generation in prompt tests

* Decrease max seq len in attention UTs

* Enable pos_encoding tests

* Enable cache copy tests

* Remove gpu migration from unit tests

* skip incompatible on HPU tests

* Fix noisy lines

* Update sampling_metadata.py

Outdated changes

* Update test_cache.py; fix code style

* fix attention test after rebase

* disable rotary embedding tests for hpu

* restore original rotary embedding tests

* disable multiple sampling test

* disable all metrics tests

* disable some models tests

* disable some sampler tests

* restore recently disabled tests

---------

Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: Tomasz Krupa <tkrupa@habana.ai>
Co-authored-by: Artur Fierka <afierka@habana.ai>
@alixiaodi mentioned this issue on Aug 2, 2024
github-actions (bot) commented

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions bot added the stale label on Oct 31, 2024
github-actions (bot) commented

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions bot closed this as not planned on Nov 30, 2024