
Debug the optimal upper-bound performance for swapping (0-cost swapping). #46

Closed
zhuohan123 opened this issue Apr 22, 2023 · 4 comments
Labels: performance (Performance-related issues), stale

@zhuohan123 (Member)

Rerun the experiment comparing 0-cost swapping and recomputation. Recomputation should not be faster in any case; if recomputation is consistently faster, we should dig into why.
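
A minimal A/B measurement sketch for this experiment (the `preemption_mode` knob, the model choice, and the memory settings below are assumptions for illustration; the vLLM version this issue targets may expose the swap/recompute choice differently):

```python
# Hypothetical benchmark: time end-to-end generation under swap-based vs
# recomputation-based preemption. With truly 0-cost swapping, the swap run
# should never be slower than the recompute run.
import time

from vllm import LLM, SamplingParams

PROMPTS = ["Hello, my name is"] * 256        # enough concurrent requests to force preemption
PARAMS = SamplingParams(temperature=0.0, max_tokens=128)


def run(preemption_mode: str) -> float:
    # `preemption_mode` ("swap" or "recompute") is an assumed engine argument.
    llm = LLM(
        model="facebook/opt-13b",
        swap_space=16,                       # GiB of CPU swap space per GPU
        gpu_memory_utilization=0.9,
        preemption_mode=preemption_mode,
    )
    start = time.perf_counter()
    llm.generate(PROMPTS, PARAMS)
    return time.perf_counter() - start


if __name__ == "__main__":
    # In practice, run each mode in a separate process so the second engine
    # starts from a clean GPU.
    t_swap = run("swap")
    t_recompute = run("recompute")
    print(f"swap: {t_swap:.1f}s  recompute: {t_recompute:.1f}s")
```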

@hmellor (Collaborator) commented Mar 6, 2024

@zhuohan123 is this work still planned, or can the issue be closed?

@hmellor (Collaborator) commented Apr 18, 2024

@WoosukKwon?

@DarkLight1337 added the performance (Performance-related issues) label on May 31, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this issue Jun 10, 2024
Sync with upstream@v0.4.3-53-g89c92078
fxmarty pushed a commit to fxmarty/vllm-public that referenced this issue Jun 12, 2024
…lm-project#46)

* Update fp8_gemm_tuner.py: exchange the import order of torch and hipbsolidxgemm

ImportError: libc10.so: cannot open shared object file: No such file or directory

https://stackoverflow.com/a/65710714

* run isort on fp8_gemm_tuner.py

* add # isort: split

* fix yapf

---------

Co-authored-by: charlifu <charlifu@amd.com>
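
For context, the fix above amounts to importing `torch` before the ROCm GEMM extension so that `libc10.so` is already loaded when the extension's native library is resolved. A rough sketch of the resulting import block (only `torch`, `hipbsolidxgemm`, and the `# isort: split` marker come from the commit; everything else is assumed):

```python
# fp8_gemm_tuner.py (sketch): torch must be imported first so that its bundled
# shared libraries (libc10.so among them) are loaded before the native
# extension that links against them.
import torch  # noqa: F401

# isort: split
import hipbsolidxgemm  # noqa: F401  # ROCm GEMM extension linked against libc10.so
```
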
yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
…t#46)

Summary: Add a benchmarking workflow and action that runs the benchmarks on a manual trigger.

Test:
Try it locally.
Successful GHA Benchmark Run -
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/8019392326

---------

Co-authored-by: varun <varun@varuns-MacBook-Pro.local>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
JHLEE17 pushed a commit to JHLEE17/vllm that referenced this issue Aug 1, 2024
* Fix setup.py for HPU

* Fix  vllm._C import ops -> vllm.hpu import ops

* more of the same thing

* re-add hpex rmsnorm and rope; but rope is crashing

* remove unnecessary comments

* add vllm/hpu files

* add hpu autodetection

* Add HabanaAttention stub

* revert accidental changes

* revert non-habana backend attention changes

* add habana attention/worker/executor, sampling fails now

* Restore unnecessarily changed files

* enable HabanaMemoryProfiler

* Make sampler pass

* restore habana fused rope

* prefill is now working!!!

* fix prefill padding; decode is now working!!!!!

* revert accidental changes

* remove unused stuff in habana_paged_attn.py

* remove diagnostic stuff from llm_engine.py

* use HabanaExecutorAsync in async_llm_engine.py

* add habana copyright headers to habana_*.py files

* fix prefill attention conformance

* minor naming fixes

* remove naive attention from habana_attn (it never worked anyway)

* re-enable profile run

* Add fake HPUGraph support

* add more metrics

* indentation fix

* ~~recipe cache metrics don't work lalalala~~

* i'm done with metrics for now

* fix corner case in which hl-smi is not available but synapse is

* FIXME: temporary setup.py workaround

* WIP: add tensor parallelism stubs

* habana worker cleanup

* tensor parallelism is now working

* remove unused files

* remove unused func

* add hpugraphrunner

* improve hpu layernorm

* Port pipelined PA

* Port context length bucketing

* remove cudagraphrunner from hpu runner

* restore HPUGraphRunner back from FakeHPUGraphRunner

* handle rotary embeddings properly on gaudi3

* oopsie! captured_block_counts was incorrect!

* captured_block_counts.append doesn't do anything

* Restore habana_main KV cache memory layout

* fix memory profiler

* overhaul hpugraph capture

* Enable attention tests

* Add generic changes

* Enable activation tests

* Enable cache tests: reshape & cache

* Enable layernorm tests

* Decouple reshape_and_cache prompt and decode tests and change slot mapping generation in prompt tests

* Decrease max seq len in attention UTs

* Enable pos_encoding tests

* Enable cache copy tests

* Remove gpu migration from unit tests

* skip incompatible on HPU tests

* Fix noisy lines

* Update sampling_metadata.py

Outdated changes

* Update test_cache.py; fix code style

* fix attention test after rebase

* disable rotary embedding tests for hpu

* restore original rotary embedding tests

* disable multiple sampling test

* disable all metrics tests

* disable some models tests

* disable some sampler tests

* restore recently disabled tests

---------

Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: Tomasz Krupa <tkrupa@habana.ai>
Co-authored-by: Artur Fierka <afierka@habana.ai>
@alixiaodi mentioned this issue on Aug 2, 2024
github-actions (bot) commented

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions bot added the stale label on Oct 31, 2024
github-actions (bot) commented

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions bot closed this as not planned on Nov 30, 2024