Add an option to launch cacheflow without ray #51

Merged · zhuohan123 merged 3 commits into main on Apr 30, 2023
Conversation

zhuohan123 (Member) commented on Apr 27, 2023

Fix #23
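
A minimal illustrative sketch (not the actual diff in this PR) of the pattern the PR introduces: only start Ray and wrap the worker as a Ray actor when Ray is requested; otherwise run the worker in-process. `Worker` and `init_worker` are hypothetical placeholder names.

```python
# Illustrative sketch only -- not this PR's code. Shows gating Ray behind a
# flag so a single-GPU run builds the worker in-process instead of as an actor.

class Worker:
    def __init__(self, rank: int = 0):
        self.rank = rank

    def execute(self, batch):
        # Placeholder for one model step over `batch`.
        return f"rank {self.rank} processed {len(batch)} sequences"


def init_worker(use_ray: bool):
    if use_ray:
        import ray

        ray.init(ignore_reinit_error=True)
        # The same class, wrapped as a Ray actor on one GPU.
        actor = ray.remote(num_gpus=1)(Worker).remote()
        return lambda batch: ray.get(actor.execute.remote(batch))
    # Single-GPU path: no Ray cluster, no actor RPC overhead.
    worker = Worker()
    return worker.execute


if __name__ == "__main__":
    run = init_worker(use_ray=False)
    print(run(["prompt-1", "prompt-2"]))
```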

@zhuohan123 zhuohan123 requested a review from WoosukKwon April 27, 2023 10:34

WoosukKwon (Collaborator) left a comment

Thanks for the PR. Left some comments. BTW, could you show the latency benchmarks before and after this PR?

Review comments (now outdated and resolved) on .gitignore and cacheflow/master/server.py
zhuohan123 (Member, Author) commented on Apr 29, 2023

Latency with Ray:

ubuntu@ray-zhuohan-cf-head-6dd317a2-compute:~/nfs/cacheflow/cacheflow/benchmark$ python benchmark_latency.py --model facebook/opt-13b --use-ray
Namespace(batch_size=8, block_size=16, dtype='half', input_len=32, max_num_batched_tokens=2560, max_num_sequences=256, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', n=1, output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1, use_beam_search=False, use_dummy_weights=False, use_ray=True)
2023-04-29 16:28:04,226 INFO worker.py:1622 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
# GPU blocks: 987, # CPU blocks: 1638
SamplingParams(n=1, temperature=1.0, top_p=1.0, use_beam_search=False, stop_token_ids=set(), max_num_steps=128, num_logprobs=0, context_window_size=None)
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:10<00:00,  3.55s/it]
Avg latency: 3.54734468460083 seconds

Latency without Ray:

ubuntu@ray-zhuohan-cf-head-6dd317a2-compute:~/nfs/cacheflow/cacheflow/benchmark$ python benchmark_latency.py --model facebook/opt-13b
Namespace(batch_size=8, block_size=16, dtype='half', input_len=32, max_num_batched_tokens=2560, max_num_sequences=256, model='facebook/opt-13b', model_path='~/.cacheflow/model_weights', n=1, output_len=128, pipeline_parallel_size=1, seed=0, swap_space=20, tensor_parallel_size=1, use_beam_search=False, use_dummy_weights=False, use_ray=False)
# GPU blocks: 987, # CPU blocks: 1638
SamplingParams(n=1, temperature=1.0, top_p=1.0, use_beam_search=False, stop_token_ids=set(), max_num_steps=128, num_logprobs=0, context_window_size=None)
Warm up step
Profile step: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:09<00:00,  3.32s/it]
Avg latency: 3.3214046160380044 seconds
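
For reference, the two runs above put average latency at 3.547 s with Ray and 3.321 s without it, a saving of roughly 0.23 s per batch (about 6–7%) when Ray is skipped in this single-GPU configuration.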

WoosukKwon (Collaborator) left a comment

Thanks!

zhuohan123 merged commit 4858f3b into main on Apr 30, 2023
zhuohan123 deleted the no-ray branch on May 24, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request on Jun 13, 2024:

Dockerfile.ubi: remove vllm-nccl workaround
yukavio pushed a commit to yukavio/vllm that referenced this pull request on Jul 3, 2024:

Summary:
In streaming mode, the vllm server returns a response as soon as a token is available. However, the responses are not incremental deltas; each one is an aggregate of all previous output. It is therefore sufficient to record only the last response (see the sketch below).

Test:
Manual testing

Co-authored-by: Varun <varun@neuralmagic.com>
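
A minimal sketch of the recording logic described in that commit message, using hypothetical names: since every streamed response already contains the full output so far, it is enough to keep the last one.

```python
def record_final_response(stream):
    """Keep only the last response from a stream of cumulative responses."""
    last = None
    for response in stream:
        # Each response aggregates everything generated so far,
        # so earlier chunks can be discarded.
        last = response
    return last


# Example with a fake cumulative stream:
chunks = ["Hello", "Hello, wor", "Hello, world!"]
assert record_final_response(chunks) == "Hello, world!"
```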

dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request on Jul 22, 2024:

Revert "Tune fused_moe_kernel for TP 1,2,4,8 and bf16 and fp16, updated moe kern…"
JHLEE17 pushed a commit to JHLEE17/vllm that referenced this pull request on Aug 1, 2024
alixiaodi mentioned this pull request on Aug 2, 2024
Linked issue closed by this pull request: Add an option to disable Ray when using a single GPU (#23)