Support disaggregated prefill ? #708

artetaout · 2025-01-16T08:05:44Z

I saw that your code refers to PD (prefill/decode) disaggregation. Could you tell me how to use it?

hiworldwzj · 2025-01-23T02:48:15Z

Demo start arguments:

pd master

python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
--run_mode "pd_master" \
--host $(hostname -i) \
--port 60011

prefill node

nvidia-cuda-mps-control -d
KV_TRANS_USE_P2P=1 LOADWORKER=1 python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
--run_mode "prefill" \
--host $(hostname -i) \
--port 8017 \
--tp 4 \
--nccl_port 2732 \
--max_total_token_num 400000 \
--tokenizer_mode fast \
--pd_master_ip 10.121.4.14 \
--pd_master_port 60011 \
--use_dynamic_prompt_cache \
--max_req_total_len 16000 \
--running_max_req_size 128 \
--disable_cudagraph

decode node

nvidia-cuda-mps-control -d
CUDA_VISIBLE_DEVICES=4,5,6,7 KV_TRANS_USE_P2P=1 LOADWORKER=10 python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
--run_mode "decode" \
--host $(hostname -i) \
--port 8118 \
--nccl_port 12322 \
--tp 4 \
--max_total_token_num 400000 \
--graph_max_len_in_batch 2048 \
--graph_max_batch_size 16 \
--tokenizer_mode fast \
--pd_master_ip 10.121.4.14 \
--pd_master_port 60011 \
--use_dynamic_prompt_cache

Note that not all models and run modes support PD.
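
In the demo commands, the prefill and decode nodes both pass the same --pd_master_ip and --pd_master_port, and the port matches the --port the pd master listens on. A minimal sketch of keeping those values in one place, assuming a POSIX shell; the variable names and the pd_flags helper are hypothetical and not part of lightllm:

```shell
#!/bin/sh
# Shared settings, copied from the demo commands in this thread.
PD_MASTER_IP=10.121.4.14
PD_MASTER_PORT=60011

# Hypothetical helper: prints the two flags that the prefill and decode
# nodes both need so that they point at the same pd master.
pd_flags() {
    printf '%s %s %s %s' \
        --pd_master_ip "$PD_MASTER_IP" \
        --pd_master_port "$PD_MASTER_PORT"
}

# Show the flags that would be appended to the prefill/decode commands.
echo "shared flags: $(pd_flags)"
```

The output of pd_flags could then be appended to the prefill and decode launch commands (for example `... --run_mode "prefill" $(pd_flags) ...`), so a change to the master address only has to be made once.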
