I am currently investigating the performance of vLLM in long-text scenarios (around 16k tokens).
I am running inference tests with vLLM on 1 to 2 NVIDIA A6000 GPUs (48 GB each).
The test model is vicuna-13b-v1.5-16k.
The primary focus is on comparing single-GPU inference and Tensor Parallelism (TP); my test setup is roughly the sketch below.
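For reference, this is approximately how I launch the model with the offline `LLM` API (assuming a vLLM version that exposes `max_model_len`); the model path, memory fraction, and prompts are placeholders for my local configuration, not exact values.

```python
from vllm import LLM, SamplingParams

# Roughly my test setup (paths/values are placeholders for my local config).
llm = LLM(
    model="lmsys/vicuna-13b-v1.5-16k",   # 16k-context Vicuna checkpoint
    tensor_parallel_size=2,              # set to 1 for the single-GPU runs
    max_model_len=16384,                 # request the full 16k context
    gpu_memory_utilization=0.90,         # fraction of VRAM vLLM may reserve
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# `long_prompts` stands in for my list of ~9-16k-token test prompts.
long_prompts = ["..."]
outputs = llm.generate(long_prompts, sampling_params)
for out in outputs:
    print(len(out.prompt_token_ids), out.outputs[0].text[:80])
```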
Based on my observations so far, the maximum acceptable input length on a single GPU is approximately 9-10k tokens. With two GPUs and TP=2, the maximum input length does not increase significantly; however, the maximum batch size improves from 1 to 5.
I'm not sure whether these results match the expected performance of vLLM, as my understanding of its core mechanisms is still limited.
Additionally, as a point for exploration and discussion: what are the possible ways, within the vLLM framework, to extend the acceptable input length?
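For concreteness, these are the knobs I am aware of that seem related to how much memory is left for the KV cache (parameter names taken from vLLM's engine arguments; whether they actually let 16k inputs fit on 48 GB cards is exactly what I am unsure about):

```python
# Variations I am considering; I have not verified that all of them
# extend the usable input length in practice.
llm = LLM(
    model="lmsys/vicuna-13b-v1.5-16k",
    tensor_parallel_size=2,
    max_model_len=16384,
    gpu_memory_utilization=0.95,  # give a larger VRAM fraction to the KV cache
    swap_space=8,                 # GiB of CPU swap space for preempted sequences
    dtype="float16",
)
```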