I am currently investigating the performance of vLLM in long-text scenarios (around 16k tokens).
I am running inference tests with vLLM on 1 to 2 NVIDIA A6000 GPUs (48 GB each).
The test model is vicuna-13b-v1.5-16k.
The primary focus is on comparing single-GPU inference and Tensor Parallelism (TP); my test setup is roughly the sketch below.
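For reference, this is approximately how I launch the model with the offline `LLM` API (assuming a vLLM version that exposes `max_model_len`); the model path, memory fraction, and prompts are placeholders for my local configuration, not exact values.

```python
from vllm import LLM, SamplingParams

# Roughly my test setup (paths/values are placeholders for my local config).
llm = LLM(
    model="lmsys/vicuna-13b-v1.5-16k",   # 16k-context Vicuna checkpoint
    tensor_parallel_size=2,              # set to 1 for the single-GPU runs
    max_model_len=16384,                 # request the full 16k context
    gpu_memory_utilization=0.90,         # fraction of VRAM vLLM may reserve
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# `long_prompts` stands in for my list of ~9-16k-token test prompts.
long_prompts = ["..."]
outputs = llm.generate(long_prompts, sampling_params)
for out in outputs:
    print(len(out.prompt_token_ids), out.outputs[0].text[:80])
```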
Based on my observations so far, the maximum acceptable input length on a single GPU is approximately 9-10k tokens. With two GPUs and TP=2, the maximum input length does not increase significantly; however, the maximum batch size improves from 1 to 5.
I'm not sure whether these results match the expected performance of vLLM, as my understanding of its core mechanisms is still limited.
Additionally, as a point for exploration and discussion: what are the possible ways, within the vLLM framework, to extend the acceptable input length?
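For concreteness, these are the knobs I am aware of that seem related to how much memory is left for the KV cache (parameter names taken from vLLM's engine arguments; whether they actually let 16k inputs fit on 48 GB cards is exactly what I am unsure about):

```python
# Variations I am considering; I have not verified that all of them
# extend the usable input length in practice.
llm = LLM(
    model="lmsys/vicuna-13b-v1.5-16k",
    tensor_parallel_size=2,
    max_model_len=16384,
    gpu_memory_utilization=0.95,  # give a larger VRAM fraction to the KV cache
    swap_space=8,                 # GiB of CPU swap space for preempted sequences
    dtype="float16",
)
```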