How to Truncate the input prompt? #2443
-
We are using the Vicuna model with the maximum length set to 4096:
llm = LLM(model="lmsys/vicuna-7b-v1.5", max_model_len=4096, max_num_batched_tokens=4096, tensor_parallel_size=2)
We are sending input prompts of more than 10K tokens and want them truncated to 4096, but we get the error "Input prompt (25597 tokens) is too long and exceeds limit of 4096". How can we enable truncation in vLLM?
Output of get_tokenizer():
Thanks.
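For reference, a minimal sketch of what we are running; the long prompt itself is just a placeholder here, and the default SamplingParams are assumed:
from vllm import LLM, SamplingParams

llm = LLM(model="lmsys/vicuna-7b-v1.5", max_model_len=4096, max_num_batched_tokens=4096, tensor_parallel_size=2)
long_prompt = "..."  # placeholder for our >10K-token input
# fails with: Input prompt (25597 tokens) is too long and exceeds limit of 4096
outputs = llm.generate(long_prompt, SamplingParams())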
-
Same question.
-
You can encode the prompt with the tokenizer yourself, truncate the token ids, and pass them to generate:
llm = LLM(model="lmsys/vicuna-7b-v1.5", max_model_len=4096, max_num_batched_tokens=4096, tensor_parallel_size=2)
tokenizer = llm.get_tokenizer()
# Encode to a plain list of token ids (no return_tensors; generate expects lists, not tensors)
prompt_token_ids = tokenizer.encode("<PROMPT>")
# Truncate: keep only the last MAX_PROMPT_TOKEN tokens
prompt_token_ids = prompt_token_ids[-MAX_PROMPT_TOKEN:]
# generate takes a list of token-id lists, one entry per prompt
llm.generate(prompt_token_ids=[prompt_token_ids])
Thank you for your correction! llm.generate accepts a List[List[int]] instead of a torch.Tensor.
This code was tested in my environment with the latest main branch, and it should work.
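For anyone landing here later, below is a minimal end-to-end sketch combining the steps above. The MAX_PROMPT_TOKENS value and the SamplingParams settings are illustrative choices, not requirements; the truncation limit is set somewhat below max_model_len so the generated tokens also fit inside the 4096-token window.
from vllm import LLM, SamplingParams

MAX_PROMPT_TOKENS = 3840  # illustrative: leave headroom under max_model_len=4096 for generated tokens

llm = LLM(model="lmsys/vicuna-7b-v1.5", max_model_len=4096, max_num_batched_tokens=4096, tensor_parallel_size=2)
tokenizer = llm.get_tokenizer()

long_prompt = "..."  # placeholder for a prompt longer than the model window
token_ids = tokenizer.encode(long_prompt)   # plain List[int]
token_ids = token_ids[-MAX_PROMPT_TOKENS:]  # keep only the most recent tokens

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(prompt_token_ids=[token_ids], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)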