Tokenization with truncation, offset support and tests #47
Conversation
Signed-off-by: Jefferson Fialho <jfialho@ibm.com>
Force-pushed from 6a521af to 4fbef92 (Compare)
batch_encoding = self.tokenizer.encode_plus(
    text=req.text,
    return_offsets_mapping=request.return_offsets
)
@njhill, here the code calls encode_plus directly because adding an async wrapper would require changing the vLLM code. But I don't think it has a performance impact, because the actual encoding in encode_async is not offloaded to a thread pool or anything:
async def encode_async(
        self,
        prompt: str,
        request_id: Optional[str] = None,
        lora_request: Optional[LoRARequest] = None) -> List[int]:
    tokenizer = await self.get_lora_tokenizer_async(lora_request)
    ret = tokenizer.encode(prompt)
    self._raise_if_input_too_long(ret, lora_request)
    return ret
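For reference, here is a minimal sketch (not the exact PR code) of how a direct `encode_plus` call with truncation and offset mapping could look; the tokenizer and the `truncate_to` / `return_offsets` parameter names are assumptions for illustration only:

```python
from typing import Optional

from transformers import AutoTokenizer

# Illustrative sketch: a plain Hugging Face fast tokenizer standing in for the
# tokenizer held by vLLM's tokenizer group.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_text(text: str, truncate_to: Optional[int], return_offsets: bool):
    kwargs = {"return_offsets_mapping": return_offsets}
    if truncate_to:
        # encode_plus honors truncation/max_length just like __call__
        kwargs.update(truncation=True, max_length=truncate_to)
    batch_encoding = tokenizer.encode_plus(text=text, **kwargs)
    # offset_mapping is only present when return_offsets_mapping=True
    return batch_encoding["input_ids"], batch_encoding.get("offset_mapping")
```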
for 515
@prashantgupta24 can you help coordinate getting this into the tgis adapter?
    tokenized results.
    """
    # Log the incoming tokenization request for metrics
    service_metrics.observe_tokenization_request(request)
Suggested change:
- service_metrics.observe_tokenization_request(request)
+ service_metrics.count_tokenization_request(request)
I'm going to be making this change in the adapter repo
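For illustration, a hedged sketch of what a counter-style metrics helper could look like; the class, metric, and method names here are assumptions, not the actual adapter code:

```python
from prometheus_client import Counter

class ServiceMetrics:
    """Assumed shape of the metrics helper, for illustration only."""

    def __init__(self) -> None:
        self.tokenization_requests = Counter(
            "tokenization_requests_total",
            "Number of tokenization requests received",
        )

    def count_tokenization_request(self, request) -> None:
        # A plain counter increment; nothing about the request is observed in
        # the Histogram/Summary sense, hence the count_* naming.
        self.tokenization_requests.inc()
```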
Closing since this was ported to the vllm-tgis-adapter repo. Thanks @fialhocoelho.
…U issue (#47)

This PR contains some changes:
- It enables the `only_last_token` option when calling FMS. This only copies the logits from the last token back in the first iteration, which leads to faster TTFT. Will need to check whether this creates issues for some vLLM features like `prompt_logprobs` at a later stage.
- There is a bug in torch 2.3.1 which gives incorrect results on CPU when using prompts of length > 512 tokens. Setting `attn_algorithm=math` works around this for the time being.
- Change the mask format to what FMS is currently using (note: I have not tested the right-padding part of the code, since we are not using it).
- Load the model with the precision specified by the vLLM user (rather than hard-coded).
- Remove some unused stuff.
- Refresh aiu_setup to the latest version from aiu-fms.

---------

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
vllm/tests/test_server_tokenize_truncate.py
`tokenizer.encode_plus()`: The `tokenizer.encode_plus()` function is used directly to avoid modifying the `tokenizer_group.encode_async` function, which is part of the upstream code/project. This ensures compatibility and maintains the integrity of the upstream codebase. This implementation won't affect performance because `tokenizer_group.encode_async` doesn't run `Tokenizer.encode` in a background thread.
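As a side note, if the blocking `encode_plus` call ever did become a bottleneck, it could be offloaded to a thread explicitly. This is a minimal sketch under that assumption, not something the upstream `encode_async` currently does:

```python
import asyncio

async def encode_plus_offloaded(tokenizer, text: str, return_offsets: bool):
    # Run the synchronous Hugging Face call in the default thread pool so the
    # event loop is not blocked. Upstream encode_async does not do this today,
    # which is why calling encode_plus directly is performance-equivalent.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None,
        lambda: tokenizer.encode_plus(text=text, return_offsets_mapping=return_offsets),
    )
```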